Gephi forums

Re: A format for Graph Streaming

2010-07-13T12:10:22+01:00

For event identifiers you are right, I would just put it in the same JSON object:

CODE:

{
    "id": "xyz",
    "an": {
        "A": {
            "label": "Streaming Node A",
            "size": 2
        }
    }
}

In this way the event identifier cannot be confused with node/edge identifiers, and the absence of the "id" attribute should not be a problem when identifiers are not required.
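
A minimal sketch of how a receiver might read such an event, assuming one JSON object per event and an optional top-level "id" key. The function name parse_event is hypothetical and not part of any Gephi API:

```python
import json

def parse_event(text):
    """Split one event object into (event_id, operations).

    The optional top-level "id" key is the event identifier; every
    other key (for example "an") is treated as an operation type.
    """
    obj = json.loads(text)
    event_id = obj.pop("id", None)  # None when no identifier was sent
    return event_id, obj

with_id = '{"id": "xyz", "an": {"A": {"label": "Streaming Node A", "size": 2}}}'
without_id = '{"an": {"A": {"label": "Streaming Node A", "size": 2}}}'

print(parse_event(with_id))
print(parse_event(without_id))
```

Because "id" sits next to the operation keys rather than inside them, it cannot collide with a node or edge named "id".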

For filters it is a bit more difficult, and I think you are right that we could create special predefined filters for the most common filtering operations. As an example, the following JSON object should change all nodes with ids 1 to 100 to size 3:

CODE:

{
    "id": "xyz",
    "cn": {
        "filter": {
            "range-id": {
                "start-id": 1,
                "end-id": 100
            }
        },
        "attributes": {
            "size": 3
        }
    }
}

The problem with this format is that someone could confuse "filter" and "attributes" with object identifiers, and try to change nodes with these ids.
Other suggestions?
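
To make the proposal concrete, here is a rough sketch of how a receiver could apply the "cn" event with a "range-id" filter. It assumes integer node ids, inclusive bounds, and that "filter" and "attributes" are reserved keys; apply_change_nodes is a hypothetical helper:

```python
def apply_change_nodes(nodes, cn):
    """Apply a "cn" event that carries a "range-id" filter.

    nodes maps integer node ids to attribute dicts. Only the
    "range-id" filter from the post above is handled here.
    """
    rng = cn["filter"]["range-id"]
    start, end = rng["start-id"], rng["end-id"]
    changed = []
    for node_id, attrs in nodes.items():
        if start <= node_id <= end:  # both bounds inclusive (assumption)
            attrs.update(cn["attributes"])
            changed.append(node_id)
    return changed

nodes = {i: {"size": 1} for i in (1, 50, 100, 101)}
event = {
    "id": "xyz",
    "cn": {
        "filter": {"range-id": {"start-id": 1, "end-id": 100}},
        "attributes": {"size": 3},
    },
}
changed = apply_change_nodes(nodes, event["cn"])
print(changed, nodes[101])
```

The sketch only works because it treats "filter" and "attributes" as reserved words, which is exactly the ambiguity raised above: a node that really had the id "filter" could not be addressed in this layout.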

Statistics:Posted by panisson — 13 Jul 2010 12:10

Re: A format for Graph Streaming

2010-07-12T21:58:59+01:00

Hi!

For the event identifiers, I don't know if it is possible, but you could use the normal format and, when an event has an identifier, send the identifier in the same HTTP request, just before the event. Maybe something like:

CODE:

{"event-id":"xyz"}{"an":{"A":{"label":"Streaming Node A","size":2}}}

As I see it, a format for complex events, such as applying changes through filters, can be really difficult to design while not making much difference in traffic or performance. So I think you could create special events for the most common cases, for example deleting multiple nodes based on their id range. For other bulk changes, since I guess the JSON will normally be written by another program, sending a lot of events is no problem.
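
A sketch of how a receiver could read the two back-to-back objects from the example above. It assumes the request body is a plain string of concatenated JSON objects and that an object whose only key is "event-id" labels the next event; both helper names are hypothetical:

```python
import json

def iter_json_objects(payload):
    """Yield each top-level JSON value from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(payload):
        # raw_decode does not skip leading whitespace, so do it here
        while pos < len(payload) and payload[pos].isspace():
            pos += 1
        if pos >= len(payload):
            break
        obj, pos = decoder.raw_decode(payload, pos)
        yield obj

def pair_events(payload):
    """Attach a preceding {"event-id": ...} object to the event after it."""
    pending_id = None
    for obj in iter_json_objects(payload):
        if set(obj) == {"event-id"}:
            pending_id = obj["event-id"]
        else:
            yield pending_id, obj
            pending_id = None

body = '{"event-id":"xyz"}{"an":{"A":{"label":"Streaming Node A","size":2}}}'
events = list(pair_events(body))
print(events)
```

The receiver has to keep state between objects here, which is one practical cost of this layout compared with putting the identifier inside the event object.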

Statistics:Posted by eduramiba — 12 Jul 2010 21:58

Re: A format for Graph Streaming

2010-07-12T16:53:04+01:00

Hi all!

As Mathieu already pointed out, over the last few days I started on a simple JSON prototype. For the Java serialization I'm using the implementation available at json.org (http://www.json.org/java/index.html).

As I said, this is a preliminary implementation and is completely open to changes. It is just meant to give us some idea of how the events would appear in the JSON format. The description of the current implementation is available at http://wiki.gephi.org/index.php/Specifi ... ing_Format. I also listed some open problems and some ways to address them, and I would like to have your opinions:
- how to add support for filters
- how to support identifiers for events
- other improvements (array-type attributes, composite attributes, graph attributes, etc.)
I'll wait for your suggestions.

Statistics:Posted by panisson — 12 Jul 2010 16:53

Re: A format for Graph Streaming

2010-07-01T23:28:23+01:00

Have you looked at BSON (http://bsonspec.org/)? It's a binary version of JSON.
It's used as the format for MongoDB, and there are existing Java, Ruby and C++ libraries for serializing and deserializing it.

Statistics:Posted by tcc — 01 Jul 2010 23:28

Re: A format for Graph Streaming

2010-06-23T22:35:17+01:00

Good point, thanks

We discussed this today with Andre and decided to start with a JSON prototype, as our primary objective is portability. However, as was said, the serialization technology is not the real problem, and another library like protobuf could be used later and remain compatible, as long as the Java objects being serialized don't change.

So we have to find a library able to serialize to JSON. Any pointers?

Statistics:Posted by mbastian — 23 Jun 2010 22:35

Re: A format for Graph Streaming

2010-06-23T11:32:56+01:00

It seems fast but we must control the end-points. Clients are only available for Java and C++.

A criticism: http://blogs.tedneward.com/2008/07/11/S ... l+XML.aspx
"Protocol Buffers, as with any binary protocol format and/or RPC mechanism, are great for those situations where performance is critical and both ends of the system are well-known and controlled."

Some benchmarks: http://wiki.github.com/eishay/jvm-serializers/

Protocol Buffers/Thrift comparison: http://stackoverflow.com/questions/6931 ... ol-buffers

An interesting discussion: http://stackoverflow.com/questions/2966 ... -ejb-other

A Protocol Buffers plugin for NetBeans IDE: http://netbeans.dzone.com/news/intervie ... tocol-buff

But first, I think we need to address the main issues, because it's not only about serialization performance. I see:

- shared schema evolution and backward compatibility
- communication between clients in different languages
- transportation facilities
- scalability

Statistics:Posted by admin — 23 Jun 2010 11:32

Re: A format for Graph Streaming

2010-06-23T06:30:46+01:00

An idea about serialization: Protocol Buffers?

EDIT: And http://netbeans.dzone.com/news/intervie ... tocol-buff

Statistics:Posted by mbastian — 23 Jun 2010 06:30

Re: A format for Graph Streaming

2010-06-15T18:31:34+01:00

I was thinking a bit about all the synchronization strategies, and came up with some (ambitious) ideas, inspired by Google's BigTable and Chubby papers. I was thinking about this because I believe basic client-server synchronization would cause problems and limitations if we want to scale our ideas. Let me develop them and we will discuss these things together.

Google's infrastructure has a master instead of a server. The master acts as the directory and knows the exact status of every client. It knows when a client needs to refresh its data chunks and orders another client close to it to transfer the data. Data are thus transferred peer to peer, avoiding the server bottleneck. Google adds master election (with Chubby) and master replication to this system, but that is not important for us.
What we want to do is replicate the same graph data across a list of clients. The master is just a role, and nothing prevents the master machine from being a client as well.

The first thing we need is a way to know whether a client is out of date. Google uses a timestamp value and I think it's the best choice. If a client has an older timestamp than the most recent one, it has to be updated.
The second thing we need is to identify exactly which elements are out of date. For that we can associate a timestamp with each element identifier. When a client updates some elements (nodes, edges, attributes), it sends the list of modified identifiers to the master, and the master asks that client to transfer the new data to the other clients.
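
The bookkeeping described here can be sketched roughly as follows. This is only an illustration under stated assumptions: timestamps are plain integers, each element has a string identifier, and the Master class with its methods is hypothetical. No network transfer is modelled.

```python
class Master:
    """Directory of the latest known timestamp per element identifier."""

    def __init__(self):
        self.latest = {}  # element id -> newest timestamp seen

    def report(self, modified):
        """Record a client's updates; modified maps element id to timestamp."""
        for element_id, ts in modified.items():
            if ts > self.latest.get(element_id, -1):
                self.latest[element_id] = ts

    def stale_elements(self, client_view):
        """Return ids the client is missing or holds at an older timestamp."""
        return sorted(
            element_id
            for element_id, ts in self.latest.items()
            if client_view.get(element_id, -1) < ts
        )

master = Master()
master.report({"n1": 5, "n2": 7, "e1": 7})  # sent by the client that made changes
client_b = {"n1": 5, "n2": 3}               # what another client currently holds
stale = master.stale_elements(client_b)
print(stale)
```

The master only stores identifiers and timestamps, never the graph data itself, which matches the idea that data moves between clients rather than through a central server.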

I think this is a flexible system that could work with most future use cases. Let's think about them (please, I need your help):

- Push only: A server is pushing graph data to a single Gephi instance
The Gephi instance or the server is the master and the push server is set as read-only.

- Collaborative working
A set of clients work on the same graph. Users are tagging nodes and therefore changing some attributes. Attributes are synchronized between clients. If a client's Gephi crashes, data are not lost.

- Distributed computing
The master is tuned to distribute partial graphs to clients in order to perform distributed computing.

- Monitoring service
A daemon Java service monitors some system and maintains a graph structure. When the user wants to check the status of this graph, he launches Gephi (on another machine) and connects to this client. The master starts and asks all clients to send a list of element identifiers. Then the master sees that the Gephi client is out of date and asks the daemon client to transfer data. The user can now work with this graph and also see it change live.

I notice here that to be part of this architecture, every client has to have the graph streaming library installed and working. A socket client coming from another system that aims to push graph data to Gephi doesn't have it, of course, and would communicate directly with a client without knowing the master. The client that receives the data will eventually dispatch its changes to the master, and it should be fine.

Statistics:Posted by mbastian — 15 Jun 2010 18:31

Re: A format for Graph Streaming

2010-06-14T13:00:40+01:00

To broaden the discussion, it could be interesting to have a look at JMS (Java Message Service). It's widely used in companies to build loosely coupled architectures based on messages exchanged between producers and consumers. What we want to achieve is not so different.

Statistics:Posted by mbastian — 14 Jun 2010 13:00

Re: A format for Graph Streaming

2010-06-14T08:21:32+01:00

Thanks for your reply. Someone recommended an interesting library to me: XStream, for serialization. It is a fast and lightweight Java serialization library that supports XML and JSON. Once we have settled on the right language and a defined set of events with their parameters, using a serialization library may be a better choice than reinventing the wheel. Basically we would just create the Java event objects and serialize them.
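
The "create event objects and let a library serialize them" idea can be illustrated without XStream. The sketch below uses Python's standard dataclasses and json modules as a stand-in; the AddNodeEvent class, its fields and the "type" key are all hypothetical:

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class AddNodeEvent:
    """One event object; field names are illustrative only."""
    node_id: str
    attributes: dict = field(default_factory=dict)

def to_json(event):
    """Serialize any event dataclass, tagging it with its class name."""
    return json.dumps({"type": type(event).__name__, **asdict(event)}, sort_keys=True)

encoded = to_json(AddNodeEvent("A", {"label": "Streaming Node A", "size": 2}))
print(encoded)
```

The wire format then follows from the object definitions, so adding a new event type means adding a class rather than writing new serialization code.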

Statistics:Posted by mbastian — 14 Jun 2010 08:21

Re: A format for Graph Streaming

2010-06-08T11:55:07+01:00

Here is my previous post from the parent thread : http://forum.gephi.org/viewtopic.php?f=9&t=94#p942

Statistics:Posted by elishowk — 08 Jun 2010 11:55

Re: A format for Graph Streaming

2010-06-08T11:56:11+01:00

Hi everybody,

I agree that we should first discuss the various use cases, synchronization strategies and data-sharing scenarios,
before dealing with low-level design such as the data syntax or serialization format.

Once these needs are clarified, I think it will be much easier for us to pick the right tool and design the data model and protocol specifications,
because maybe, as Elias pointed out, existing libraries can help us in our task and abstract away low-level things like serialization or communication issues.

Thus, I will start with a couple of questions:

- What kind of data scenarios could we imagine?

Mathieu talked about synchronization between different Gephi instances.
- Does it imply real-time, or nearly real-time, synchronization?

- Should one end act as a client and the other as a server, or maybe both?

- Could other programs (e.g. a data-mining tool) communicate with Gephi too, through the same protocol?

- Could these programs be written in different programming languages or paradigms?
(e.g. a desktop application, a multi-threaded server application, a basic JavaScript webpage, a distributed "cloud" application...)

I take compatibility with other programs (not other instances of Gephi) for granted, since this is a very common need (i.e. Gephi as a client of another data server).
From my point of view, this implies platform and language independence, and hopefully existing libraries, formats and protocols solve this problem.

I can think of these examples, a bit of a "long-term plan", but well, we need brainstorming:
- a Python script of a few lines, written by a text-mining researcher, that streams data to Gephi (faster than exporting GEXF, opening GEXF, closing GEXF, modifying code, re-exporting GEXF...)
- a data-mining application in Java or C++ that delivers real-time graphs to a desktop Gephi, on the same machine or maybe over the network
- a database that answers graph queries through a web API
- a Gephi instance on a student's laptop that connects to a Gephi located on a classroom server, opens a "shared workspace" and starts tagging nodes, removing edges, and so on, then hits a "sync" button from time to time. Or lets the teacher give the course and add real-time filters, with a "slave" mode (yes, maybe I go a bit far here, and I would never have the use for it myself, compared to the other suggestions)
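
As an illustration of the first example in the list, a few lines of Python could turn an edge list into a stream of JSON events, one per line. This is a sketch only: the "an" key is taken from the JSON prototype discussed in this thread, while "ae" and its "source"/"target" fields are assumed names, and an in-memory buffer stands in for the real connection to Gephi.

```python
import io
import json

def stream_edges(edges, out):
    """Write one JSON event per line: new nodes first, then the edge."""
    seen = set()
    for source, target in edges:
        for node in (source, target):
            if node not in seen:
                seen.add(node)
                out.write(json.dumps({"an": {node: {"label": node}}}) + "\n")
        edge_id = f"{source}-{target}"
        # "ae" (add edge) and its "source"/"target" fields are assumed names
        out.write(json.dumps({"ae": {edge_id: {"source": source, "target": target}}}) + "\n")

buffer = io.StringIO()  # stands in for a socket or HTTP request body
stream_edges([("graph", "stream"), ("stream", "json")], buffer)
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Each line is a complete event, so the receiver can apply it as soon as it arrives instead of waiting for a whole file.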

Statistics:Posted by jbilcke — 08 Jun 2010 11:51

Re: A format for Graph Streaming

2010-06-08T10:06:31+01:00

Hi all,

Thank you Mathieu for the introduction. I'd just like to add a few words to this discussion about streaming format support.
As Mathieu pointed out in another discussion we had, there is an open-source project called GraphStream (http://graphstream.sourceforge.net/) that has defined a format very suitable for streaming. Maybe we can learn a bit from their experience.
Their format is based on operations directed at graph elements (graph, nodes and edges), but an improvement would be to add support for operations directed at groups of elements (all elements that satisfy some criterion). It could solve the problem Mathieu pointed out of additional events like CLEAR to avoid millions of deletes: for example, we could remove all nodes that satisfy a criterion that is always true. This could be achieved through a Filter or Predicate format definition, representing filter or predicate objects in the specified format.
As I wrote in the wiki, I prefer the JSON format over XML, as it is more suitable for streaming graph transfers (the objects and events can be reconstructed as they arrive in the stream). Other experiences, like the Twitter streaming API (http://apiwiki.twitter.com/Streaming-AP ... gResponses), also show that JSON would be more suitable, and they are even considering XML for deprecation.
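
The group-operation idea above, where CLEAR becomes a delete whose predicate is always true, can be sketched like this. Predicates are plain Python callables here only for illustration; a real format would need a serializable predicate description, and remove_nodes is a hypothetical helper:

```python
def remove_nodes(nodes, predicate):
    """Delete every node whose (id, attributes) satisfies the predicate."""
    doomed = [nid for nid, attrs in nodes.items() if predicate(nid, attrs)]
    for nid in doomed:
        del nodes[nid]
    return len(doomed)

def always_true(node_id, attrs):
    """Predicate matching every node, so the delete acts like CLEAR."""
    return True

nodes = {"A": {"size": 2}, "B": {"size": 5}, "C": {"size": 1}}
removed_small = remove_nodes(nodes, lambda nid, attrs: attrs["size"] < 3)
removed_rest = remove_nodes(nodes, always_true)
print(removed_small, removed_rest, nodes)
```

With this approach a single event type covers both selective deletes and a full clear, at the cost of having to define the predicate vocabulary.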

I hope this helps, and thank you all for your collaboration.

André Panisson

Statistics:Posted by panisson — 08 Jun 2010 10:06

Re: A format for Graph Streaming

2010-06-08T09:25:43+01:00

Hi,

It seems that GEXF is not suitable for this kind of operation, because the format aims at storing a whole graph in a file. I state this obvious fact because the operations discussed above are far from questions of data representation. What we need is a grammar for how we can act on a graph. If GEXF is a language for describing an object (the graph), we now want a language for acting on it, whatever the operations are.

Filters were a first (and successful) attempt to draw up such a grammar. As André said, such a language will contain:
* a vocabulary
* a grammar to rule how predicates work
* a network protocol (over HTTP)

AtomPub is also a good example of a data publication protocol: http://bitworking.org/projects/atom/rfc5023.html
Note that Microsoft is working on the OData specification: http://www.odata.org/

Statistics:Posted by admin — 08 Jun 2010 09:25

A format for Graph Streaming

2010-06-08T08:55:55+01:00

Hi all,

André (GSoC student for Graph Streaming) and I started a discussion about how to format graphs so they can be streamed. Help is more than welcome, as this is a difficult question.

The Graph Streaming project aims to stream data in and out of Gephi, with the ideal use case of two Gephi instances synchronizing over the network. That raises many questions, and we thought it concerns the future of the GEXF format, as one of its goals is to fulfill the needs of dynamic networks.

The question is simple, what format should we use to stream graphs over a network? The idea behind graph streaming is not only pushing, but also updates and deletes. Therefore we face a synchronization and serialization problem.

Some of the global aims we identified for such a format:
- The format should support graph topology and attributes
- It has to have an event model, where ADD, DELETE and UPDATE are event types.
- It could have additional events, like CLEAR, to avoid millions of deletes
- The format should be compact and minimize network transfer
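
The event model in the list above can be sketched as a small dispatcher. This is an assumption-laden illustration: events are dicts with a "type" field, only nodes are handled, and the field names are hypothetical rather than part of any agreed format.

```python
def apply_event(graph, event):
    """Apply one event dict to graph, a dict of node id -> attributes."""
    kind = event["type"]
    if kind == "ADD":
        graph[event["id"]] = dict(event.get("attributes", {}))
    elif kind == "UPDATE":
        graph[event["id"]].update(event["attributes"])
    elif kind == "DELETE":
        del graph[event["id"]]
    elif kind == "CLEAR":
        graph.clear()  # one event instead of one DELETE per element
    else:
        raise ValueError(f"unknown event type: {kind}")

graph = {}
for ev in [
    {"type": "ADD", "id": "A", "attributes": {"size": 2}},
    {"type": "ADD", "id": "B"},
    {"type": "UPDATE", "id": "A", "attributes": {"size": 3}},
    {"type": "DELETE", "id": "B"},
]:
    apply_event(graph, ev)
snapshot = dict(graph)
apply_event(graph, {"type": "CLEAR"})
print(snapshot, graph)
```

A receiver built this way stays in sync as long as it applies events in the order they were sent.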

About the serialization problem, we think we could propose a GEXF-like format working with JSON. The idea is not to change GEXF but to propose a new format, inspired by GEXF but with different aims. JSON would lower the size of messages a lot and fit the "network world" better than XML. Do you agree, and how do you think that is possible? Please share your experience with JSON.

For synchronization issues, feel free to comment on this point as well. Read the wiki page and imagine possible use cases. For instance, if several instances of Gephi synchronize, how do we handle versioning and keep the data consistent and up to date everywhere? Do you have in mind other projects or articles that could help us see the issues?
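
To get a feel for the size argument, the sketch below encodes the same "add node" event once as compact JSON and once as XML, then prints both byte counts. The XML element and attribute names are made up for the comparison (loosely GEXF-like), so the numbers depend on that choice; a single event is an illustration, not a benchmark.

```python
import json
import xml.etree.ElementTree as ET

node = {"id": "A", "label": "Streaming Node A", "size": 2}

# Compact JSON, using the "an" key from the prototype discussed in this thread
as_json = json.dumps(
    {"an": {node["id"]: {"label": node["label"], "size": node["size"]}}},
    separators=(",", ":"),
)

# Hypothetical XML encoding of the same event
event = ET.Element("event", type="add")
xml_node = ET.SubElement(event, "node", id=node["id"], label=node["label"])
ET.SubElement(xml_node, "attvalue", {"for": "size", "value": str(node["size"])})
as_xml = ET.tostring(event, encoding="unicode")

print(len(as_json.encode("utf-8")), len(as_xml.encode("utf-8")))
```

Most of the difference comes from closing tags and attribute names, overhead that repeats for every event in a stream.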

Statistics:Posted by mbastian — 08 Jun 2010 08:55