A format for Graph Streaming

All questions about the GEXF (see http://gexf.net before)
User avatar
mbastian
Gephi Architect
Posts:728
Joined:10 Dec 2009 10:11
Location:San Francisco, CA
[phpBB Debug] PHP Warning: in file [ROOT]/vendor/twig/twig/lib/Twig/Extension/Core.php on line 1275: count(): Parameter must be an array or an object that implements Countable
A format for Graph Streaming

Post by mbastian » 08 Jun 2010 08:55

Hi all,

André (GSOC Student for Graph Streaming) and I started a discussion about how to format graphs in order to be streamed. Help is more than welcome, as this is a difficult question.

The Graph Streaming project aims to be able to stream data in and out Gephi, with the ideal use-case of two Gephi instances synchronizing over the network. That asks many questions and we though that is concerning the future of the GEXF format, as one of it's goal is to fulfill dynamic networks needs.

The question is simple, what format should we use to stream graphs over a network? The idea behind graph streaming is not only pushing, but also updates and deletes. Therefore we face a synchronization and serialization problem.

Some of global aims we identified for such a format
- The format should support graph topology and attributes
- It has to have an event model, where ADD, DELETE and UPDATE are event types.
- It could have additional events, like CLEAR, to avoid millions of deletes
- The format should be compact and minimize network transfer

About the serialization problem we think we could propose a GEXF format working with JSON. The idea is not to change GEXF but to propose a new format, inspired from GEXF but having different aims. JSON would lower the size of messages a lot and fit more to the "network world" than XML. Do you agree and how do you think that is possible? Please share your experience about JSON.

For synchronization issues, feel free to comment this point as well. Read the wiki page and imagine possible use cases. For instance if several instances of Gephi synchronize, how to make versionning and keep the data consistent and up to date everywhere? Do you have in mind other projects or articles that could help to see issues?

admin
Gephi Community Manager
Posts:964
Joined:09 Dec 2009 14:41
[phpBB Debug] PHP Warning: in file [ROOT]/vendor/twig/twig/lib/Twig/Extension/Core.php on line 1275: count(): Parameter must be an array or an object that implements Countable

Re: A format for Graph Streaming

Post by admin » 08 Jun 2010 09:25

Hi,

It seems that GEXF is not suitable for this kind of operations, because the format aims at stocking a whole graph in a file. I say that evident fact because the operations discussed above are far from questions like data representation. It's a grammar on how we can act on a graph that we need. If the GEXF is a language for describing an object (the graph), now we want a language for acting on it, whatever the operations are.

Filters were a first (and successful) tentative to draw such a grammar. Like André said, such language will contain:
* a vocabulary
* a grammar to rule how predicates works
* a network protocol (over HTTP)

AtomPub si also a good example on data publication protocol: http://bitworking.org/projects/atom/rfc5023.html
Note that Microsoft works on the specifications of OData: http://www.odata.org/

panisson
Gephi Plugin Developer
Posts:19
Joined:08 Apr 2010 18:15
Location:Turin
Contact:

Re: A format for Graph Streaming

Post by panisson » 08 Jun 2010 10:06

Hi all,

Thank you Mathieu for the introduction. I'll just like to add some words to contribute to this discussion about a Streaming format support.
As Mathieu pointed out in other discussion we had, there is an opensource project called GraphStream http://graphstream.sourceforge.net/ in which they had defined a format very suitable for streaming. Maybe we can learn a bit from their experience.
Their format is based on operations directed to graph elements (graph, nodes and edges), but an improvement to this format would be to add support to operations directed to groups of elements (all elements that satisfy some criteria). It could solve the problem pointed by Mathieu of additional events like CLEAR, to avoid millions of deletes - for example, we could remove all nodes that satisfy a criteria that is always true. This could be acquired through a Filter or Predicate format definition, representing filters or predicate objects in the specified format.
As I wrote in the wiki, I have a preference on the JSON format over the XML format, as it is more suitable for streaming graph transfers (the objects and events can be reconstructed as they arrive in the stream). Also other experiences like the Twitter streaming API (http://apiwiki.twitter.com/Streaming-AP ... gResponses) show that JSON would be more suitable for it, and they are even considering XML for deprecation.

I hope that this could help, and thank you all for your collaboration

André Panisson

User avatar
jbilcke
Gephi Core Developer
Posts:41
Joined:10 Dec 2009 17:48
Location:Paris, France
Contact:

Re: A format for Graph Streaming

Post by jbilcke » 08 Jun 2010 11:51

Hi everybody,

I agree to first discuss about the various use cases, synchronization strategies and data-sharing scenarios,
before dealing with the low-level of design like the data syntax or serialization format.

Once these needs will be clarified, I think it will be much easier for us to pickup the right tool and design the data model and protocol specifications,
because maybe, as Elias pointed out, existing libraries can help us in our task, and abstract low level things like serialization or communication issues.

Thus, I will start with a couple of questions :

- What kind of data scenario could we imagine ?

Mathieu talked about synchronization between different Gephi instances. How
- Do it imply real, or nearly real-time synchronization ?

- Should one end act as a client, the other as a server, or maybe both ?

- Could other programs (eg. a data-mining tool) communicate with Gephi too, through the same protocol ?

- Could these programs be written in different programming languages or paradigms ?
(eg. a desktop application, a multi-threaded server application, a basic javascript webpage, a distributed "cloud" application..).

I take compatibility with other programs (not other instances of Gephi) for granted, since this is a very common need (ie. Gephi as a client of another data server).
From my point of view, this imply platform and langage independence, and hopefully existing libraries, formats and protocols solve this problem.

I can think about these examples, a bit "long-term plan", but well we need brainstorming :
- a Python script written in a few lines of code, written by a text-mining researcher, that streams data to Gephi (faster than exporting gexf, opening gexf, closing gexf, modifying code, re-exporting gexf..)
- a data-mining application in Java or C++, that delivers real-time graphs to a desktop Gephi, on the same machine or maybe on the network
- a database that answers graph queries through a web API
- a Gephi instance on a laptop of a student, that connect to a Gephi located on a classroom server, open a "shared workspace" and start tagging nodes, removing edges.. then hit from time to time a "sync" button. Or let the teacher do it's course and add realtime filters, with a "slave" mode (yes, maybe I go a bit far here, and would never have the usage for myself, compared to other suggestions ;))
Last edited by jbilcke on 08 Jun 2010 11:56, edited 2 times in total.

elishowk
Posts:3
Joined:04 Jun 2010 13:14
[phpBB Debug] PHP Warning: in file [ROOT]/vendor/twig/twig/lib/Twig/Extension/Core.php on line 1275: count(): Parameter must be an array or an object that implements Countable

Re: A format for Graph Streaming

Post by elishowk » 08 Jun 2010 11:55

Here is my previous post from the parent thread : http://forum.gephi.org/viewtopic.php?f=9&t=94#p942

User avatar
mbastian
Gephi Architect
Posts:728
Joined:10 Dec 2009 10:11
Location:San Francisco, CA
[phpBB Debug] PHP Warning: in file [ROOT]/vendor/twig/twig/lib/Twig/Extension/Core.php on line 1275: count(): Parameter must be an array or an object that implements Countable

Re: A format for Graph Streaming

Post by mbastian » 14 Jun 2010 08:21

Thanks for your reply, Someone recommended me an interesting library: XStream for serialization. The library is a fast and lightweight Java serialization library which supports XML and JSON. When we will come up with the good language and a defined set of events with their parameter, using a serialization library may be a better choice than reinventing the wheel. Basically we would just create the Java event objets and serialize them.

User avatar
mbastian
Gephi Architect
Posts:728
Joined:10 Dec 2009 10:11
Location:San Francisco, CA
[phpBB Debug] PHP Warning: in file [ROOT]/vendor/twig/twig/lib/Twig/Extension/Core.php on line 1275: count(): Parameter must be an array or an object that implements Countable

Re: A format for Graph Streaming

Post by mbastian » 14 Jun 2010 13:00

To enlarge the discussion, it could be interesting to have a look at JMS (Java Message Service). It's widely use in companies to make loosely coupled architecture based on messages exchanged between producers and consumers. What we want to achieve is not so much different

User avatar
mbastian
Gephi Architect
Posts:728
Joined:10 Dec 2009 10:11
Location:San Francisco, CA
[phpBB Debug] PHP Warning: in file [ROOT]/vendor/twig/twig/lib/Twig/Extension/Core.php on line 1275: count(): Parameter must be an array or an object that implements Countable

Re: A format for Graph Streaming

Post by mbastian » 15 Jun 2010 18:31

I was thinking a bit about all the synchronization strategies, and came with some (ambitious) ideas, inspired by BigTable and Chubby papers by Google. I was thinking about this because I think basic client-server synchronization would cause problems and limitations if we want to scale our ideas. Let me develop and we will discuss these things together.

Google infrastructure has a master instead of a server. The master here is the directory and knows exactly the status of all clients. It knows if a client needs to refresh its data chunks and give order to another client close from him to transfer data. So data are transfered from peers to peers, avoiding the server bottleneck. Google adds to this system master election (with Chubby) and master replication, but it's not important for us.
What we want to do is replicate the same graph data on a list of clients. The master is just a role and it doesn't avoid the master machine to be a client as well.

The first thing we need is something to know if a client is out-to-date. Google uses a timestamp value and I think it's the best choice. If a client has an older timestamp than the most recent one, it has to be updated.
The second thing we need is to exactly identify which elements are out to date. For that we can associate an identifier and a timestamp. When a client updates some elements (nodes, edges, attributes) it sends the list of modified identifiers to the master and the master will ask the client to transfer new data to other clients.

I think this is a flexible system and could work wit most of the future uses cases. Let's think about them (please I need your help):

- Push only: A server is pushing graph data to a single Gephi instance
The Gephi instance or the server is the master and the push server is set as read-only.

- Collaborative working
A set of clients work on the same graph. Users are tagging nodes and therefore change some attributes. Attributes are synchronized between clients. If a client crash his Gephi, data are not lost.

- Distributed computing
The master is tuned to distribute partial graphs to client in order to perform distributed computing.

- Monitoring service
A daemon Java service is monitoring some system and maintain a graph structure. When the user wants to check the status of this graph he launches Gephi (on other machine) and connects to this client. The master starts and asks all clients to send a list of elements identifiers. Then, the master sees the Gephi client is out to date and asks the daemon client to transfer data. The user can now work with this graph and see it live changing also.

I notice here that to be part of this architecture, every client has to have the graph streaming library installed and working. A socket client coming from another system, which aims to push graph data to Gephi doesn't have this of course and would directly communicate with a client, not knowing the master. The client who would receive data will eventually dispatch his changes to the master and it should be fine.

User avatar
mbastian
Gephi Architect
Posts:728
Joined:10 Dec 2009 10:11
Location:San Francisco, CA
[phpBB Debug] PHP Warning: in file [ROOT]/vendor/twig/twig/lib/Twig/Extension/Core.php on line 1275: count(): Parameter must be an array or an object that implements Countable

Re: A format for Graph Streaming

Post by mbastian » 23 Jun 2010 06:30


admin
Gephi Community Manager
Posts:964
Joined:09 Dec 2009 14:41
[phpBB Debug] PHP Warning: in file [ROOT]/vendor/twig/twig/lib/Twig/Extension/Core.php on line 1275: count(): Parameter must be an array or an object that implements Countable

Re: A format for Graph Streaming

Post by admin » 23 Jun 2010 11:32

It seems fast but we must control the end-points. Clients are only available for Java and C++.

A criticism: http://blogs.tedneward.com/2008/07/11/S ... l+XML.aspx
"Protocol Buffers, as with any binary protocol format and/or RPC mechanism, are great for those situations where performance is critical and both ends of the system are well-known and controlled."

Some benchmarks: http://wiki.github.com/eishay/jvm-serializers/

BP/Thrift comparison: http://stackoverflow.com/questions/6931 ... ol-buffers

An interesting discussion: http://stackoverflow.com/questions/2966 ... -ejb-other

A Protocol Buffe plugin in NetBeans IDE: http://netbeans.dzone.com/news/intervie ... tocol-buff


But in a first time, I think we need to address the main issues because it's not only about serialization performances. I see:
  • shared schema evolution and backward support
  • communication over clients in different languages
  • transportation facilities
  • scalability

Post Reply
[phpBB Debug] PHP Warning: in file [ROOT]/vendor/twig/twig/lib/Twig/Extension/Core.php on line 1275: count(): Parameter must be an array or an object that implements Countable