(Dynamic) Statistics

Discussion about future features
admin
Gephi Community Manager
Posts:964
Joined:09 Dec 2009 14:41
(Dynamic) Statistics

Post by admin » 17 Nov 2010 20:31

This thread is a discussion of the features and metrics of dynamic statistics. It should help complete the current specifications and guide the refactoring of the Statistics module.

As I now have real-world use cases of analyzing dynamic data, I have a somewhat better overview of what is worth implementing.

Metrics
Treating the "(Dynamic) Degree Power Law" as a metric is a problem, because a power law is not a metric but a model that may fit the degree distribution. The metric is therefore "(Dynamic) Degree Distribution", since the question is "What is the distribution of the degrees over time?". One can then try to fit it to a power law or a normal distribution, for example. So when computing the degree distribution, the user should select the models to fit (power law and/or normal), and the results will appear in the report.
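To make the "distribution of the degrees over time" concrete, here is a minimal sketch in Python. It is not Gephi code: the edge representation (tuples `(u, v, start, end)`), the function names and the fixed-width window are all assumptions made for illustration. Nodes without any active edge in a window are not counted, since only edges are known here.

```python
from collections import Counter

def degree_distribution(edges, t_start, t_end):
    """Return {degree: number of nodes} for edges active in [t_start, t_end)."""
    degree = Counter()
    seen = set()
    for u, v, start, end in edges:
        # An edge is active if its lifetime overlaps the window.
        if start < t_end and end > t_start and u != v:
            key = frozenset((u, v))  # undirected; ignore repeated edges
            if key not in seen:
                seen.add(key)
                degree[u] += 1
                degree[v] += 1
    return dict(Counter(degree.values()))

def dynamic_degree_distribution(edges, t_min, t_max, window):
    """Return a list of (window start, degree distribution) pairs."""
    result = []
    t = t_min
    while t < t_max:
        result.append((t, degree_distribution(edges, t, min(t + window, t_max))))
        t += window
    return result

# Hypothetical data: (source, target, start time, end time).
edges = [("a", "b", 0, 10), ("b", "c", 5, 15), ("a", "c", 12, 20), ("c", "d", 0, 4)]
distributions = dynamic_degree_distribution(edges, 0, 20, 10)
```

Each entry of `distributions` is one degree distribution per time window; fitting a power law or a normal distribution would then be a separate step applied to each of them.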

There are other "basic" distributions that are nevertheless useful to analyze:
  • Number of nodes in each time window.
  • Number of nodes in consecutive time windows.
  • Connected components at each time-window.
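As an illustration of these three basic measures, here is a small sketch. It assumes each time window is already available as a snapshot `(set of nodes, list of edges)`, which is a hypothetical representation, and it reads "nodes in consecutive time windows" as the nodes present in both a window and the previous one, which is my interpretation. Every edge endpoint is assumed to be in the node set of its snapshot.

```python
def connected_components(nodes, edges):
    """Count connected components with a simple union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})

def window_summary(snapshots):
    """Return one row of basic counts per time window."""
    rows = []
    previous = set()
    for nodes, edges in snapshots:
        rows.append({
            "nodes": len(nodes),
            # Assumed meaning: nodes also present in the previous window.
            "kept_from_previous": len(nodes & previous),
            "components": connected_components(nodes, edges),
        })
        previous = nodes
    return rows

# Hypothetical snapshots for two consecutive windows.
snapshots = [
    ({"a", "b", "c"}, [("a", "b")]),
    ({"b", "c", "d", "e"}, [("b", "c"), ("d", "e")]),
]
rows = window_summary(snapshots)
```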
Reports - fitting distributions to models
The power law is not the only model whose fit one would like to test; the Gaussian model is also important. A power law indicates that the data is heterogeneous, whereas a Gaussian indicates that it is homogeneous, always from a statistical point of view.

Plotting scales: it is important to visualize the distributions at 3 different scales: lin-lin, lin-log and log-log. They should always be available, for static metrics as well.

Then one should evaluate the goodness of fit of the distributions with maximum likelihood estimation (MLE), or measure their distance to the model with the Kolmogorov-Smirnov (KS) test, so that one has a clue whether there really is a power law or not. This test should be added for static metrics as well.
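A rough sketch of what these two steps could look like, using only the Python standard library. The function names are made up for illustration. The exponent estimate is the standard MLE for a continuous power law with a known lower bound `xmin`; degrees are discrete, so this is only an approximation, and choosing `xmin` is left out. The KS distance is the largest gap between the empirical CDF and the model CDF.

```python
import math

def fit_power_law(values, xmin):
    """MLE of alpha for a continuous power law p(x) ~ x^(-alpha), x >= xmin."""
    tail = [x for x in values if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above xmin")
    return 1.0 + len(tail) / log_sum

def power_law_cdf(alpha, xmin):
    """CDF of the continuous power law, valid for x >= xmin and alpha > 1."""
    return lambda x: 1.0 - (x / xmin) ** (1.0 - alpha)

def normal_cdf(mu, sigma):
    """CDF of the normal distribution with mean mu and std deviation sigma."""
    return lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_distance(values, cdf):
    """Largest gap between the empirical CDF of values and the model CDF."""
    data = sorted(values)
    n = len(data)
    d = 0.0
    for i, x in enumerate(data):
        model = cdf(x)
        # Compare the model with the empirical CDF just before and at x.
        d = max(d, abs((i + 1) / n - model), abs(i / n - model))
    return d

# Hypothetical degree sample.
degrees = [1, 2, 4, 8]
alpha = fit_power_law(degrees, xmin=1)
distance = ks_distance(degrees, power_law_cdf(alpha, 1))
```

A small distance suggests the model is plausible; turning it into a p-value needs an extra step (for example resampling), which is not shown here.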

User Interface of the settings
The following remarks are based on the current Dynamic Degree Distribution settings, which could be slightly improved by:
  • Showing the time interval as a percentage value that can be edited directly; otherwise the slider is imprecise, for example on Linux GNOME.
  • Adding explanations on the selected estimator.
  • Displaying the calendar only if the time is encoded as dates.
  • Displaying report options: to what models we would like to fit the distribution?
  • Displaying report options: do we want to perform an MLE and/or a KS test?

Summary - Changes in the Statistics module

Metrics classification
The current distinction between Network, Node and Edge Overview is fine, but I think some metrics are not in the right place. HITS and PageRank compute an individual value for each node, so they should be placed inside Node Overview. Network Overview metrics would thus only compute a single number as a result, while Node and Edge Overview metrics would always compute distributions of values.

Process
We currently have the following distributions for nodes:
  • Degree
  • Clustering Coefficient
  • Eigenvector Centrality
  • Betweenness, Closeness Centrality and Eccentricity
For each of these metrics, I'd like to see a diagram of:
  • the evolution over time (x axis = time, y axis = value).
  • the global distribution (x axis = value, y axis = number of times this value appears)
  • the inverse cumulative distribution of the values
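To clarify what the three diagrams would plot, here is a minimal sketch. The function names are hypothetical, and summarizing each time window by the mean value is my own assumption for the evolution diagram; another summary could be used instead. The inverse cumulative distribution is taken as the fraction of observations greater than or equal to each value.

```python
from collections import Counter

def evolution(values_by_time):
    """(time, mean value) pairs: one summary number per time window."""
    return [(t, sum(vals) / len(vals)) for t, vals in sorted(values_by_time.items())]

def global_distribution(values):
    """(value, number of times it appears) pairs, sorted by value."""
    return sorted(Counter(values).items())

def inverse_cumulative(values):
    """(value, fraction of observations >= value) pairs, sorted by value."""
    n = len(values)
    points = []
    remaining = n
    for value, count in global_distribution(values):
        points.append((value, remaining / n))
        remaining -= count
    return points

# Hypothetical degree values.
degrees = [1, 1, 2, 3]
histogram = global_distribution(degrees)
ccdf = inverse_cumulative(degrees)
trend = evolution({0: [1, 3], 1: [2, 2, 5]})
```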
For each of these distributions, I'd like to view them at the following scales:
  • lin-lin
  • lin-log
  • log-log
And I'd like to evaluate their goodness of fit to the following models:
  • normal distribution / is data homogeneous?
  • power law distribution / is data heterogeneous?
UI simplification
Finally, static and dynamic metrics of the same distribution should be started with the same button. The UI should display the time window and estimator settings only if the current graph is dynamic, and the report should be adapted accordingly.
EDIT: this is not so good, as some metrics will run only on dynamic graphs. We could have two buttons next to the "Settings" one: "Static" and "Dynamic". These buttons would update the list of available metrics in the Statistics Panel, so this would be clear and would not require refactoring the currently implemented metrics.

Extensibility
Inner code should provide the following APIs on these points:
  • diagram types
  • diagram scales
  • models
  • fit tests
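As a rough idea of what such an extension point could look like, here is a sketch of a registry for diagram scales, written in Python for brevity. It is not the Gephi API; `SCALES` and `register_scale` are hypothetical names, and "lin-log" is assumed to mean a linear x axis with a logarithmic y axis. The same pattern could apply to diagram types, models and fit tests.

```python
import math
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Scale = Callable[[List[Point]], List[Point]]

# Hypothetical registry: scale name -> function transforming the points.
SCALES: Dict[str, Scale] = {}

def register_scale(name: str):
    """Decorator that makes a scale available under the given name."""
    def decorator(fn: Scale) -> Scale:
        SCALES[name] = fn
        return fn
    return decorator

@register_scale("lin-lin")
def lin_lin(points: List[Point]) -> List[Point]:
    return list(points)

@register_scale("lin-log")
def lin_log(points: List[Point]) -> List[Point]:
    # Assumed: linear x, logarithmic y; non-positive y cannot be shown.
    return [(x, math.log10(y)) for x, y in points if y > 0]

@register_scale("log-log")
def log_log(points: List[Point]) -> List[Point]:
    # Points with a non-positive coordinate cannot be shown.
    return [(math.log10(x), math.log10(y)) for x, y in points if x > 0 and y > 0]

scaled = SCALES["log-log"]([(10, 100), (0, 5)])
```

A plugin would then only need to register a new function to add a scale, without touching the existing metrics.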
Question
Considering your own practice, do you think this process is general and correct enough to be widely used?
I based these remarks on this paper.

morenobonaventura
Posts:1
Joined:11 Sep 2013 13:51
Location:Italy

Re: (Dynamic) Statistics

Post by morenobonaventura » 12 Sep 2013 10:46

Dear Sebastiona,

it would be great to have more control over the data output.
The scales:
lin-lin
lin-log
log-log
are really useful!!!! ;-)

Thanks for your great work with Gephi, and good luck with linkurio.us! http://linkurio.us/
