Now that I have real-world use cases for analyzing dynamic data, I have a better overview of what is worth implementing.
There is a problem with treating the "(Dynamic) Degree Power Law" as a metric: the power law is not a metric but a model that may or may not fit the degree distribution. The metric is therefore "(Dynamic) Degree Distribution", because the question is "What is the distribution of the degrees over time?". One can then try to fit it to a power-law distribution or a normal distribution, for example. So when computing the degree distribution, the user should select which models to fit (power law and/or normal), and the results will appear in the report.
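To make the metric concrete, here is a minimal Python sketch of a degree distribution computed per time window. The input format (a list of `(source, target, timestamp)` tuples), the function name and the half-open fixed-size windows are assumptions for illustration, not the way Gephi stores or processes dynamic graphs.

```python
from collections import Counter, defaultdict

def degree_distribution_per_window(edges, start, end, window):
    """Return {window_index: Counter({degree: number_of_nodes})}.

    edges: iterable of (source, target, timestamp) tuples (assumed format).
    Windows are half-open: [start + i*window, start + (i+1)*window).
    Nodes without any edge in a window are not counted (no degree 0 bin).
    """
    neighbors = defaultdict(lambda: defaultdict(set))
    for source, target, t in edges:
        if not (start <= t < end) or source == target:
            continue  # skip out-of-range edges and self-loops
        i = int((t - start) // window)
        # sets make repeated edges inside a window count once
        neighbors[i][source].add(target)
        neighbors[i][target].add(source)
    return {
        i: Counter(len(adj) for adj in per_node.values())
        for i, per_node in sorted(neighbors.items())
    }

edges = [("a", "b", 0), ("a", "c", 1), ("b", "c", 5), ("c", "d", 6), ("c", "d", 7)]
dist = degree_distribution_per_window(edges, start=0, end=10, window=5)
```

Each per-window `Counter` is the raw distribution; fitting a power law or a normal model is a separate step applied to it.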
Other basic yet useful distributions to analyze are:
- Number of nodes in each time window.
- Number of nodes in consecutive time windows.
- Connected components in each time window.
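These three series could be sketched as follows. The edge format `(source, target, timestamp)` and all function names are hypothetical, and "nodes in consecutive time windows" is interpreted here as the number of nodes present in both a window and the one just before it, which is only one possible reading.

```python
from collections import defaultdict

def window_edges(edges, start, window):
    """Group (source, target, timestamp) edges by time-window index."""
    buckets = defaultdict(list)
    for source, target, t in edges:
        if t >= start:
            buckets[int((t - start) // window)].append((source, target))
    return buckets

def count_components(pairs):
    """Count connected components among the nodes touched by pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in pairs:
        parent[find(u)] = find(v)
    return len({find(x) for x in list(parent)})

def basic_series(edges, start, window):
    buckets = window_edges(edges, start, window)
    indices = sorted(buckets)
    nodes = {i: {n for pair in buckets[i] for n in pair} for i in indices}
    return {
        "node_count": {i: len(nodes[i]) for i in indices},
        # assumed meaning: nodes present in window i and in window i - 1
        "shared_with_previous": {
            i: len(nodes[i] & nodes[i - 1]) for i in indices if i - 1 in nodes
        },
        "components": {i: count_components(buckets[i]) for i in indices},
    }

edges = [("a", "b", 0), ("c", "d", 1), ("b", "c", 5), ("e", "f", 6), ("a", "b", 7)]
series = basic_series(edges, start=0, window=5)
```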
The power law is not the only model whose fit one would like to test; the Gaussian model is also important. Statistically speaking, a power law indicates that the data is heterogeneous, whereas a Gaussian indicates that it is homogeneous.
Plotting scales: it is important to visualize the distributions at three different scales: lin-lin, lin-log and log-log. All three should always be available, including for static metrics.
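The reason all three scales matter is that a power law shows up as a straight line only in log-log, which a small sketch can illustrate. The `SCALES` table and `project` function are hypothetical names, and "lin-log" is assumed here to mean a linear x axis with a logarithmic y axis.

```python
import math

# scale name -> (log x axis?, log y axis?); the lin-log convention is an assumption
SCALES = {
    "lin-lin": (False, False),
    "lin-log": (False, True),
    "log-log": (True, True),
}

def project(points, scale):
    """Map (x, y) points to plot coordinates for the given scale.

    Points with a non-positive coordinate on a logarithmic axis are dropped.
    """
    log_x, log_y = SCALES[scale]
    out = []
    for x, y in points:
        if (log_x and x <= 0) or (log_y and y <= 0):
            continue
        out.append((math.log10(x) if log_x else x,
                    math.log10(y) if log_y else y))
    return out

# y = 1000 * x^-2: a straight line of slope -2 once projected to log-log
points = [(k, 1000 * k ** -2) for k in (1, 10, 100)]
loglog = project(points, "log-log")
```

Dropping non-positive values is one simple choice; a report could instead warn the user that some points cannot be shown on a logarithmic axis.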
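One should then evaluate how well the model fits the distribution: estimate the model parameters by maximum likelihood (MLE) and measure the distance between the fitted model and the data with the Kolmogorov-Smirnov (KS) test, so that one has a clue as to whether there really is a power law or not. These tests should also be added for static metrics.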
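As a rough sketch of these two steps for the power-law case, the code below uses the standard continuous maximum-likelihood estimator, alpha = 1 + n / sum(ln(x_i / xmin)), and the KS distance between the empirical and the fitted cumulative distributions. It assumes that `xmin` is supplied by the user, treats degrees as continuous values (an approximation for integer data), and returns only the KS statistic, not a p-value. The function names are hypothetical.

```python
import math

def fit_power_law(values, xmin):
    """Continuous MLE of the exponent for the tail x >= xmin.

    Returns (alpha, sorted tail). xmin must be positive.
    """
    tail = sorted(x for x in values if x >= xmin)
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above xmin")
    return 1.0 + len(tail) / log_sum, tail

def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical CDF and the power-law CDF."""
    n = len(tail)
    d = 0.0
    for i, x in enumerate(tail):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # the empirical CDF jumps from i/n to (i+1)/n at x: check both sides
        d = max(d, abs(model - i / n), abs(model - (i + 1) / n))
    return d

degrees = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 13]
alpha, tail = fit_power_law(degrees, xmin=1)
distance = ks_distance(tail, xmin=1, alpha=alpha)
```

A small distance suggests that the power law is a plausible model, while a large one suggests it is not; deciding what counts as small requires a significance test on top of this statistic.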
User Interface of the settings
The following remarks are based on the current Dynamic Degree Distribution settings. The dialog could be slightly improved by:
- Showing the percentage value of the time interval and making it directly editable; otherwise the slider is imprecise, as it is on Linux with GNOME.
- Adding explanations on the selected estimator.
- Displaying the calendar only if the time is encoded as dates.
- Displaying report options: which models should the distribution be fitted to?
- Displaying report options: do we want to perform an MLE and/or a KS test?
Summary - Changes in the Statistics module
The current distinction between Network, Node and Edge Overview is fine, but I think some metrics are not in the right place. HITS and PageRank compute an individual value for each node, so they should be placed in Node Overview. Network Overview metrics would then compute only a single number as a result, while Node and Edge Overview metrics would always compute distributions of values.
We currently have the following distributions for nodes:
- Clustering Coefficient
- Eigenvector Centrality
- Betweenness, Closeness Centrality and Eccentricity
- the evolution over time (x axis = time, y axis = value).
- the global distribution (x axis = value, y axis = number of times this value appears)
- the inverse cumulative distribution of the values
- normal distribution / is data homogeneous?
- power law distribution / is data heterogeneous?
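For reference, the inverse cumulative distribution can be derived directly from the global distribution. The sketch below assumes it means the fraction of values greater than or equal to x; the function name is hypothetical.

```python
from collections import Counter

def inverse_cumulative(values):
    """Return sorted [(x, fraction of values >= x)] for each distinct x."""
    counts = Counter(values)  # this is the global distribution
    n = len(values)
    remaining = n
    out = []
    for x in sorted(counts):
        out.append((x, remaining / n))
        remaining -= counts[x]
    return out

ccdf = inverse_cumulative([1, 1, 2, 3, 3, 3])
```

This curve is usually smoother than the raw distribution, which makes it easier to read on a log-log plot.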
Finally, the static and dynamic versions of the same distribution should be started with the same button. The UI should display the time window and estimator settings only if the current graph is dynamic, and the report should be adapted accordingly.
EDIT: this is not so good, as some metrics will run only on dynamic graphs. We could instead have two buttons next to the "Settings" one: "Static" and "Dynamic". These buttons would update the list of available metrics in the Statistics panel, which would be clear and would not require refactoring the currently implemented metrics.
The internal code should provide APIs for the following:
- diagram types
- diagram scales
- fit tests
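To illustrate the shape such APIs might take, here is a purely hypothetical sketch. It is written in Python for brevity, none of these names exist in Gephi, and a real implementation would follow the conventions of the existing Statistics API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Protocol, Sequence

class DiagramType(Enum):
    EVOLUTION = "evolution over time"
    DISTRIBUTION = "global distribution"
    INVERSE_CUMULATIVE = "inverse cumulative distribution"

class DiagramScale(Enum):
    LIN_LIN = "lin-lin"
    LIN_LOG = "lin-log"
    LOG_LOG = "log-log"

class FitTest(Protocol):
    """Anything that scores how well a model fits a list of values."""
    name: str

    def run(self, values: Sequence[float]) -> float: ...

@dataclass
class ReportOptions:
    """What a metric report should contain (all diagrams and scales by default)."""
    diagrams: tuple = tuple(DiagramType)
    scales: tuple = tuple(DiagramScale)
    fit_tests: tuple = ()  # FitTest instances selected by the user

options = ReportOptions(scales=(DiagramScale.LOG_LOG,))
```

Keeping the three concerns separate would let static and dynamic metrics share the same report options.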
Based on your own practice, do you think this process is general and correct enough to be widely used?
I based these remarks on this paper.