Evaluation¶
The evaluation of Community Discovery algorithms is not an easy task.
CDlib
implements two families of evaluation strategies:
- Internal evaluation through quality scores
- External evaluation through partitions comparison
Fitness Functions¶
Fitness functions allows to summarize the characteristics of a computed set of communities. CDlib
implements the following quality scores:
avg_distance (graph, communities, **kwargs) |
Average distance. |
avg_embeddedness (graph, communities, **kwargs) |
Average embeddedness of nodes within the community. |
average_internal_degree (graph, community, …) |
The average internal degree of the community set. |
avg_transitivity (graph, communities, **kwargs) |
Average transitivity. |
conductance (graph, community, **kwargs) |
Fraction of total edge volume that points outside the community. |
cut_ratio (graph, community, **kwargs) |
Fraction of existing edges (out of all possible edges) leaving the community. |
edges_inside (graph, community, **kwargs) |
Number of edges internal to the community. |
expansion (graph, community, **kwargs) |
Number of edges per community node that point outside the cluster. |
fraction_over_median_degree (graph, …) |
Fraction of community nodes of having internal degree higher than the median degree value. |
hub_dominance (graph, communities, **kwargs) |
Hub dominance. |
internal_edge_density (graph, community, **kwargs) |
The internal density of the community set. |
normalized_cut (graph, community, **kwargs) |
Normalized variant of the Cut-Ratio |
max_odf (graph, community, **kwargs) |
Maximum fraction of edges of a node of a community that point outside the community itself. |
avg_odf (graph, community, **kwargs) |
Average fraction of edges of a node of a community that point outside the community itself. |
flake_odf (graph, community, **kwargs) |
Fraction of nodes in S that have fewer edges pointing inside than to the outside of the community. |
scaled_density (graph, communities, **kwargs) |
Scaled density. |
significance (graph, communities, **kwargs) |
Significance estimates how likely a partition of dense communities appear in a random graph. |
size (graph, communities, **kwargs) |
Size is the number of nodes in the community |
surprise (graph, communities, **kwargs) |
Surprise is statistical approach proposes a quality metric assuming that edges between vertices emerge randomly according to a hyper-geometric distribution. |
triangle_participation_ratio (graph, …) |
Fraction of community nodes that belong to a triad. |
purity (communities) |
Purity is the product of the frequencies of the most frequent labels carried by the nodes within the communities |
Among the fitness function a well-defined family of measures is the Modularity-based one:
erdos_renyi_modularity (graph, communities, …) |
Erdos-Renyi modularity is a variation of the Newman-Girvan one. |
link_modularity (graph, communities, **kwargs) |
Quality function designed for directed graphs with overlapping communities. |
modularity_density (graph, communities, lmbd, …) |
The modularity density is one of several propositions that envisioned to palliate the resolution limit issue of modularity based measures. |
newman_girvan_modularity (graph, communities, …) |
Difference the fraction of intra community edges of a partition with the expected number of such edges if distributed according to a null model. |
z_modularity (graph, communities, **kwargs) |
Z-modularity is another variant of the standard modularity proposed to avoid the resolution limit. |
Some measures will return an instance of FitnessResult
that takes together min/max/mean/std values of the computed index.
FitnessResult (min, max, score, std) |
Partition Comparisons¶
It is often useful to compare different graph partition to assess their resemblance (i.e., to perform ground truth testing).
CDlib
implements the following partition comparisons scores:
adjusted_mutual_information (first_partition, …) |
Adjusted Mutual Information between two clusterings. |
adjusted_rand_index (first_partition, …) |
Rand index adjusted for chance. |
f1 (first_partition, second_partition) |
Compute the average F1 score of the optimal algorithms matches among the partitions in input. |
nf1 (first_partition, second_partition) |
Compute the Normalized F1 score of the optimal algorithms matches among the partitions in input. |
normalized_mutual_information (…) |
Normalized Mutual Information between two clusterings. |
omega (first_partition, second_partition) |
Index of resemblance for overlapping, complete coverage, network clusterings. |
overlapping_normalized_mutual_information_LFK (…) |
Overlapping Normalized Mutual Information between two clusterings. |
overlapping_normalized_mutual_information_MGH (…) |
Overlapping Normalized Mutual Information between two clusterings. |
variation_of_information (first_partition, …) |
Variation of Information among two nodes partitions. |
Some measures will return an instance of MatchingResult
that takes together mean and standard deviation values of the computed index.
MatchingResult (score, std) |