CDlib - Community Discovery Library

CDlib is a Python software package that allows you to extract, compare, and evaluate communities from complex networks.

The library provides a standardized input/output for several existing Community Discovery algorithms. The implementations of all CD algorithms are inherited from existing projects, each one of them acknowledged in the dedicated method reference page.

Date Python Versions Main Author GitHub PyPI
2021-03-03 3.7-3.8 Giulio Rossetti Source Distribution

CDlib Dev Team

Name Contribution
Giulio Rossetti Library Design/Documentation
Letizia Milli Community Models Integration
Rémy Cazabet Visualization
Salvatore Citraro Community Models Integration

Overview

CDlib is a Python language software package for the extraction, comparison, and evaluation of communities from complex networks.

Who uses CDlib?

The potential audience for CDlib includes mathematicians, physicists, biologists, computer scientists, and social scientists.

Goals

CDlib is built upon the NetworkX python library and is intended to provide:

  • a standard programming interface and community discovery implementations that are suitable for many applications,
  • a rapid development environment for collaborative, multidisciplinary projects.

The Python CDlib library

CDlib is a powerful Python package that allows simple and flexible partitioning of complex networks.

Most importantly, CDlib, as well as the Python programming language, is free, well-supported, and a joy to use.

Free software

CDlib is free software; you can redistribute it and/or modify it under the terms of the BSD License. We welcome contributions from the community.

EU H2020

CDlib is a result of a European H2020 project:

  • SoBigData “Social Mining & Big Data Ecosystem”: under the scheme “INFRAIA-1-2014-2015: Research Infrastructures”, grant agreement #654024.

Installing CDlib

Before installing CDlib, you need to have setuptools installed.

Quick install

Get CDlib from the Python Package Index (PyPI),

or install it with

pip install cdlib

and an attempt will be made to find and install an appropriate version that matches your operating system and Python version.

You can install the development version with

pip install git+https://github.com/GiulioRossetti/cdlib.git

Optional Dependencies

CDlib relies on a few packages calling C code (namely: python-igraph, leidenalg, angel_cd and infomap). The default installation does not set up such requirements, since their configuration under non-Unix-like systems is not trivial and cannot be easily automated.

Such a choice has been made to allow (even) Windows users to install the library and get access to its core functionalities.

To make (most of) the optional packages available, you can either:

  • (Windows) manually install the optional packages (versions details are specified in requirements_optional.txt) following the original projects guidelines, or
  • (Linux/OSX) run the command:
pip install cdlib[C]

This will install every optional dependency whose setup can be easily automated under Linux/OSX.

(Advanced) Graph-tool

The only optional dependency that will remain unsatisfied following the previous procedures will be graph-tool (used to add SBM models). If you need it up and running, refer to the official documentation.

Installing from source

You can install from source by downloading a source archive file (tar.gz or zip) or by checking out the source files from the GitHub source code repository.

CDlib is a pure Python package; you don’t need a compiler to build or install it.

Source archive file

Download the source archive (tar.gz or zip file) from PyPI, or get the latest development version from GitHub.

Unpack and change directory to the source directory (it should have the files README.txt and setup.py).

Run python setup.py install to build and install

GitHub

Clone the CDlib repository (see GitHub for options)

git clone https://github.com/GiulioRossetti/cdlib.git

Change directory to CDlib

Run python setup.py install to build and install

If you don’t have permission to install software on your system, you can install into another directory using the --user, --prefix, or --home flags to setup.py.

For example

python setup.py install --prefix=/home/username/python

or

python setup.py install --home=~

or

python setup.py install --user

If you didn’t install in the standard Python site-packages directory you will need to set your PYTHONPATH variable to the alternate location. See http://docs.python.org/2/install/index.html#search-path for further details.

Requirements

Python

To use CDlib you need Python 3.6 or later.

The easiest way to get Python and most optional packages is to install the Enthought Python distribution “Canopy” or using Anaconda.

There are several other distributions that contain the key packages you need for scientific computing.

Quick Start

CDlib is a Python library that allows you to extract, compare, and evaluate network partitions. We designed it to be agnostic w.r.t. the data structure used to represent the network to be clustered: all the algorithms it implements accept igraph and networkx objects interchangeably.

Of course, such a choice comes with advantages as well as drawbacks. Here are the main ones you have to be aware of:

Advantages

  • Easy integration of existing/novel (Python implementations of) CD algorithms;
  • Standardization of input and output;
  • Zero-configuration user interface (e.g., you don’t have to reshape your data!)

Drawbacks

  • Algorithm performances are not comparable (execution time, scalability… they all depend on how each algorithm was originally implemented);
  • Memory (in)efficiency: depending on the type of structure each individual algorithm requires, memory consumption could be high;
  • Hidden transformation times: usually not a bottleneck, but moving from one graph representation to another can take “some” time (usually linear in the graph size).

Most importantly, remember that i) each algorithm will be able to handle graphs only up to a given size, and ii) that maximum size may vary greatly across the exposed algorithms.

Tutorial

Extracting communities using CDlib is as easy as this:

from cdlib import algorithms
import networkx as nx
G = nx.karate_club_graph()
coms = algorithms.louvain(G, weight='weight', resolution=1., randomize=False)

Of course, you can choose among all the algorithms available (taking care of specifying the correct parameters): in any case, you’ll get as a result a Clustering object (or a more specific subclass).

Clustering objects expose a set of methods to perform evaluation and comparisons. For instance, to get the partition modularity just write

mod = coms.newman_girvan_modularity()

or, equivalently

from cdlib import evaluation
mod = evaluation.newman_girvan_modularity(G, coms)

Moreover, you can also visualize networks and communities, plot indicators and similarity matrices… just take a look at the module reference to get a few examples.
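For example, a minimal visualization sketch (assuming matplotlib is installed; the layout choice here is illustrative):

from cdlib import algorithms, viz
import networkx as nx
import matplotlib.pyplot as plt

G = nx.karate_club_graph()
coms = algorithms.louvain(G)
pos = nx.spring_layout(G)                # compute a node layout for drawing
viz.plot_network_clusters(G, coms, pos)  # color the nodes of G by community
plt.show()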

I know, plain tutorials are overrated: if you want to explore CDlib functionalities, please start playing around with our interactive Google Colab Notebook!

FAQ

Q1. I developed a novel Community Discovery algorithm/evaluation/visual analytics method and I would like to see it integrated in CDlib. What should I do?

A1. That’s great! Just open an issue on the project GitHub briefly describing the method (provide a link to the paper where it was first introduced) and a link to a Python implementation (if available). We’ll come back to you as soon as possible to discuss the next steps.

Q2. Can you add method XXX to your library?

A2. It depends. Do you have a link to a python implementation/are you willing to help us in implementing it? If so, that’s perfect. If not, well… everything is possible but it is likely that it will require some time.

Q3. I have a clustering obtained by an algorithm not included in CDlib. Can I load it in a Clustering object to leverage the evaluation and visualization facilities of your library?

A3. Yes, you can. Just transform your clustering into a list of lists (e.g., each community represented as a list of node ids) and then create a NodeClustering object from it.

from cdlib import NodeClustering

communities = [[1,2,3], [4,5,6], [7,8,9,10,11]]
coms = NodeClustering(communities, graph=None, method_name="your_method")

Of course, to compute some evaluation scores or plot community networks you will also have to pass the original graph (as an igraph/networkx object) while building the NodeClustering instance.
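For instance, a minimal sketch (the community list here is illustrative; its node ids must match those of the graph):

from cdlib import NodeClustering, evaluation
import networkx as nx

g = nx.karate_club_graph()
communities = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9, 10, 11]]
coms = NodeClustering(communities, graph=g, method_name="your_method")
cond = evaluation.conductance(g, coms)  # fitness scores can now be computed against g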

Reference

CDlib is composed of several modules, each fulfilling a different task related to community detection.

Community Objects

cdlib provides data structures and methods for storing communities.

The choice of community class depends on the structure of the community generated by the selected algorithm.

Which community should I use?
Community Type cdlib class
Node Partition NodeClustering, FuzzyNodeClustering, AttrNodeClustering, BiNodeClustering
Edge Partition EdgeClustering
Dynamic Partition TemporalClustering
Community Types
Node Clustering
Overview
class NodeClustering(communities, graph, method_name='', method_parameters=None, overlap=False)

Node Communities representation.

Parameters:
  • communities – list of communities
  • graph – a networkx/igraph object
  • method_name – community discovery algorithm name
  • method_parameters – configuration for the community discovery algorithm used
  • overlap – boolean, whether the partition is overlapping or not
adjusted_mutual_information(clustering)

Adjusted Mutual Information between two clusterings.

Adjusted Mutual Information (AMI) is an adjustment of the Mutual Information (MI) score to account for chance. It accounts for the fact that the MI is generally higher for two clusterings with a larger number of clusters, regardless of whether there is actually more information shared. For two clusterings \(U\) and \(V\), the AMI is given as:

AMI(U, V) = [MI(U, V) - E(MI(U, V))] / [max(H(U), H(V)) - E(MI(U, V))]

This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

This metric is furthermore symmetric: switching label_true with label_pred will return the same score value. This can be useful to measure the agreement of two independent label assignments strategies on the same dataset when the real ground truth is not known.

Be mindful that this function is an order of magnitude slower than other metrics, such as the Adjusted Rand Index.

Parameters:clustering – NodeClustering object
Returns:AMI score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.adjusted_mutual_information(leiden_communities)
Reference:
  1. Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct), 2837-2854.
adjusted_rand_index(clustering)

Rand index adjusted for chance.

The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings.

The raw RI score is then “adjusted for chance” into the ARI score using the following scheme:

ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)

The adjusted Rand index is thus ensured to have a value close to 0.0 for random labeling independently of the number of clusters and samples and exactly 1.0 when the clusterings are identical (up to a permutation).

ARI is a symmetric measure:

adjusted_rand_index(a, b) == adjusted_rand_index(b, a)
Parameters:clustering – NodeClustering object
Returns:ARI score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.adjusted_rand_index(leiden_communities)
Reference:
  1. Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of classification, 2(1), 193-218.
average_internal_degree(**kwargs)

The average internal degree of the community set.

\[f(S) = \frac{2m_S}{n_S}\]

where \(m_S\) is the number of community internal edges and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.average_internal_degree()
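For instance, the same method can return either the aggregate summary or the per-community values (a short sketch; the result object fields noted in the comment are indicative):

>>> summary = communities.average_internal_degree()              # FitnessResult summary (min, max, avg, std)
>>> scores = communities.average_internal_degree(summary=False)  # list with one score per community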
avg_odf(**kwargs)

Average fraction of edges of a community node that point outside the community itself.

\[\frac{1}{n_S} \sum_{u \in S} \frac{|\{(u,v)\in E: v \not\in S\}|}{d(u)}\]

where \(E\) is the graph edge set, \(v\) is a node in \(S\), \(d(u)\) is the degree of \(u\) and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.avg_odf()
conductance(**kwargs)

Fraction of total edge volume that points outside the community.

\[f(S) = \frac{c_S}{2 m_S+c_S}\]

where \(c_S\) is the number of edges on the community boundary and \(m_S\) is the number of community internal edges

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.conductance()
cut_ratio(**kwargs)

Fraction of existing edges (out of all possible edges) leaving the community.

\[f(S) = \frac{c_S}{n_S (n − n_S)}\]

where \(c_S\) is the number of edges on the community boundary, \(n_S\) is the number of community nodes and \(n\) is the number of graph nodes

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.cut_ratio()
edges_inside(**kwargs)

Number of edges internal to the community.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.edges_inside()
erdos_renyi_modularity()

Erdos-Renyi modularity is a variation of the Newman-Girvan one. It assumes that vertices in a network are connected randomly with a constant probability \(p\).

\[Q(S) = \frac{1}{m}\sum_{c \in S} (m_S − \frac{mn_S(n_S −1)}{n(n−1)})\]

where \(m\) is the number of graph edges, \(m_S\) is the number of community edges, \(l_S\) is the number of edges from nodes in S to nodes outside S.

Returns:the Erdos-Renyi modularity score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.erdos_renyi_modularity()
References:

Erdos, P., & Renyi, A. (1959). On random graphs I. Publ. Math. Debrecen, 6, 290-297.

expansion(**kwargs)

Number of edges per community node that point outside the cluster.

\[f(S) = \frac{c_S}{n_S}\]

where \(c_S\) is the number of edges on the community boundary and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.expansion()
f1(clustering)

Compute the average F1 score of the optimal community matches among the partitions in input. Works on overlapping/non-overlapping complete/partial coverage partitions.

Parameters:clustering – NodeClustering object
Returns:F1 score (harmonic mean of precision and recall)
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.f1(leiden_communities)
Reference:
  1. Rossetti, G., Pappalardo, L., & Rinzivillo, S. (2016). A novel approach to evaluate community detection internal on ground truth. In Complex Networks VII (pp. 133-144). Springer, Cham.
flake_odf(**kwargs)

Fraction of nodes in S that have fewer edges pointing inside than to the outside of the community.

\[f(S) = \frac{| \{ u:u \in S,| \{(u,v) \in E: v \in S \}| < d(u)/2 \}|}{n_S}\]

where \(E\) is the graph edge set, \(v\) is a node in \(S\), \(d(u)\) is the degree of \(u\) and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.flake_odf()
fraction_over_median_degree(**kwargs)

Fraction of community nodes having an internal degree higher than the median degree value.

\[f(S) = \frac{|\{u: u \in S,| \{(u,v): v \in S\}| > d_m\}| }{n_S}\]

where \(d_m\) is the internal degree median value

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.fraction_over_median_degree()
get_description(parameters_to_display=None, precision=3)

Return a description of the clustering, with the name of the method and its numeric parameters.

Parameters:
  • parameters_to_display – parameters to display. By default, all float parameters.
  • precision – precision used to plot parameters. default: 3
Returns:

a string description of the method.
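A brief usage sketch (the exact string returned depends on the algorithm and its parameters):

>>> description = communities.get_description()  # e.g., the method name followed by its float parameters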

internal_edge_density(**kwargs)

The internal density of the community set.

\[f(S) = \frac{m_S}{n_S(n_S−1)/2}\]

where \(m_S\) is the number of community internal edges and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.internal_edge_density()

link_modularity()

Quality function designed for directed graphs with overlapping communities.

Returns:the link modularity score
Example:
>>> from cdlib import evaluation
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.link_modularity()
max_odf(**kwargs)

Maximum fraction of edges of a community node that point outside the community itself.

\[max_{u \in S} \frac{|\{(u,v)\in E: v \not\in S\}|}{d(u)}\]

where \(E\) is the graph edge set, \(v\) is a node in \(S\) and \(d(u)\) is the degree of \(u\)

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.max_odf()
modularity_density()

Modularity density is one of several propositions envisioned to palliate the resolution limit issue of modularity-based measures. The idea of this metric is to include information about community size into the expected density of communities, so as not to neglect small and dense communities. For each community \(C\) in partition \(S\), it uses the average modularity degree, calculated as \(d(C) = d^{int}(C) − d^{ext}(C)\) where \(d^{int}(C)\) and \(d^{ext}(C)\) are the average internal and external degrees of \(C\) respectively, to evaluate the fitness of \(C\) in its network. Finally, the modularity density can be calculated as follows:

\[Q(S) = \sum_{C \in S} \frac{1}{n_C} ( \sum_{i \in C} k^{int}_{iC} - \sum_{i \in C} k^{out}_{iC})\]

where \(n_C\) is the number of nodes in \(C\), \(k^{int}_{iC}\) is the degree of node \(i\) within \(C\) and \(k^{out}_{iC}\) is the degree of node \(i\) outside \(C\).

Returns:the modularity density score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.modularity_density()
References:

Li, Z., Zhang, S., Wang, R. S., Zhang, X. S., & Chen, L. (2008). Quantitative function for community detection. Physical Review E, 77(3), 036109.

newman_girvan_modularity()

Difference between the fraction of intra-community edges of a partition and the expected number of such edges if distributed according to a null model.

In the standard version of modularity, the null model preserves the expected degree sequence of the graph under consideration. In other words, the modularity compares the real network structure with a corresponding one where nodes are connected without any preference about their neighbors.

\[Q(S) = \frac{1}{m}\sum_{c \in S}(m_S - \frac{(2 m_S + l_S)^2}{4m})\]

where \(m\) is the number of graph edges, \(m_S\) is the number of community edges, \(l_S\) is the number of edges from nodes in S to nodes outside S.

Returns:the Newman-Girvan modularity score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.newman_girvan_modularity()
References:

Newman, M.E.J. & Girvan, M. Finding and evaluating community structure in networks. Physical Review E 69, 26113 (2004).

nf1(clustering)

Compute the Normalized F1 score of the optimal community matches among the partitions in input. Works on overlapping/non-overlapping complete/partial coverage partitions.

Parameters:clustering – NodeClustering object
Returns:MatchingResult instance
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.nf1(leiden_communities)
Reference:
  1. Rossetti, G., Pappalardo, L., & Rinzivillo, S. (2016). A novel approach to evaluate community detection internal on ground truth.
  2. Rossetti, G. (2017). RDyn: graph benchmark handling community dynamics. Journal of Complex Networks, 5(6), 893-912.
normalized_cut(**kwargs)

Normalized variant of the Cut-Ratio

\[f(S) = \frac{c_S}{2m_S+c_S} + \frac{c_S}{2(m−m_S)+c_S}\]

where \(m\) is the number of graph edges, \(m_S\) is the number of community internal edges and \(c_S\) is the number of edges on the community boundary.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.normalized_cut()
normalized_mutual_information(clustering)

Normalized Mutual Information between two clusterings.

Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation). In this function, mutual information is normalized by sqrt(H(labels_true) * H(labels_pred))

Parameters:clustering – NodeClustering object
Returns:normalized mutual information score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.normalized_mutual_information(leiden_communities)
omega(clustering)

Index of resemblance for overlapping, complete coverage, network clusterings.

Parameters:clustering – NodeClustering object
Returns:omega index
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.omega(leiden_communities)
Reference:
  1. Gabriel Murray, Giuseppe Carenini, and Raymond Ng. 2012. Using the omega index for evaluating abstractive community detection. In Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization. Association for Computational Linguistics, Stroudsburg, PA, USA, 10-18.
overlapping_normalized_mutual_information_LFK(clustering)

Overlapping Normalized Mutual Information between two clusterings.

Extension of the Normalized Mutual Information (NMI) score to cope with overlapping partitions. This is the version proposed by Lancichinetti et al.

Parameters:clustering – NodeClustering object
Returns:onmi score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.overlapping_normalized_mutual_information_LFK(leiden_communities)
Reference:
  1. Lancichinetti, A., Fortunato, S., & Kertesz, J. (2009). Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11(3), 033015.
overlapping_normalized_mutual_information_MGH(clustering, normalization='max')

Overlapping Normalized Mutual Information between two clusterings.

Extension of the Normalized Mutual Information (NMI) score to cope with overlapping partitions. This is the version proposed by McDaid et al. using a different normalization than the original LFR one. See ref. for more details.

Parameters:
  • clustering – NodeClustering object
  • normalization – one of “max” or “LFK”. Default “max” (corresponds to the main method described in the article)
Returns:

onmi score

Example:
>>> from cdlib import evaluation, algorithms
>>> g = nx.karate_club_graph()
>>> louvain_communities = algorithms.louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> evaluation.overlapping_normalized_mutual_information_MGH(louvain_communities,leiden_communities)
Reference:
  1. McDaid, A. F., Greene, D., & Hurley, N. (2011). Normalized mutual information to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515. Chicago
significance()

Significance estimates how likely it is that a partition of dense communities appears in a random graph.

Returns:the significance score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.significance()
References:

Traag, V. A., Aldecoa, R., & Delvenne, J. C. (2015). Detecting communities using asymptotical surprise. Physical Review E, 92(2), 022816.

size(**kwargs)

Size is the number of nodes in the community

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score

Example:

>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.size()
surprise()

Surprise is a statistical approach that proposes a quality metric assuming that edges between vertices emerge randomly according to a hypergeometric distribution.

According to the Surprise metric, the higher the score of a partition, the less likely it is the result of a random realization, and the better the quality of the community structure.

Returns:the surprise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.surprise()
References:

Traag, V. A., Aldecoa, R., & Delvenne, J. C. (2015). Detecting communities using asymptotical surprise. Physical Review E, 92(2), 022816.

to_json()

Generate a JSON representation of the community object

Returns:a JSON formatted string representing the object
to_node_community_map()

Generate a <node, list(communities)> representation of the current clustering

Returns:dict of the form <node, list(communities)>
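A brief usage sketch (assuming networkx is imported as nx, as in the other examples; the community ids returned depend on the clustering at hand):

>>> from cdlib import algorithms
>>> g = nx.karate_club_graph()
>>> communities = algorithms.louvain(g)
>>> node_to_coms = communities.to_node_community_map()
>>> node_to_coms[0]  # list of the community ids node 0 belongs to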
triangle_participation_ratio(**kwargs)

Fraction of community nodes that belong to a triad.

\[f(S) = \frac{ | \{ u: u \in S,\{(v,w):v, w \in S,(u,v) \in E,(u,w) \in E,(v,w) \in E \} \not = \emptyset \} |}{n_S}\]

where \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.triangle_participation_ratio()
variation_of_information(clustering)

Variation of Information between two node partitions.

$$ H(p)+H(q)-2MI(p, q) $$

where MI is the mutual information, H the partition entropy and p, q are the community sets

Parameters:clustering – NodeClustering object
Returns:VI score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.variation_of_information(leiden_communities)
Reference:
  1. Meila, M. (2007). Comparing clusterings - an information based distance. Journal of Multivariate Analysis, 98, 873-895. doi:10.1016/j.jmva.2006.11.013
z_modularity()

Z-modularity is another variant of the standard modularity, proposed to avoid the resolution limit. The concept of this version is based on the observation that the difference between the fraction of edges inside communities and the expected number of such edges in a null model should not be considered as the only contribution to the final quality of the community structure.

Returns:the z-modularity score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.z_modularity()
References:

Miyauchi, Atsushi, and Yasushi Kawase. Z-score-based modularity for community detection in networks. PloS one 11.1 (2016): e0147805.

Methods
Data transformation and IO
NodeClustering.to_json() Generate a JSON representation of the community object
NodeClustering.to_node_community_map() Generate a <node, list(communities)> representation of the current clustering
Evaluating Node Clustering
NodeClustering.link_modularity() Quality function designed for directed graphs with overlapping communities.
NodeClustering.normalized_cut(**kwargs) Normalized variant of the Cut-Ratio
NodeClustering.internal_edge_density(**kwargs) The internal density of the community set.
NodeClustering.average_internal_degree(**kwargs) The average internal degree of the community set.
NodeClustering.fraction_over_median_degree(…) Fraction of community nodes having an internal degree higher than the median degree value.
NodeClustering.expansion(**kwargs) Number of edges per community node that point outside the cluster.
NodeClustering.cut_ratio(**kwargs) Fraction of existing edges (out of all possible edges) leaving the community.
NodeClustering.edges_inside(**kwargs) Number of edges internal to the community.
NodeClustering.conductance(**kwargs) Fraction of total edge volume that points outside the community.
NodeClustering.max_odf(**kwargs) Maximum fraction of edges of a community node that point outside the community itself.
NodeClustering.avg_odf(**kwargs) Average fraction of edges of a community node that point outside the community itself.
NodeClustering.flake_odf(**kwargs) Fraction of nodes in S that have fewer edges pointing inside than to the outside of the community.
NodeClustering.triangle_participation_ratio(…) Fraction of community nodes that belong to a triad.
NodeClustering.newman_girvan_modularity() Difference between the fraction of intra-community edges of a partition and the expected number of such edges under a null model.
NodeClustering.erdos_renyi_modularity() Erdos-Renyi modularity is a variation of the Newman-Girvan one.
NodeClustering.modularity_density() Modularity density is one of several propositions envisioned to palliate the resolution limit issue of modularity-based measures.
NodeClustering.z_modularity() Z-modularity is another variant of the standard modularity proposed to avoid the resolution limit.
NodeClustering.surprise() Surprise is a statistical approach that proposes a quality metric assuming that edges between vertices emerge randomly according to a hypergeometric distribution.
NodeClustering.significance() Significance estimates how likely it is that a partition of dense communities appears in a random graph.
Comparing Node Clusterings
NodeClustering.normalized_mutual_information(…) Normalized Mutual Information between two clusterings.
NodeClustering.overlapping_normalized_mutual_information_MGH(…) Overlapping Normalized Mutual Information between two clusterings.
NodeClustering.overlapping_normalized_mutual_information_LFK(…) Overlapping Normalized Mutual Information between two clusterings.
NodeClustering.omega(clustering) Index of resemblance for overlapping, complete coverage, network clusterings.
NodeClustering.f1(clustering) Compute the average F1 score of the optimal community matches among the partitions in input.
NodeClustering.nf1(clustering) Compute the Normalized F1 score of the optimal community matches among the partitions in input.
NodeClustering.adjusted_rand_index(clustering) Rand index adjusted for chance.
NodeClustering.adjusted_mutual_information(…) Adjusted Mutual Information between two clusterings.
NodeClustering.variation_of_information(…) Variation of Information between two node partitions.
Fuzzy Node Clustering
Overview
class FuzzyNodeClustering(communities, node_allocation, graph, method_name='', method_parameters=None, overlap=False)

Fuzzy Node Communities representation.

Parameters:
  • communities – list of communities
  • node_allocation – dictionary specifying for each node the allocation of probability toward the communities it is placed in
  • graph – a networkx/igraph object
  • method_name – community discovery algorithm name
  • method_parameters – configuration for the community discovery algorithm used
  • overlap – boolean, whether the partition is overlapping or not
adjusted_mutual_information(clustering)

Adjusted Mutual Information between two clusterings.

Adjusted Mutual Information (AMI) is an adjustment of the Mutual Information (MI) score to account for chance. It accounts for the fact that the MI is generally higher for two clusterings with a larger number of clusters, regardless of whether there is actually more information shared. For two clusterings \(U\) and \(V\), the AMI is given as:

AMI(U, V) = [MI(U, V) - E(MI(U, V))] / [max(H(U), H(V)) - E(MI(U, V))]

This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

This metric is furthermore symmetric: switching label_true with label_pred will return the same score value. This can be useful to measure the agreement of two independent label assignments strategies on the same dataset when the real ground truth is not known.

Be mindful that this function is an order of magnitude slower than other metrics, such as the Adjusted Rand Index.

Parameters:clustering – NodeClustering object
Returns:AMI score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.adjusted_mutual_information(leiden_communities)
Reference:
  1. Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct), 2837-2854.
adjusted_rand_index(clustering)

Rand index adjusted for chance.

The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings.

The raw RI score is then “adjusted for chance” into the ARI score using the following scheme:

ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)

The adjusted Rand index is thus ensured to have a value close to 0.0 for random labeling independently of the number of clusters and samples and exactly 1.0 when the clusterings are identical (up to a permutation).

ARI is a symmetric measure:

adjusted_rand_index(a, b) == adjusted_rand_index(b, a)
Parameters:clustering – NodeClustering object
Returns:ARI score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.adjusted_rand_index(leiden_communities)
Reference:
  1. Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of classification, 2(1), 193-218.
average_internal_degree(**kwargs)

The average internal degree of the community set.

\[f(S) = \frac{2m_S}{n_S}\]

where \(m_S\) is the number of community internal edges and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.average_internal_degree()
avg_odf(**kwargs)

Average fraction of edges of a community node that point outside the community itself.

\[\frac{1}{n_S} \sum_{u \in S} \frac{|\{(u,v)\in E: v \not\in S\}|}{d(u)}\]

where \(E\) is the graph edge set, \(v\) is a node in \(S\), \(d(u)\) is the degree of \(u\) and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.avg_odf()
conductance(**kwargs)

Fraction of total edge volume that points outside the community.

\[f(S) = \frac{c_S}{2 m_S+c_S}\]

where \(c_S\) is the number of edges on the community boundary and \(m_S\) is the number of community internal edges

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.conductance()
cut_ratio(**kwargs)

Fraction of existing edges (out of all possible edges) leaving the community.

\[f(S) = \frac{c_S}{n_S (n − n_S)}\]

where \(c_S\) is the number of edges on the community boundary, \(n_S\) is the number of community nodes and \(n\) is the number of graph nodes

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.cut_ratio()
edges_inside(**kwargs)

Number of edges internal to the community.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.edges_inside()
erdos_renyi_modularity()

Erdos-Renyi modularity is a variation of the Newman-Girvan one. It assumes that vertices in a network are connected randomly with a constant probability \(p\).

\[Q(S) = \frac{1}{m}\sum_{c \in S} (m_S − \frac{mn_S(n_S −1)}{n(n−1)})\]

where \(m\) is the number of graph edges, \(m_S\) is the number of community edges, \(l_S\) is the number of edges from nodes in S to nodes outside S.

Returns:the Erdos-Renyi modularity score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.erdos_renyi_modularity()
References:

Erdos, P., & Renyi, A. (1959). On random graphs I. Publ. Math. Debrecen, 6, 290-297.

expansion(**kwargs)

Number of edges per community node that point outside the cluster.

\[f(S) = \frac{c_S}{n_S}\]

where \(c_S\) is the number of edges on the community boundary and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.expansion()
f1(clustering)

Compute the average F1 score of the optimal community matches among the partitions in input. Works on overlapping/non-overlapping complete/partial coverage partitions.

Parameters:clustering – NodeClustering object
Returns:F1 score (harmonic mean of precision and recall)
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.f1(leiden_communities)
Reference:
  1. Rossetti, G., Pappalardo, L., & Rinzivillo, S. (2016). A novel approach to evaluate community detection internal on ground truth. In Complex Networks VII (pp. 133-144). Springer, Cham.
flake_odf(**kwargs)

Fraction of nodes in S that have fewer edges pointing inside than to the outside of the community.

\[f(S) = \frac{| \{ u:u \in S,| \{(u,v) \in E: v \in S \}| < d(u)/2 \}|}{n_S}\]

where \(E\) is the graph edge set, \(v\) is a node in \(S\), \(d(u)\) is the degree of \(u\) and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.flake_odf()
fraction_over_median_degree(**kwargs)

Fraction of community nodes having an internal degree higher than the median degree value.

\[f(S) = \frac{|\{u: u \in S,| \{(u,v): v \in S\}| > d_m\}| }{n_S}\]

where \(d_m\) is the internal degree median value

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.fraction_over_median_degree()
get_description(parameters_to_display=None, precision=3)

Return a description of the clustering, with the name of the method and its numeric parameters.

Parameters:
  • parameters_to_display – parameters to display. By default, all float parameters.
  • precision – precision used to plot parameters. default: 3
Returns:

a string description of the method.

internal_edge_density(**kwargs)

The internal density of the community set.

\[f(S) = \frac{m_S}{n_S(n_S−1)/2}\]

where \(m_S\) is the number of community internal edges and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.internal_edge_density()

link_modularity()

Quality function designed for directed graphs with overlapping communities.

Returns:the link modularity score
Example:
>>> from cdlib import evaluation
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.link_modularity()
max_odf(**kwargs)

Maximum fraction of edges of a community node that point outside the community itself.

\[max_{u \in S} \frac{|\{(u,v)\in E: v \not\in S\}|}{d(u)}\]

where \(E\) is the graph edge set, \(v\) is a node in \(S\) and \(d(u)\) is the degree of \(u\)

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.max_odf()
modularity_density()

Modularity density is one of several propositions envisioned to palliate the resolution limit issue of modularity-based measures. The idea of this metric is to include information about community size into the expected density of communities, so as not to neglect small and dense communities. For each community \(C\) in partition \(S\), it uses the average modularity degree, calculated as \(d(C) = d^{int}(C) − d^{ext}(C)\) where \(d^{int}(C)\) and \(d^{ext}(C)\) are the average internal and external degrees of \(C\) respectively, to evaluate the fitness of \(C\) in its network. Finally, the modularity density can be calculated as follows:

\[Q(S) = \sum_{C \in S} \frac{1}{n_C} ( \sum_{i \in C} k^{int}_{iC} - \sum_{i \in C} k^{out}_{iC})\]

where \(n_C\) is the number of nodes in \(C\), \(k^{int}_{iC}\) is the degree of node \(i\) within \(C\) and \(k^{out}_{iC}\) is the degree of node \(i\) outside \(C\).

Returns:the modularity density score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.modularity_density()
References:

Li, Z., Zhang, S., Wang, R. S., Zhang, X. S., & Chen, L. (2008). Quantitative function for community detection. Physical Review E, 77(3), 036109.

newman_girvan_modularity()

Difference between the fraction of intra-community edges of a partition and the expected number of such edges if distributed according to a null model.

In the standard version of modularity, the null model preserves the expected degree sequence of the graph under consideration. In other words, the modularity compares the real network structure with a corresponding one where nodes are connected without any preference about their neighbors.

\[Q(S) = \frac{1}{m}\sum_{c \in S}(m_S - \frac{(2 m_S + l_S)^2}{4m})\]

where \(m\) is the number of graph edges, \(m_S\) is the number of community edges, \(l_S\) is the number of edges from nodes in S to nodes outside S.

Returns:the Newman-Girvan modularity score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.newman_girvan_modularity()
References:

Newman, M.E.J. & Girvan, M. Finding and evaluating community structure in networks. Physical Review E 69, 26113 (2004).

nf1(clustering)

Compute the Normalized F1 score of the optimal community matches among the partitions in input. Works on overlapping/non-overlapping complete/partial coverage partitions.

Parameters:clustering – NodeClustering object
Returns:MatchingResult instance
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.nf1(leiden_communities)
Reference:
  1. Rossetti, G., Pappalardo, L., & Rinzivillo, S. (2016). A novel approach to evaluate community detection internal on ground truth.
  2. Rossetti, G. (2017). RDyn: graph benchmark handling community dynamics. Journal of Complex Networks, 5(6), 893-912.
normalized_cut(**kwargs)

Normalized variant of the Cut-Ratio

\[f(S) = \frac{c_S}{2m_S+c_S} + \frac{c_S}{2(m−m_S)+c_S}\]

where \(m\) is the number of graph edges, \(m_S\) is the number of community internal edges and \(c_S\) is the number of edges on the community boundary.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.normalized_cut()
normalized_mutual_information(clustering)

Normalized Mutual Information between two clusterings.

Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation). In this function, mutual information is normalized by sqrt(H(labels_true) * H(labels_pred))

Parameters:clustering – NodeClustering object
Returns:normalized mutual information score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.normalized_mutual_information(leiden_communities)
omega(clustering)

Index of resemblance for overlapping, complete coverage, network clusterings.

Parameters:clustering – NodeClustering object
Returns:omega index
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.omega(leiden_communities)
Reference:
  1. Gabriel Murray, Giuseppe Carenini, and Raymond Ng. 2012. Using the omega index for evaluating abstractive community detection. In Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization. Association for Computational Linguistics, Stroudsburg, PA, USA, 10-18.
overlapping_normalized_mutual_information_LFK(clustering)

Overlapping Normalized Mutual Information between two clusterings.

Extension of the Normalized Mutual Information (NMI) score to cope with overlapping partitions. This is the version proposed by Lancichinetti et al.

Parameters:clustering – NodeClustering object
Returns:onmi score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.overlapping_normalized_mutual_information_LFK(leiden_communities)
Reference:
  1. Lancichinetti, A., Fortunato, S., & Kertesz, J. (2009). Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11(3), 033015.
overlapping_normalized_mutual_information_MGH(clustering, normalization='max')

Overlapping Normalized Mutual Information between two clusterings.

Extension of the Normalized Mutual Information (NMI) score to cope with overlapping partitions. This is the version proposed by McDaid et al. using a different normalization than the original LFR one. See ref. for more details.

Parameters:
  • clustering – NodeClustering object
  • normalization – one of “max” or “LFK”. Default “max” (corresponds to the main method described in the article)
Returns:

onmi score

Example:
>>> from cdlib import evaluation, algorithms
>>> g = nx.karate_club_graph()
>>> louvain_communities = algorithms.louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> evaluation.overlapping_normalized_mutual_information_MGH(louvain_communities,leiden_communities)
Reference:
  1. McDaid, A. F., Greene, D., & Hurley, N. (2011). Normalized mutual information to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515. Chicago
significance()

Significance estimates how likely it is that a partition of dense communities appears in a random graph.

Returns:the significance score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.significance()
References:

Traag, V. A., Aldecoa, R., & Delvenne, J. C. (2015). Detecting communities using asymptotical surprise. Physical Review E, 92(2), 022816.

size(**kwargs)

Size is the number of nodes in the community

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score

Example:

>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.size()
surprise()

Surprise is a statistical approach that proposes a quality metric assuming that edges between vertices emerge randomly according to a hypergeometric distribution.

According to the Surprise metric, the higher the score of a partition, the less likely it is the result of a random realization, and the better the quality of the community structure.

Returns:the surprise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.surprise()
References:

Traag, V. A., Aldecoa, R., & Delvenne, J. C. (2015). Detecting communities using asymptotical surprise. Physical Review E, 92(2), 022816.

to_json()

Generate a JSON representation of the community object

Returns:a JSON formatted string representing the object
to_node_community_map()

Generate a <node, list(communities)> representation of the current clustering

Returns:dict of the form <node, list(communities)>
triangle_participation_ratio(**kwargs)

Fraction of community nodes that belong to a triad.

\[f(S) = \frac{ | \{ u: u \in S,\{(v,w):v, w \in S,(u,v) \in E,(u,w) \in E,(v,w) \in E \} \not = \emptyset \} |}{n_S}\]

where \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.triangle_participation_ratio()
variation_of_information(clustering)

Variation of Information between two node partitions.

$$ H(p)+H(q)-2MI(p, q) $$

where MI is the mutual information, H the partition entropy and p, q are the community sets

Parameters:clustering – NodeClustering object
Returns:VI score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> from cdlib.algorithms import leiden
>>> leiden_communities = leiden(g)
>>> mod = communities.variation_of_information(leiden_communities)
Reference:
  1. Meila, M. (2007). Comparing clusterings - an information based distance. Journal of Multivariate Analysis, 98, 873-895. doi:10.1016/j.jmva.2006.11.013
z_modularity()

Z-modularity is another variant of the standard modularity, proposed to avoid the resolution limit. The concept of this version is based on the observation that the difference between the fraction of edges inside communities and the expected number of such edges in a null model should not be considered as the only contribution to the final quality of the community structure.

Returns:the z-modularity score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.z_modularity()
References:

Miyauchi, Atsushi, and Yasushi Kawase. Z-score-based modularity for community detection in networks. PloS one 11.1 (2016): e0147805.

Methods
Data transformation and IO
FuzzyNodeClustering.to_json() Generate a JSON representation of the clustering object
FuzzyNodeClustering.to_node_community_map() Generate a <node, list(communities)> representation of the current clustering
Evaluating Node Clustering
FuzzyNodeClustering.link_modularity() Quality function designed for directed graphs with overlapping communities.
FuzzyNodeClustering.normalized_cut(**kwargs) Normalized variant of the Cut-Ratio
FuzzyNodeClustering.internal_edge_density(…) The internal density of the community set.
FuzzyNodeClustering.average_internal_degree(…) The average internal degree of the community set.
FuzzyNodeClustering.fraction_over_median_degree(…) Fraction of community nodes having internal degree higher than the median degree value.
FuzzyNodeClustering.expansion(**kwargs) Number of edges per community node that point outside the cluster.
FuzzyNodeClustering.cut_ratio(**kwargs) Fraction of existing edges (out of all possible edges) leaving the community.
FuzzyNodeClustering.edges_inside(**kwargs) Number of edges internal to the community.
FuzzyNodeClustering.conductance(**kwargs) Fraction of total edge volume that points outside the community.
FuzzyNodeClustering.max_odf(**kwargs) Maximum fraction of edges of a community node that point outside the community itself.
FuzzyNodeClustering.avg_odf(**kwargs) Average fraction of edges of a community node that point outside the community itself.
FuzzyNodeClustering.flake_odf(**kwargs) Fraction of nodes in S that have fewer edges pointing inside than to the outside of the community.
FuzzyNodeClustering.triangle_participation_ratio(…) Fraction of community nodes that belong to a triad.
FuzzyNodeClustering.newman_girvan_modularity() Difference between the fraction of intra-community edges of a partition and the expected number of such edges under a null model.
FuzzyNodeClustering.erdos_renyi_modularity() Erdos-Renyi modularity is a variation of the Newman-Girvan one.
FuzzyNodeClustering.modularity_density() The modularity density is one of several propositions envisioned to palliate the resolution limit issue of modularity-based measures.
FuzzyNodeClustering.z_modularity() Z-modularity is another variant of the standard modularity proposed to avoid the resolution limit.
FuzzyNodeClustering.surprise() Surprise is a statistical approach that proposes a quality metric assuming that edges between vertices emerge randomly according to a hypergeometric distribution.
FuzzyNodeClustering.significance() Significance estimates how likely a partition of dense communities is to appear in a random graph.
Attributed Node Clustering
Overview
class AttrNodeClustering(communities, graph, method_name='', coms_labels=None, method_parameters=None, overlap=False)

Attribute Node Communities representation.

Parameters:
  • communities – list of communities
  • graph – a networkx/igraph object
  • method_name – community discovery algorithm name
  • coms_labels – dictionary specifying for each community the frequency of the attribute values
  • method_parameters – configuration for the community discovery algorithm used
  • overlap – boolean, whether the partition is overlapping or not
adjusted_mutual_information(clustering)

Adjusted Mutual Information between two clusterings.

Adjusted Mutual Information (AMI) is an adjustment of the Mutual Information (MI) score to account for chance. It accounts for the fact that the MI is generally higher for two clusterings with a larger number of clusters, regardless of whether there is actually more information shared. For two clusterings \(U\) and \(V\), the AMI is given as:

AMI(U, V) = [MI(U, V) - E(MI(U, V))] / [max(H(U), H(V)) - E(MI(U, V))]

This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

This metric is furthermore symmetric: switching label_true with label_pred will return the same score value. This can be useful to measure the agreement of two independent label assignment strategies on the same dataset when the real ground truth is not known.

Be mindful that this function is an order of magnitude slower than other metrics, such as the Adjusted Rand Index.

Parameters:clustering – NodeClustering object
Returns:AMI score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.adjusted_mutual_information(leiden_communities)
Reference:
  1. Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct), 2837-2854.
adjusted_rand_index(clustering)

Rand index adjusted for chance.

The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings.

The raw RI score is then “adjusted for chance” into the ARI score using the following scheme:

ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)

The adjusted Rand index is thus ensured to have a value close to 0.0 for random labeling independently of the number of clusters and samples and exactly 1.0 when the clusterings are identical (up to a permutation).

ARI is a symmetric measure:

adjusted_rand_index(a, b) == adjusted_rand_index(b, a)
Parameters:clustering – NodeClustering object
Returns:ARI score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.adjusted_rand_index(leiden_communities)
Reference:
  1. Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of classification, 2(1), 193-218.
average_internal_degree(**kwargs)

The average internal degree of the community set.

\[f(S) = \frac{2m_S}{n_S}\]

where \(m_S\) is the number of community internal edges and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.average_internal_degree()
avg_odf(**kwargs)

Average fraction of edges of a community node that point outside the community itself.

\[\frac{1}{n_S} \sum_{u \in S} \frac{|\{(u,v)\in E: v \not\in S\}|}{d(u)}\]

where \(E\) is the graph edge set, \(v\) is a node in \(S\), \(d(u)\) is the degree of \(u\) and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.avg_odf()
conductance(**kwargs)

Fraction of total edge volume that points outside the community.

\[f(S) = \frac{c_S}{2 m_S+c_S}\]

where \(c_S\) is the number of edges on the community boundary and \(m_S\) is the number of community internal edges.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.conductance()
cut_ratio(**kwargs)

Fraction of existing edges (out of all possible edges) leaving the community.

\[f(S) = \frac{c_S}{n_S (n - n_S)}\]

where \(c_S\) is the number of edges on the community boundary, \(n_S\) is the number of community nodes and \(n\) is the number of graph nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.cut_ratio()
edges_inside(**kwargs)

Number of edges internal to the community.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.edges_inside()
erdos_renyi_modularity()

Erdos-Renyi modularity is a variation of the Newman-Girvan one. It assumes that vertices in a network are connected randomly with a constant probability \(p\).

\[Q(S) = \frac{1}{m}\sum_{c \in S} (m_S − \frac{mn_S(n_S −1)}{n(n−1)})\]

where \(m\) is the number of graph edges, \(m_S\) is the number of community edges, \(n_S\) is the number of community nodes and \(n\) is the number of graph nodes.

Returns:the Erdos-Renyi modularity score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.erdos_renyi_modularity()
References:

Erdos, P., & Renyi, A. (1959). On random graphs I. Publ. Math. Debrecen, 6, 290-297.

expansion(**kwargs)

Number of edges per community node that point outside the cluster.

\[f(S) = \frac{c_S}{n_S}\]

where \(c_S\) is the number of edges on the community boundary and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.expansion()
f1(clustering)

Compute the average F1 score of the optimal community matches among the partitions in input. Works on overlapping/non-overlapping complete/partial coverage partitions.

Parameters:clustering – NodeClustering object
Returns:F1 score (harmonic mean of precision and recall)
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.f1(leiden_communities)
Reference:
  1. Rossetti, G., Pappalardo, L., & Rinzivillo, S. (2016). A novel approach to evaluate community detection algorithms on ground truth. In Complex Networks VII (pp. 133-144). Springer, Cham.
flake_odf(**kwargs)

Fraction of nodes in S that have fewer edges pointing inside than to the outside of the community.

\[f(S) = \frac{| \{ u:u \in S,| \{(u,v) \in E: v \in S \}| < d(u)/2 \}|}{n_S}\]

where \(E\) is the graph edge set, \(v\) is a node in \(S\), \(d(u)\) is the degree of \(u\) and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.flake_odf()
fraction_over_median_degree(**kwargs)

Fraction of community nodes having internal degree higher than the median degree value.

\[f(S) = \frac{|\{u: u \in S,| \{(u,v): v \in S\}| > d_m\}| }{n_S}\]

where \(d_m\) is the internal degree median value

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.fraction_over_median_degree()
get_description(parameters_to_display=None, precision=3)

Return a description of the clustering, with the name of the method and its numeric parameters.

Parameters:
  • parameters_to_display – parameters to display. By default, all float parameters.
  • precision – precision used to plot parameters. default: 3
Returns:

a string description of the method.
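Example (no example is given for this helper; a minimal sketch following the same pattern as the other methods):
>>> import networkx as nx
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> communities.get_description()   # string naming the method and its numeric parameters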

internal_edge_density(**kwargs)

The internal density of the community set.

\[f(S) = \frac{m_S}{n_S(n_S−1)/2}\]

where \(m_S\) is the number of community internal edges and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.internal_edge_density()

link_modularity()

Quality function designed for directed graphs with overlapping communities.

Returns:the link modularity score
Example:
>>> from cdlib import evaluation
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.link_modularity()
max_odf(**kwargs)

Maximum fraction of edges of a community node that point outside the community itself.

\[max_{u \in S} \frac{|\{(u,v)\in E: v \not\in S\}|}{d(u)}\]

where \(E\) is the graph edge set, \(v\) is a node in \(S\) and \(d(u)\) is the degree of \(u\)

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.max_odf()
modularity_density()

The modularity density is one of several propositions envisioned to palliate the resolution limit issue of modularity-based measures. The idea of this metric is to include information about community size into the expected density of communities, so that small and dense communities are not neglected. For each community \(C\) in partition \(S\), it uses the average modularity degree calculated by \(d(C) = d^{int(C)} − d^{ext(C)}\), where \(d^{int(C)}\) and \(d^{ext(C)}\) are the average internal and external degrees of \(C\) respectively, to evaluate the fitness of \(C\) in its network. Finally, the modularity density can be calculated as follows:

\[Q(S) = \sum_{C \in S} \frac{1}{n_C} ( \sum_{i \in C} k^{int}_{iC} - \sum_{i \in C} k^{out}_{iC})\]

where \(n_C\) is the number of nodes in \(C\), \(k^{int}_{iC}\) is the degree of node \(i\) within \(C\) and \(k^{out}_{iC}\) is the degree of node \(i\) outside \(C\).

Returns:the modularity density score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.modularity_density()
References:

Li, Z., Zhang, S., Wang, R. S., Zhang, X. S., & Chen, L. (2008). Quantitative function for community detection. Physical Review E, 77(3), 036109.

newman_girvan_modularity()

Difference between the fraction of intra-community edges of a partition and the expected number of such edges under a null model.

In the standard version of modularity, the null model preserves the expected degree sequence of the graph under consideration. In other words, the modularity compares the real network structure with a corresponding one where nodes are connected without any preference about their neighbors.

\[Q(S) = \frac{1}{m}\sum_{c \in S}(m_S - \frac{(2 m_S + l_S)^2}{4m})\]

where \(m\) is the number of graph edges, \(m_S\) is the number of community edges and \(l_S\) is the number of edges from nodes in \(S\) to nodes outside \(S\).

Returns:the Newman-Girvan modularity score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.newman_girvan_modularity()
References:

Newman, M. E. J. & Girvan, M. Finding and evaluating community structure in networks. Physical Review E 69, 026113 (2004).

nf1(clustering)

Compute the Normalized F1 score of the optimal community matches among the partitions in input. Works on overlapping/non-overlapping complete/partial coverage partitions.

Parameters:clustering – NodeClustering object
Returns:MatchingResult instance
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.nf1(leiden_communities)
Reference:
  1. Rossetti, G., Pappalardo, L., & Rinzivillo, S. (2016). A novel approach to evaluate community detection algorithms on ground truth.
  2. Rossetti, G. (2017). RDyn: graph benchmark handling community dynamics. Journal of Complex Networks, 5(6), 893-912.
normalized_cut(**kwargs)

Normalized variant of the Cut-Ratio

\[f(S) = \frac{c_S}{2m_S+c_S} + \frac{c_S}{2(m-m_S)+c_S}\]

where \(m\) is the number of graph edges, \(m_S\) is the number of community internal edges and \(c_S\) is the number of edges on the community boundary.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.normalized_cut()
normalized_mutual_information(clustering)

Normalized Mutual Information between two clusterings.

Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation). In this function, mutual information is normalized by sqrt(H(labels_true) * H(labels_pred)).

Parameters:clustering – NodeClustering object
Returns:normalized mutual information score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.normalized_mutual_information(leiden_communities)
omega(clustering)

Index of resemblance for overlapping, complete coverage, network clusterings.

Parameters:clustering – NodeClustering object
Returns:omega index
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.omega(leiden_communities)
Reference:
  1. Gabriel Murray, Giuseppe Carenini, and Raymond Ng. 2012. Using the omega index for evaluating abstractive community detection. In Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization. Association for Computational Linguistics, Stroudsburg, PA, USA, 10-18.
overlapping_normalized_mutual_information_LFK(clustering)

Overlapping Normalized Mutual Information between two clusterings.

Extension of the Normalized Mutual Information (NMI) score to cope with overlapping partitions. This is the version proposed by Lancichinetti et al.

Parameters:clustering – NodeClustering object
Returns:onmi score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.overlapping_normalized_mutual_information_LFK(leiden_communities)
Reference:
  1. Lancichinetti, A., Fortunato, S., & Kertesz, J. (2009). Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11(3), 033015.
overlapping_normalized_mutual_information_MGH(clustering, normalization='max')

Overlapping Normalized Mutual Information between two clusterings.

Extension of the Normalized Mutual Information (NMI) score to cope with overlapping partitions. This is the version proposed by McDaid et al. using a different normalization than the original LFR one. See ref. for more details.

Parameters:
  • clustering – NodeClustering object
  • normalization – one of “max” or “LFK”. Default “max” (corresponds to the main method described in the article)
Returns:

onmi score

Example:
>>> from cdlib import evaluation, algorithms
>>> g = nx.karate_club_graph()
>>> louvain_communities = algorithms.louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> evaluation.overlapping_normalized_mutual_information_MGH(louvain_communities,leiden_communities)
Reference:
  1. McDaid, A. F., Greene, D., & Hurley, N. (2011). Normalized mutual information to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515. Chicago
purity()

Purity is the product of the frequencies of the most frequent labels carried by the nodes within the communities.

Returns:a FitnessResult object
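Example (a hedged sketch: purity requires an attributed clustering such as the one produced by eva; the exact eva signature used here is an assumption):
>>> import networkx as nx
>>> from cdlib.algorithms import eva
>>> g = nx.karate_club_graph()                          # nodes carry the categorical 'club' attribute
>>> communities = eva(g, labels=['club'], alpha=0.5)    # assumed signature: attribute names + trade-off parameter
>>> pur = communities.purity()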

significance()

Significance estimates how likely a partition of dense communities is to appear in a random graph.

Returns:the significance score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.significance()
References:

Traag, V. A., Aldecoa, R., & Delvenne, J. C. (2015). Detecting communities using asymptotical surprise. Physical Review E, 92(2), 022816.

size(**kwargs)

Size is the number of nodes in the community

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score

Example:

>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.size()
surprise()

Surprise is a statistical approach that proposes a quality metric assuming that edges between vertices emerge randomly according to a hypergeometric distribution.

According to the Surprise metric, the higher the score of a partition, the less likely it is to result from a random realization, and the better the quality of the community structure.

Returns:the surprise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.surprise()
References:

Traag, V. A., Aldecoa, R., & Delvenne, J. C. (2015). Detecting communities using asymptotical surprise. Physical Review E, 92(2), 022816.

to_json()

Generate a JSON representation of the clustering object

Returns:a JSON formatted string representing the object
to_node_community_map()

Generate a <node, list(communities)> representation of the current clustering

Returns:dict of the form <node, list(communities)>
triangle_participation_ratio(**kwargs)

Fraction of community nodes that belong to a triad.

\[f(S) = \frac{ | \{ u: u \in S,\{(v,w):v, w \in S,(u,v) \in E,(u,w) \in E,(v,w) \in E \} \not = \emptyset \} |}{n_S}\]

where \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.triangle_participation_ratio()
variation_of_information(clustering)

Variation of Information between two node partitions.

\[ H(p)+H(q)-2MI(p, q) \]

where MI is the mutual information, H the partition entropy and p, q are the two compared partitions

Parameters:clustering – NodeClustering object
Returns:VI score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.variation_of_information(leiden_communities)
Reference:
  1. Meila, M. (2007). Comparing clusterings - an information based distance. Journal of Multivariate Analysis, 98, 873-895. doi:10.1016/j.jmva.2006.11.013
z_modularity()

Z-modularity is another variant of the standard modularity proposed to avoid the resolution limit. The concept of this version is based on the observation that the difference between the fraction of edges inside communities and the expected number of such edges in a null model should not be considered as the only contribution to the final quality of the community structure.

Returns:the z-modularity score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.z_modularity()
References:

Miyauchi, Atsushi, and Yasushi Kawase. Z-score-based modularity for community detection in networks. PloS one 11.1 (2016): e0147805.

Methods
Evaluating Node Clustering
AttrNodeClustering.purity() Purity is the product of the frequencies of the most frequent labels carried by the nodes within the communities.
Bipartite Node Clustering
Overview
class BiNodeClustering(left_communities, right_communities, graph, method_name='', method_parameters=None, overlap=False)

Bipartite Node Communities representation.

Parameters:
  • left_communities – list of left communities
  • right_communities – list of right communities
  • graph – a networkx/igraph object
  • method_name – community discovery algorithm name
  • method_parameters – configuration for the community discovery algorithm used
  • overlap – boolean, whether the partition is overlapping or not
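A hedged construction sketch based solely on the constructor signature above; the import path and the toy communities are illustrative assumptions:
>>> import networkx as nx
>>> from cdlib import BiNodeClustering            # import path assumed; the class may also live under cdlib.classes
>>> g = nx.complete_bipartite_graph(3, 2)         # left nodes 0-2, right nodes 3-4
>>> left_coms = [[0, 1], [2]]                     # communities over the left node set
>>> right_coms = [[3], [4]]                       # communities over the right node set
>>> bi_coms = BiNodeClustering(left_coms, right_coms, g, method_name='manual')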
adjusted_mutual_information(clustering)

Adjusted Mutual Information between two clusterings.

Adjusted Mutual Information (AMI) is an adjustment of the Mutual Information (MI) score to account for chance. It accounts for the fact that the MI is generally higher for two clusterings with a larger number of clusters, regardless of whether there is actually more information shared. For two clusterings \(U\) and \(V\), the AMI is given as:

AMI(U, V) = [MI(U, V) - E(MI(U, V))] / [max(H(U), H(V)) - E(MI(U, V))]

This metric is independent of the absolute values of the labels: a permutation of the class or cluster label values won’t change the score value in any way.

This metric is furthermore symmetric: switching label_true with label_pred will return the same score value. This can be useful to measure the agreement of two independent label assignment strategies on the same dataset when the real ground truth is not known.

Be mindful that this function is an order of magnitude slower than other metrics, such as the Adjusted Rand Index.

Parameters:clustering – NodeClustering object
Returns:AMI score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.adjusted_mutual_information(leiden_communities)
Reference:
  1. Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct), 2837-2854.
adjusted_rand_index(clustering)

Rand index adjusted for chance.

The Rand Index computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings.

The raw RI score is then “adjusted for chance” into the ARI score using the following scheme:

ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)

The adjusted Rand index is thus ensured to have a value close to 0.0 for random labeling independently of the number of clusters and samples and exactly 1.0 when the clusterings are identical (up to a permutation).

ARI is a symmetric measure:

adjusted_rand_index(a, b) == adjusted_rand_index(b, a)
Parameters:clustering – NodeClustering object
Returns:ARI score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.adjusted_rand_index(leiden_communities)
Reference:
  1. Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of classification, 2(1), 193-218.
average_internal_degree(**kwargs)

The average internal degree of the community set.

\[f(S) = \frac{2m_S}{n_S}\]

where \(m_S\) is the number of community internal edges and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.average_internal_degree()
avg_odf(**kwargs)

Average fraction of edges of a community node that point outside the community itself.

\[\frac{1}{n_S} \sum_{u \in S} \frac{|\{(u,v)\in E: v \not\in S\}|}{d(u)}\]

where \(E\) is the graph edge set, \(v\) is a node in \(S\), \(d(u)\) is the degree of \(u\) and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.avg_odf()
conductance(**kwargs)

Fraction of total edge volume that points outside the community.

\[f(S) = \frac{c_S}{2 m_S+c_S}\]

where \(c_S\) is the number of edges on the community boundary and \(m_S\) is the number of community internal edges.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.conductance()
cut_ratio(**kwargs)

Fraction of existing edges (out of all possible edges) leaving the community.

\[f(S) = \frac{c_S}{n_S (n - n_S)}\]

where \(c_S\) is the number of edges on the community boundary, \(n_S\) is the number of community nodes and \(n\) is the number of graph nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.cut_ratio()
edges_inside(**kwargs)

Number of edges internal to the community.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.edges_inside()
erdos_renyi_modularity()

Erdos-Renyi modularity is a variation of the Newman-Girvan one. It assumes that vertices in a network are connected randomly with a constant probability \(p\).

\[Q(S) = \frac{1}{m}\sum_{c \in S} (m_S − \frac{mn_S(n_S −1)}{n(n−1)})\]

where \(m\) is the number of graph edges, \(m_S\) is the number of community edges, \(n_S\) is the number of community nodes and \(n\) is the number of graph nodes.

Returns:the Erdos-Renyi modularity score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.erdos_renyi_modularity()
References:

Erdos, P., & Renyi, A. (1959). On random graphs I. Publ. Math. Debrecen, 6, 290-297.

expansion(**kwargs)

Number of edges per community node that point outside the cluster.

\[f(S) = \frac{c_S}{n_S}\]

where \(c_S\) is the number of edges on the community boundary and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.expansion()
f1(clustering)

Compute the average F1 score of the optimal community matches among the partitions in input. Works on overlapping/non-overlapping complete/partial coverage partitions.

Parameters:clustering – NodeClustering object
Returns:F1 score (harmonic mean of precision and recall)
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.f1(leiden_communities)
Reference:
  1. Rossetti, G., Pappalardo, L., & Rinzivillo, S. (2016). A novel approach to evaluate community detection algorithms on ground truth. In Complex Networks VII (pp. 133-144). Springer, Cham.
flake_odf(**kwargs)

Fraction of nodes in S that have fewer edges pointing inside than to the outside of the community.

\[f(S) = \frac{| \{ u:u \in S,| \{(u,v) \in E: v \in S \}| < d(u)/2 \}|}{n_S}\]

where \(E\) is the graph edge set, \(v\) is a node in \(S\), \(d(u)\) is the degree of \(u\) and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.flake_odf()
fraction_over_median_degree(**kwargs)

Fraction of community nodes having internal degree higher than the median degree value.

\[f(S) = \frac{|\{u: u \in S,| \{(u,v): v \in S\}| > d_m\}| }{n_S}\]

where \(d_m\) is the internal degree median value

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.fraction_over_median_degree()
get_description(parameters_to_display=None, precision=3)

Return a description of the clustering, with the name of the method and its numeric parameters.

Parameters:
  • parameters_to_display – parameters to display. By default, all float parameters.
  • precision – precision used to plot parameters. default: 3
Returns:

a string description of the method.

internal_edge_density(**kwargs)

The internal density of the community set.

\[f(S) = \frac{m_S}{n_S(n_S−1)/2}\]

where \(m_S\) is the number of community internal edges and \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.internal_edge_density()

link_modularity()

Quality function designed for directed graphs with overlapping communities.

Returns:the link modularity score
Example:
>>> from cdlib import evaluation
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.link_modularity()
max_odf(**kwargs)

Maximum fraction of edges of a community node that point outside the community itself.

\[max_{u \in S} \frac{|\{(u,v)\in E: v \not\in S\}|}{d(u)}\]

where \(E\) is the graph edge set, \(v\) is a node in \(S\) and \(d(u)\) is the degree of \(u\)

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.max_odf()
modularity_density()

The modularity density is one of several propositions envisioned to palliate the resolution limit issue of modularity-based measures. The idea of this metric is to include information about community size into the expected density of communities, so that small and dense communities are not neglected. For each community \(C\) in partition \(S\), it uses the average modularity degree calculated by \(d(C) = d^{int(C)} − d^{ext(C)}\), where \(d^{int(C)}\) and \(d^{ext(C)}\) are the average internal and external degrees of \(C\) respectively, to evaluate the fitness of \(C\) in its network. Finally, the modularity density can be calculated as follows:

\[Q(S) = \sum_{C \in S} \frac{1}{n_C} ( \sum_{i \in C} k^{int}_{iC} - \sum_{i \in C} k^{out}_{iC})\]

where \(n_C\) is the number of nodes in \(C\), \(k^{int}_{iC}\) is the degree of node \(i\) within \(C\) and \(k^{out}_{iC}\) is the degree of node \(i\) outside \(C\).

Returns:the modularity density score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.modularity_density()
References:

Li, Z., Zhang, S., Wang, R. S., Zhang, X. S., & Chen, L. (2008). Quantitative function for community detection. Physical Review E, 77(3), 036109.

newman_girvan_modularity()

Difference between the fraction of intra-community edges of a partition and the expected number of such edges under a null model.

In the standard version of modularity, the null model preserves the expected degree sequence of the graph under consideration. In other words, the modularity compares the real network structure with a corresponding one where nodes are connected without any preference about their neighbors.

\[Q(S) = \frac{1}{m}\sum_{c \in S}(m_S - \frac{(2 m_S + l_S)^2}{4m})\]

where \(m\) is the number of graph edges, \(m_S\) is the number of community edges and \(l_S\) is the number of edges from nodes in \(S\) to nodes outside \(S\).

Returns:the Newman-Girvan modularity score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.newman_girvan_modularity()
References:

Newman, M. E. J. & Girvan, M. Finding and evaluating community structure in networks. Physical Review E 69, 026113 (2004).

nf1(clustering)

Compute the Normalized F1 score of the optimal community matches among the partitions in input. Works on overlapping/non-overlapping complete/partial coverage partitions.

Parameters:clustering – NodeClustering object
Returns:MatchingResult instance
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.nf1(leiden_communities)
Reference:
  1. Rossetti, G., Pappalardo, L., & Rinzivillo, S. (2016). A novel approach to evaluate community detection algorithms on ground truth.
  2. Rossetti, G. (2017). RDyn: graph benchmark handling community dynamics. Journal of Complex Networks, 5(6), 893-912.
normalized_cut(**kwargs)

Normalized variant of the Cut-Ratio

\[f(S) = \frac{c_S}{2m_S+c_S} + \frac{c_S}{2(m-m_S)+c_S}\]

where \(m\) is the number of graph edges, \(m_S\) is the number of community internal edges and \(c_S\) is the number of edges on the community boundary.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.normalized_cut()
normalized_mutual_information(clustering)

Normalized Mutual Information between two clusterings.

Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation). In this function, mutual information is normalized by sqrt(H(labels_true) * H(labels_pred)).

Parameters:clustering – NodeClustering object
Returns:normalized mutual information score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.normalized_mutual_information(leiden_communities)
omega(clustering)

Index of resemblance for overlapping, complete coverage, network clusterings.

Parameters:clustering – NodeClustering object
Returns:omega index
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.omega(leiden_communities)
Reference:
  1. Gabriel Murray, Giuseppe Carenini, and Raymond Ng. 2012. Using the omega index for evaluating abstractive community detection. In Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization. Association for Computational Linguistics, Stroudsburg, PA, USA, 10-18.
overlapping_normalized_mutual_information_LFK(clustering)

Overlapping Normalized Mutual Information between two clusterings.

Extension of the Normalized Mutual Information (NMI) score to cope with overlapping partitions. This is the version proposed by Lancichinetti et al.

Parameters:clustering – NodeClustering object
Returns:onmi score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.overlapping_normalized_mutual_information_LFK(leiden_communities)
Reference:
  1. Lancichinetti, A., Fortunato, S., & Kertesz, J. (2009). Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11(3), 033015.
overlapping_normalized_mutual_information_MGH(clustering, normalization='max')

Overlapping Normalized Mutual Information between two clusterings.

Extension of the Normalized Mutual Information (NMI) score to cope with overlapping partitions. This is the version proposed by McDaid et al. using a different normalization than the original LFR one. See ref. for more details.

Parameters:
  • clustering – NodeClustering object
  • normalization – one of “max” or “LFK”. Default “max” (corresponds to the main method described in the article)
Returns:

onmi score

Example:
>>> from cdlib import evaluation, algorithms
>>> g = nx.karate_club_graph()
>>> louvain_communities = algorithms.louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> evaluation.overlapping_normalized_mutual_information_MGH(louvain_communities,leiden_communities)
Reference:
  1. McDaid, A. F., Greene, D., & Hurley, N. (2011). Normalized mutual information to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515. Chicago
significance()

Significance estimates how likely a partition of dense communities is to appear in a random graph.

Returns:the significance score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.significance()
References:

Traag, V. A., Aldecoa, R., & Delvenne, J. C. (2015). Detecting communities using asymptotical surprise. Physical Review E, 92(2), 022816.

size(**kwargs)

Size is the number of nodes in the community

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score

Example:

>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.size()
surprise()

Surprise is a statistical approach that proposes a quality metric assuming that edges between vertices emerge randomly according to a hypergeometric distribution.

According to the Surprise metric, the higher the score of a partition, the less likely it is to result from a random realization, and the better the quality of the community structure.

Returns:the surprise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.surprise()
References:

Traag, V. A., Aldecoa, R., & Delvenne, J. C. (2015). Detecting communities using asymptotical surprise. Physical Review E, 92(2), 022816.

to_json()

Generate a JSON representation of the clustering object

Returns:a JSON formatted string representing the object
to_node_community_map()

Generate a <node, list(communities)> representation of the current clustering

Returns:dict of the form <node, list(communities)>
triangle_participation_ratio(**kwargs)

Fraction of community nodes that belong to a triad.

\[f(S) = \frac{ | \{ u: u \in S,\{(v,w):v, w \in S,(u,v) \in E,(u,w) \in E,(v,w) \in E \} \not = \emptyset \} |}{n_S}\]

where \(n_S\) is the number of community nodes.

Parameters:summary – (optional, default True) if True, an overall summary is returned for the partition (min, max, avg, std); if False a list of community-wise score
Returns:a FitnessResult object/a list of community-wise score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.triangle_participation_ratio()
variation_of_information(clustering)

Variation of Information between two node partitions.

\[ H(p)+H(q)-2MI(p, q) \]

where MI is the mutual information, H the partition entropy and p, q are the two compared partitions

Parameters:clustering – NodeClustering object
Returns:VI score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> leiden_communities = algorithms.leiden(g)
>>> mod = communities.variation_of_information(leiden_communities)
Reference:
  1. Meila, M. (2007). Comparing clusterings - an information based distance. Journal of Multivariate Analysis, 98, 873-895. doi:10.1016/j.jmva.2006.11.013
z_modularity()

Z-modularity is another variant of the standard modularity proposed to avoid the resolution limit. The concept of this version is based on the observation that the difference between the fraction of edges inside communities and the expected number of such edges in a null model should not be considered as the only contribution to the final quality of the community structure.

Returns:the z-modularity score
Example:
>>> from cdlib.algorithms import louvain
>>> g = nx.karate_club_graph()
>>> communities = louvain(g)
>>> mod = communities.z_modularity()
References:

Miyauchi, Atsushi, and Yasushi Kawase. Z-score-based modularity for community detection in networks. PloS one 11.1 (2016): e0147805.

Edge Clustering
Overview
class EdgeClustering(communities, graph, method_name='', method_parameters=None, overlap=False)

Edge Clustering representation.

Parameters:
  • communities – list of communities
  • graph – a networkx/igraph object
  • method_name – community discovery algorithm name
  • method_parameters – configuration for the community discovery algorithm used
  • overlap – boolean, whether the partition is overlapping or not
get_description(parameters_to_display=None, precision=3)

Return a description of the clustering, with the name of the method and its numeric parameters.

Parameters:
  • parameters_to_display – parameters to display. By default, all float parameters.
  • precision – precision used to plot parameters. default: 3
Returns:

a string description of the method.

to_edge_community_map()

Generate a <edge, list(communities)> representation of the current clustering

Returns:dict of the form <edge, list(communities)>
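A hedged sketch, assuming an edge-clustering method such as hierarchical_link_community is available in cdlib.algorithms:
>>> import networkx as nx
>>> from cdlib import algorithms
>>> g = nx.karate_club_graph()
>>> edge_coms = algorithms.hierarchical_link_community(g)
>>> edge_map = edge_coms.to_edge_community_map()   # dict: (u, v) edge -> list of community ids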
to_json()

Generate a JSON representation of the clustering object

Returns:a JSON formatted string representing the object
Methods
Data transformation and IO
EdgeClustering.to_json() Generate a JSON representation of the clustering object
EdgeClustering.to_edge_community_map() Generate a <edge, list(communities)> representation of the current clustering
Temporal Clustering

TemporalClustering models communities that evolve over time.

Each temporal community clustering observation is a Clustering object, thus it inherits all properties of its specific concrete class.

Overview
class TemporalClustering
add_clustering(clustering, time)

Add to the Temporal Clustering the communities observed at a given time

Parameters:
  • clustering – a Clustering object
  • time – time of observation
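A hedged sketch of how a TemporalClustering can be assembled from per-snapshot clusterings; the snapshot graphs and the top-level import path are illustrative assumptions:
>>> import networkx as nx
>>> from cdlib import TemporalClustering, algorithms
>>> tc = TemporalClustering()
>>> for t in range(3):                        # three illustrative snapshots
...     g_t = nx.erdos_renyi_graph(100, 0.05)
...     coms_t = algorithms.louvain(g_t)
...     tc.add_clustering(coms_t, t)          # register the partition observed at time t
>>> tc.get_observation_ids()                  # [0, 1, 2]
>>> coms_1 = tc.get_clustering_at(1)          # the Clustering object observed at time 1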
add_matching(matching)

Add a precomputed matching of the communities.

Parameters:matching – a list of tuples [(Ti_Ca, Tj_Cb, score), … ]. Community names need to satisfy the pattern {tid}_{cid}, where tid is the time of observation and cid is the position of the community within the Clustering object.
clustering_stability_trend(method)

Returns the trend for community stability. The stability index is computed for temporally adjacent clustering pairs.

Parameters:method – a comparison score taking as input two Clustering objects (e.g., NMI, NF1, ARI…)
Returns:a list of floats
community_matching(method, two_sided=False)

Reconstruct community matches across adjacent observations using a provided similarity function.

Parameters:
  • method – a set similarity function with co-domain in [0,1] (e.g., Jaccard)
  • two_sided – boolean. Whether the match has to be applied only from the past to the future (False, default) or even from the future to the past (True)
Returns:

a list of tuples [(Ti_Ca, Tj_Cb, score), … ]. Community names are assigned following the pattern {tid}_{cid}, where tid is the time of observation and cid is the position of the community within the Clustering object.
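A hedged sketch of matching with a Jaccard similarity over node sets, continuing from a TemporalClustering tc built as in the sketch above:
>>> jaccard = lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b))
>>> matches = tc.community_matching(jaccard, two_sided=True)
>>> matches[:3]   # tuples (Ti_Ca, Tj_Cb, score) following the {tid}_{cid} naming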

get_clustering_at(time)

Returns the clustering observed at a given time

Parameters:time – the time of observation
Returns:a Clustering object
get_community(cid)

Returns the nodes within a given temporal community

Parameters:cid – community id of the form {tid}_{cid}, where tid is the time of observation and cid is the position of the community within the Clustering object.
Returns:list of nodes within cid
get_explicit_community_match()

Return an explicit matching of computed communities (if it exists)

Returns:a list of tuple [(Ti_Ca, Tj_Cb, score), … ]. Community names are assigned following the pattern {tid}_{cid}, where tid is the time of observation and cid is the position of the community within the Clustering object.
get_observation_ids()

Returns the list of temporal ids for the available clusterings.

Returns:a list of temporal ids

has_explicit_match()

Checks if the algorithm provided an explicit match of temporal communities

Returns:a list of tuple [(Ti_Ca, Tj_Cb, score), … ]. Community names are assigned following the pattern {tid}_{cid}, where tid is the time of observation and cid is the position of the community within the Clustering object.
lifecycle_polytree(method=None, two_sided=False)

Reconstruct the poly-tree representing communities lifecycles using a provided similarity function.

Parameters:
  • method – a set similarity function with co-domain in [0,1] (e.g., Jaccard)
  • two_sided – boolean. Whether the match has to be applied only from the past to the future (False, default) or even from the future to the past (True)
Returns:

a networkx DiGraph object. Nodes represent communities, their ids are assigned following the pattern {tid}_{cid}, where tid is the time of observation and cid is the position of the community within the Clustering object.

to_json()

Generate a JSON representation of the TemporalClustering object

Returns:a JSON formatted string representing the object
Methods
Data transformation and IO
TemporalClustering.to_json() Generate a JSON representation of the TemporalClustering object
TemporalClustering.get_observation_ids() Returns the list of temporal ids for the available clusterings
TemporalClustering.get_clustering_at(time) Returns the clustering observed at a given time
TemporalClustering.add_clustering(…) Add to the Temporal Clustering the communities observed at a given time
TemporalClustering.get_community(cid) Returns the nodes within a given temporal community
Evaluating Node Clustering
TemporalClustering.clustering_stability_trend(method) Returns the trend for community stability.
Matching temporal clustering
TemporalClustering.community_matching(method) Reconstruct community matches across adjacent observations using a provided similarity function.
TemporalClustering.lifecycle_polytree([…]) Reconstruct the poly-tree representing communities lifecycles using a provided similarity function.

Community Discovery algorithms

CDlib collects implementations of several Community Discovery algorithms.

To keep the library organization as clean and resilient as possible, the approaches are grouped following a simple, two-level rationale:

  1. The first distinction is made on the object clustered, thus separating Node Clustering and Edge Clustering algorithms;
  2. The second distinction is made on the specific kind of partition each one of them generates: Crisp, Overlapping or Fuzzy.

This documentation follows the same rationale.

Node Clustering

Algorithms falling in this category generate communities composed of nodes. The communities can represent a neat, crisp partition, as well as overlapping or even fuzzy ones.

Note

The following lists are aligned to CD methods available in the GitHub main branch of CDlib.

In particular, the current version of the library on pypl (that can be installed through pip) does not include the following algorithms: belief, ga.

Crisp Communities

A clustering is said to be a partition if each node belongs to one and only one community. Methods in this subclass return a NodeClustering object instance.
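For instance (a minimal sketch mirroring the examples in the class reference above):

>>> import networkx as nx
>>> from cdlib import algorithms
>>> g = nx.karate_club_graph()
>>> coms = algorithms.louvain(g)
>>> coms.communities        # list of node lists, one per crisp community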

agdl(g_original, number_communities, kc) AGDL is a graph-based agglomerative algorithm, for clustering high-dimensional data.
aslpaw(g_original) ASLPAw can be used for disjoint and overlapping community detection and works on weighted/unweighted and directed/undirected networks.
async_fluid(g_original, k) Fluid Communities (FluidC) is based on the simple idea of fluids (i.e., communities) interacting in an environment (i.e., a non-complete graph), expanding and contracting.
belief(g_original[, max_it, eps, …]) Belief community seeks the consensus of many high-modularity partitions.
cpm(g_original[, initial_membership, …]) CPM is a model where the quality function to optimize is:
chinesewhispers(g_original[, weighting, …]) Fuzzy graph clustering that (i) creates an intermediate representation of the input graph, which reflects the “ambiguity” of its nodes, and (ii) uses hard clustering to discover crisp clusters in such “disambiguated” intermediate graph.
der(g_original[, walk_len, threshold, …]) DER is a Diffusion Entropy Reducer graph clustering algorithm.
edmot(g_original[, component_count, cutoff]) The algorithm first creates the graph of higher order motifs.
eigenvector(g_original) Newman’s leading eigenvector method for detecting community structure based on modularity.
em(g_original, k) EM is based on a mixture model.
ga(g_original[, population, generation, r]) Genetic based approach to discover communities in social networks.
gdmp2(g_original[, min_threshold]) Gdmp2 is a method for identifying a set of dense subgraphs of a given sparse graph.
girvan_newman(g_original, level) The Girvan–Newman algorithm detects communities by progressively removing edges from the original graph.
greedy_modularity(g_original[, weight]) The CNM algorithm uses the modularity to find the community structures.
infomap(g_original) Infomap is based on ideas of information theory.
label_propagation(g_original) The Label Propagation algorithm (LPA) detects communities using network structure alone.
leiden(g_original[, initial_membership, weights]) The Leiden algorithm is an improvement of the Louvain algorithm.
louvain(g_original[, weight, resolution, …]) Louvain maximizes a modularity score for each community.
markov_clustering(g_original[, expansion, …]) The Markov clustering algorithm (MCL) is based on simulation of (stochastic) flow in graphs.
rber_pots(g_original[, initial_membership, …]) rber_pots is a model where the quality function to optimize is:
rb_pots(g_original[, initial_membership, …]) Rb_pots is a model where the quality function to optimize is:
scan(g_original, epsilon, mu) SCAN (Structural Clustering Algorithm for Networks) is an algorithm which detects clusters, hubs and outliers in networks.
significance_communities(g_original[, …]) Significance_communities is a model where the quality function to optimize is:
spinglass(g_original) Spinglass relies on an analogy between a very popular statistical mechanic model called Potts spin glass, and the community structure.
surprise_communities(g_original[, …]) Surprise_communities is a model where the quality function to optimize is:
walktrap(g_original) walktrap is an approach based on random walks.
sbm_dl(g_original[, B_min, B_max, deg_corr]) Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models.
sbm_dl_nested(g_original[, B_min, B_max, …]) Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models.
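A minimal usage sketch of the shared interface (louvain on NetworkX's karate club graph; keyword arguments differ from method to method):

    import networkx as nx
    from cdlib import algorithms

    g = nx.karate_club_graph()

    # every crisp method returns a NodeClustering instance
    coms = algorithms.louvain(g, resolution=1.0)

    print(coms.method_name)      # algorithm that produced the clustering
    print(coms.communities[:2])  # communities as lists of node ids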
Overlapping Communities

A clustering is said to be overlapping if a node can be assigned to more than one community. Methods in this subclass return a NodeClustering object instance as result.

angel(g_original, threshold[, …]) Angel is a node-centric bottom-up community discovery algorithm.
big_clam(g_original[, dimensions, …]) BigClam is an overlapping community detection method that scales to large networks.
conga(g_original, number_communities) CONGA (Cluster-Overlap Newman Girvan Algorithm) is an algorithm for discovering overlapping communities.
congo(g_original, number_communities[, height]) CONGO (CONGA Optimized) is an optimization of the CONGA algorithm.
danmf(g_original[, layers, pre_iterations, …]) The procedure uses telescopic non-negative matrix factorization in order to learn a cluster membership distribution over nodes.
demon(g_original, epsilon[, min_com_size]) Demon is a node-centric bottom-up overlapping community discovery algorithm.
ego_networks(g_original[, level]) Ego-networks returns overlapping communities centered at each node within a given radius.
egonet_splitter(g_original[, resolution]) The method first creates the egonets of nodes.
kclique(g_original, k) Find k-clique communities in graph using the percolation method.
lais2(g_original) LAIS2 is an overlapping community discovery algorithm based on the density function.
lemon(g_original, seeds[, min_com_size, …]) Lemon is a large scale overlapping community detection method based on local expansion via minimum one norm.
lfm(g_original, alpha) LFM is based on the local optimization of a fitness function.
multicom(g_original, seed_node) MULTICOM is an algorithm for detecting multiple local communities, possibly overlapping, by expanding the initial seed set.
nmnf(g_original[, dimensions, clusters, …]) The procedure uses joint non-negative matrix factorization with modularity-based regularization in order to learn a cluster membership distribution over nodes.
nnsed(g_original[, dimensions, iterations, seed]) The procedure uses non-negative matrix factorization in order to learn an unnormalized cluster membership distribution over nodes.
node_perception(g_original, threshold, …) Node perception is based on the idea of joining together small sets of nodes.
overlapping_seed_set_expansion(g_original, seeds) OSSE is an overlapping community detection algorithm optimizing the conductance community score. The algorithm uses a seed-set expansion approach; the key idea is to find good seeds and then expand these seed sets using the personalized PageRank clustering procedure.
percomvc(g_original) The PercoMVC approach is composed of two steps.
slpa(g_original[, t, r]) SLPA is an overlapping community discovery algorithm that extends the LPA.
wCommunity(g_original[, min_bel_degree, …]) Algorithm to identify overlapping communities in weighted graphs.
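Overlapping methods follow the same calling convention; a short sketch using demon (the epsilon value here is arbitrary):

    import networkx as nx
    from cdlib import algorithms

    g = nx.karate_club_graph()

    # nodes may appear in more than one community
    coms = algorithms.demon(g, epsilon=0.25, min_com_size=3)
    print(coms.communities)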
Fuzzy Communities

A clustering is said to be fuzzy if each node can belong (with a different degree of likelihood) to more than one community. Methods in this subclass return a FuzzyNodeClustering object instance as result.

frc_fgsn(g_original, theta, eps, r) Fuzzy-Rough Community Detection on Fuzzy Granular model of Social Network.
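A hedged sketch of the fuzzy case (the frc_fgsn parameter values below are purely illustrative); the returned FuzzyNodeClustering also stores per-node membership degrees alongside the usual community lists:

    import networkx as nx
    from cdlib import algorithms

    g = nx.karate_club_graph()

    # theta, eps and r control granule construction; values below are illustrative
    coms = algorithms.frc_fgsn(g, theta=1, eps=0.5, r=3)
    print(coms.communities)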
Node Attribute

Methods in this subclass return an AttrNodeClustering object instance as result.

eva(g_original, labels[, weight, …]) The Eva algorithm extends the Louvain approach in order to deal with the attributes of the nodes (aka Louvain Extended to Vertex Attributes).
ilouvain(g_original, labels, id) The I-Louvain algorithm extends the Louvain approach in order to deal only with the scalar attributes of the nodes.
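A hedged sketch for the attribute-aware case with eva (node labels are generated at random; alpha balances structural modularity against label purity and its value here is arbitrary):

    import random
    import networkx as nx
    from cdlib import algorithms

    g = nx.barabasi_albert_graph(100, 5)

    # attach a categorical attribute to every node
    labels = {n: {"color": random.choice(["blue", "red", "green"])} for n in g.nodes()}

    coms = algorithms.eva(g, labels, alpha=0.8)
    print(coms.communities[:2])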
Bipartite Graph Communities

Methods in this subclass return a BiNodeClustering object instance as result.

bimlpa(g_original[, theta, lambd]) BiMLPA is designed to detect the many-to-many correspondence community in bipartite networks using multi-label propagation algorithm.
CPM_Bipartite(g_original, …[, …]) CPM_Bipartite is the extension of CPM to bipartite graphs
infomap_bipartite(g_original) Infomap is based on ideas of information theory.
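A hedged sketch for the bipartite case (random bipartite graph from NetworkX, bimlpa with its documented defaults):

    from networkx.algorithms import bipartite
    from cdlib import algorithms

    # random bipartite graph: two node sets of 50, edge probability 0.25
    g = bipartite.random_graph(50, 50, 0.25)

    coms = algorithms.bimlpa(g, theta=0.3, lambd=7)
    print(coms.communities[:2])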
Antichain Communities

Methods in this subclass are designed to extract communities from Directed Acyclic Graphs (DAGs) and return a NodeClustering object instance as result.

siblinarity_antichain(g_original[, …]) The algorithm extracts communities from a DAG that (i) respect its intrinsic order and (ii) are composed of similar nodes.
Edge Clustering

Algorithms falling in this category generate communities composed of edges. They return an EdgeClustering object instance as result.

Note

The following lists are aligned to CD methods available in the GitHub main branch of CDlib.

hierarchical_link_community(g_original) HLC (hierarchical link clustering) is a method to classify links into topologically related groups.
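A minimal sketch; note that each returned community is a group of edges rather than nodes:

    import networkx as nx
    from cdlib import algorithms

    g = nx.karate_club_graph()

    coms = algorithms.hierarchical_link_community(g)
    print(coms.communities[0][:5])  # a community is a list of edges (node pairs)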

Finally, CDlib also implements time-aware algorithms (often referred to as Dynamic Community Discovery approaches).

Temporal Clustering

Algorithms falling in this category generate communities that evolve as time goes by.

Temporal Trade-Off

Methods that detect communities at a given time based on the current topology of the graph and on previously found community structures.

tiles(dg[, obs]) TILES is designed to incrementally identify and update communities in stream graphs.
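TILES consumes a dynamic graph rather than a static one; a hedged sketch assuming the optional dynetx package is installed (snapshot sizes and the observation window are arbitrary):

    import networkx as nx
    import dynetx as dn
    from cdlib import algorithms

    # toy stream graph: 5 snapshots of random interactions
    dg = dn.DynGraph()
    for t in range(5):
        g = nx.erdos_renyi_graph(200, 0.05)
        dg.add_interactions_from(list(g.edges()), t=t)

    # obs controls how often the evolving communities are observed
    tc = algorithms.tiles(dg, obs=1)  # returns a TemporalClustering
    print(tc.get_observation_ids())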

Ensemble Methods

Methods to automate the execution of multiple instances of community detection algorithm(s).

Configuration Objects

Ranges can be specified to automate the execution of the same method while varying (part of) its inputs.

Parameter allows specifying ranges for numeric parameters, while BoolParameter covers boolean ones.

Parameter(name, start, end, step)
BoolParameter(name, value)
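For instance (parameter names and ranges below are illustrative, tailored to louvain), a numeric range and a boolean switch can be declared as:

    from cdlib import ensemble

    # explore louvain's resolution from 0.7 to 1 in steps of 0.1
    resolution = ensemble.Parameter(name="resolution", start=0.7, end=1, step=0.1)

    # explore both True and False for a boolean flag
    randomize = ensemble.BoolParameter(name="randomize")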
Multiple Instantiation

Two scenarios often arise when applying community discovery algorithms to a graph:

  1. the need to compare the results obtained by a given algorithm while varying its parameters;
  2. the need to compare multiple algorithms.

CDlib allows doing so by leveraging, respectively, grid_execution and pool.

grid_execution(graph, method, parameters) Instantiate the specified community discovery method performing a grid search on the parameter set.
pool(graph, methods, configurations) Execute a pool of community discovery algorithms on the input graph.
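Reusing the parameter objects above, a hedged sketch of both scenarios (passing an empty configuration list for a parameterless method is an assumption here, and the exact objects yielded may vary slightly across versions):

    import networkx as nx
    from cdlib import algorithms, ensemble

    g = nx.karate_club_graph()
    resolution = ensemble.Parameter(name="resolution", start=0.7, end=1, step=0.1)

    # scenario 1: one method, varying parameters
    for coms in ensemble.grid_execution(graph=g, method=algorithms.louvain, parameters=[resolution]):
        print(coms.method_parameters)

    # scenario 2: several methods, one configuration list per method
    methods = [algorithms.louvain, algorithms.label_propagation]
    for coms in ensemble.pool(g, methods, configurations=[[resolution], []]):
        print(coms.method_name, coms.method_parameters)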

Evaluation

The evaluation of Community Discovery algorithms is not an easy task. CDlib implements two families of evaluation strategies:

  • Internal evaluation through quality scores
  • External evaluation through partitions comparison
Fitness Functions

Fitness functions allow summarizing the characteristics of a computed set of communities. CDlib implements the following quality scores:

avg_distance(graph, communities, **kwargs) Average distance.
avg_embeddedness(graph, communities, **kwargs) Average embeddedness of nodes within the community.
average_internal_degree(graph, community, …) The average internal degree of the community set.
avg_transitivity(graph, communities, **kwargs) Average transitivity.
conductance(graph, community, **kwargs) Fraction of total edge volume that points outside the community.
cut_ratio(graph, community, **kwargs) Fraction of existing edges (out of all possible edges) leaving the community.
edges_inside(graph, community, **kwargs) Number of edges internal to the community.
expansion(graph, community, **kwargs) Number of edges per community node that point outside the cluster.
fraction_over_median_degree(graph, …) Fraction of community nodes having internal degree higher than the median degree value.
hub_dominance(graph, communities, **kwargs) Hub dominance.
internal_edge_density(graph, community, **kwargs) The internal density of the community set.
normalized_cut(graph, community, **kwargs) Normalized variant of the Cut-Ratio
max_odf(graph, community, **kwargs) Maximum fraction of edges of a node of a community that point outside the community itself.
avg_odf(graph, community, **kwargs) Average fraction of edges of a node of a community that point outside the community itself.
flake_odf(graph, community, **kwargs) Fraction of nodes in S that have fewer edges pointing inside than to the outside of the community.
scaled_density(graph, communities, **kwargs) Scaled density.
significance(graph, communities, **kwargs) Significance estimates how likely a partition of dense communities is to appear in a random graph.
size(graph, communities, **kwargs) Size is the number of nodes in the community.
surprise(graph, communities, **kwargs) Surprise is a statistical approach that proposes a quality metric assuming that edges between vertices emerge randomly according to a hypergeometric distribution.
triangle_participation_ratio(graph, …) Fraction of community nodes that belong to a triad.
purity(communities) Purity is the product of the frequencies of the most frequent labels carried by the nodes within the communities.

Among the fitness functions, a well-defined family of measures is the modularity-based one:

erdos_renyi_modularity(graph, communities, …) Erdos-Renyi modularity is a variation of the Newman-Girvan one.
link_modularity(graph, communities, **kwargs) Quality function designed for directed graphs with overlapping communities.
modularity_density(graph, communities[, lmbd]) Modularity density is one of several propositions envisioned to alleviate the resolution limit issue of modularity-based measures.
newman_girvan_modularity(graph, communities, …) Difference between the fraction of intra-community edges of a partition and the expected value of such a fraction under a null model.
z_modularity(graph, communities, **kwargs) Z-modularity is another variant of the standard modularity proposed to avoid the resolution limit.

Some measures return an instance of FitnessResult that collects together the min/max/mean/std values of the computed index.

FitnessResult(min, max, score, std)
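A short sketch: most fitness scores take the graph and a clustering and return a FitnessResult whose fields match the signature above.

    import networkx as nx
    from cdlib import algorithms, evaluation

    g = nx.karate_club_graph()
    coms = algorithms.louvain(g)

    # internal quality score, aggregated over all communities
    cond = evaluation.conductance(g, coms)
    print(cond.score, cond.min, cond.max, cond.std)

    # a member of the modularity-based family
    mod = evaluation.newman_girvan_modularity(g, coms)
    print(mod.score)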
Partition Comparisons

It is often useful to compare different graph partitions to assess their resemblance (i.e., to perform ground-truth testing). CDlib implements the following partition comparison scores:

adjusted_mutual_information(first_partition, …) Adjusted Mutual Information between two clusterings.
adjusted_rand_index(first_partition, …) Rand index adjusted for chance.
f1(first_partition, second_partition) Compute the average F1 score of the optimal community matches among the input partitions.
nf1(first_partition, second_partition) Compute the Normalized F1 score of the optimal community matches among the input partitions.
normalized_mutual_information(…) Normalized Mutual Information between two clusterings.
omega(first_partition, second_partition) Index of resemblance for overlapping, complete coverage, network clusterings.
overlapping_normalized_mutual_information_LFK(…) Overlapping Normalized Mutual Information between two clusterings.
overlapping_normalized_mutual_information_MGH(…) Overlapping Normalized Mutual Information between two clusterings.
variation_of_information(first_partition, …) Variation of Information between two node partitions.

Some measures return an instance of MatchingResult that collects together the mean and standard deviation of the computed index.

MatchingResult(score, std)
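A short sketch comparing two partitions of the same graph; comparison scores return a MatchingResult as defined above.

    import networkx as nx
    from cdlib import algorithms, evaluation

    g = nx.karate_club_graph()
    louvain_coms = algorithms.louvain(g)
    lp_coms = algorithms.label_propagation(g)

    nmi = evaluation.normalized_mutual_information(louvain_coms, lp_coms)
    ari = evaluation.adjusted_rand_index(louvain_coms, lp_coms)
    print(nmi.score, ari.score)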

Input-Output

Functions to save/load CDlib communities to/from file.

CSV format

The easiest way to save the result of a community discovery algorithm is to organize it in a .csv file. The following methods allow reading/writing communities from/to CSV.

read_community_csv(path[, delimiter, …]) Read community list from comma separated value (csv) file.
write_community_csv(communities, path[, …]) Save community structure to comma separated value (csv) file.

Note

CSV formatting only allows saving/retrieving NodeClustering objects, losing most of the metadata present in the CD computation result - e.g., algorithm name, parameters, coverage…
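A minimal round-trip sketch (the nodetype keyword used when reading back is an assumption about the full read_community_csv signature):

    import networkx as nx
    from cdlib import algorithms, readwrite

    g = nx.karate_club_graph()
    coms = algorithms.louvain(g)

    # write, then read back; per the note above, only the raw node lists survive
    readwrite.write_community_csv(coms, "communities.csv", delimiter=",")
    loaded = readwrite.read_community_csv("communities.csv", delimiter=",", nodetype=int)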

JSON format

The JSON format allows storing/loading community discovery results in a more comprehensive way.

read_community_json(path[, zip]) Read community list from JSON file.
write_community_json(communities, path[, zip]) Generate a JSON representation of the clustering object

Note

JSON formatting allows saving/retrieving all kinds of Clustering objects while maintaining their metadata - except for the graph object instance.
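A minimal round-trip sketch; as stated in the note, metadata such as the method name is preserved while the graph instance is not.

    import networkx as nx
    from cdlib import algorithms, readwrite

    g = nx.karate_club_graph()
    coms = algorithms.louvain(g)

    readwrite.write_community_json(coms, "communities.json")
    loaded = readwrite.read_community_json("communities.json")
    print(loaded.method_name)  # metadata preserved; the graph object is not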

Visual Analytics

At the end of the analytical process it is often useful to visualize the obtained results. CDlib provides a few built-in facilities to ease such a task.

Network Visualization

Visualizing a graph is always a good idea (if its size is reasonable).

plot_network_clusters(graph, partition[, …]) Plot a graph with node color coding for communities.
plot_community_graph(graph, partition[, …]) Plot a community-graph with node color coding for communities.
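A short sketch (matplotlib does the actual rendering; the node layout is precomputed with NetworkX):

    import networkx as nx
    import matplotlib.pyplot as plt
    from cdlib import algorithms, viz

    g = nx.karate_club_graph()
    coms = algorithms.louvain(g)
    pos = nx.spring_layout(g)

    viz.plot_network_clusters(g, coms, pos)  # original graph, nodes colored by community
    viz.plot_community_graph(g, coms)        # collapsed view, one node per community
    plt.show()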
Analytics plots

Community evaluation outputs can be easily used to generate a visual representation of the main partition characteristics.

plot_sim_matrix(clusterings, scoring) Plot a similarity matrix between a list of clusterings, using the provided scoring function.
plot_com_stat(com_clusters, com_fitness) Plot the distribution of a property among all communities for a clustering, or a list of clusterings (violin-plots)
plot_com_properties_relation(com_clusters, …) Plot the relation between two properties/fitness function of a clustering
plot_scoring(graphs, ref_partitions, …[, …]) Plot the scores obtained by a list of methods on a list of graphs.
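A hedged sketch comparing two clusterings of the same graph:

    import networkx as nx
    import matplotlib.pyplot as plt
    from cdlib import algorithms, evaluation, viz

    g = nx.karate_club_graph()
    coms1 = algorithms.louvain(g)
    coms2 = algorithms.label_propagation(g)

    # pairwise similarity matrix between the clusterings
    viz.plot_sim_matrix([coms1, coms2], evaluation.adjusted_mutual_information)

    # distribution of a fitness score (community size) across the clusterings
    viz.plot_com_stat([coms1, coms2], evaluation.size)
    plt.show()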

Remote Datasets

CDlib allows retrieving existing datasets, along with their ground-truth partitions (if available), from an ad hoc remote repository.

available_networks() List the remotely available network datasets.
available_ground_truths() List the remotely available network ground truth datasets.
fetch_network_data([net_name, net_type]) Load the required network from the remote repository
fetch_ground_truth_data([net_name, graph]) Load the required ground truth clustering from the remote repository
fetch_network_ground_truth([net_name, net_type]) Load the required network, along with its ground truth partition, from the remote repository.
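A hedged sketch (the dataset name and the net_type value are illustrative; available_networks() reports what is actually hosted):

    from cdlib import datasets

    # discover what is available remotely
    print(datasets.available_networks())
    print(datasets.available_ground_truths())

    # fetch a network together with its ground-truth partition (name below is illustrative)
    g, gt_coms = datasets.fetch_network_ground_truth(net_name="karate_club", net_type="networkx")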

Utilities

CDlib exposes a few utilities to manipulate graph objects generated with igraph and networkx.

Graph Transformation

Transform igraph to/from networkx objects.

convert_graph_formats(graph, desired_format) Converts from/to networkx/igraph
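A minimal sketch (the cdlib.utils module path is assumed; python-igraph must be installed for the igraph direction):

    import networkx as nx
    import igraph as ig
    from cdlib.utils import convert_graph_formats

    g_nx = nx.karate_club_graph()

    # networkx -> igraph, and back again
    g_ig = convert_graph_formats(g_nx, ig.Graph)
    g_back = convert_graph_formats(g_ig, nx.Graph)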
Identifier mapping

Remapping of graph nodes. To limit memory usage, it is often a good idea to use progressive integers as node labels. CDlib automatically - and transparently - makes this conversion for the user; however, this step can be costly. For this reason the library also exposes facilities to directly pre/post-process the network/community data.

nx_node_integer_mapping(graph) Maps node labels from strings to integers.
remap_node_communities(communities, node_map) Apply a map to the obtained communities to retrieve the original node labels.
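A hedged sketch of the pre/post-processing round trip (the cdlib.utils module path and the exact return values are assumptions; node_map is expected to map the new integer ids back to the original labels):

    import networkx as nx
    from cdlib import algorithms
    from cdlib.utils import nx_node_integer_mapping, remap_node_communities

    # a graph whose node labels are strings
    g = nx.les_miserables_graph()

    # pre-process: relabel nodes with progressive integers
    g_int, node_map = nx_node_integer_mapping(g)

    coms = algorithms.louvain(g_int)

    # post-process: map the communities back to the original labels
    original_coms = remap_node_communities(coms.communities, node_map)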

Bibliography

CDlib was developed for research purposes. Here you can find the complete list of papers that contributed to the algorithms and methods it exposes.

Algorithms

Evaluation measures

Research using CDlib

So far, CDlib has been used to support the following research activities:

  • Hubert, M. Master Thesis. (2020) Crawling and Analysing code review networks on industry and open source data
  • Pister, A., Buono, P., Fekete, J. D., Plaisant, C., & Valdivia, P. (2020). Integrating Prior Knowledge in Mixed Initiative Social Network Clustering. arXiv preprint arXiv:2005.02972.
  • Mohammadmosaferi, K. K., & Naderi, H. (2020). Evolution of communities in dynamic social networks: An efficient map-based approach. Expert Systems with Applications, 147, 113221.
  • Cazabet, Remy, Souaad Boudebza, and Giulio Rossetti. “Evaluating community detection algorithms for progressively evolving graphs.” arXiv preprint arXiv:2007.08635 (2020).
  • Citraro, Salvatore, and Giulio Rossetti. “Identifying and exploiting homogeneous communities in labeled networks.” Applied Network Science 5.1 (2020): 1-20.
  • Citraro, Salvatore, and Giulio Rossetti. “Eva: Attribute-Aware Network Segmentation.” International Conference on Complex Networks and Their Applications. Springer, Cham, 2019.
  • Rossetti, Giulio. “ANGEL: efficient, and effective, node-centric community discovery in static and dynamic networks.” Applied Network Science 5.1 (2020): 1-23.
  • Jaiswal, Rajesh, and Sheela Ramanna. “Detecting Overlapping Communities Using Distributed Neighbourhood Threshold in Social Networks.” International Joint Conference on Rough Sets. Springer, Cham, 2020.
  • Rossetti, Giulio. “Exorcising the Demon: Angel, Efficient Node-Centric Community Discovery.” International Conference on Complex Networks and Their Applications. Springer, Cham, 2019.