Clustering

clustering frontend — mdaencore.clustering.cluster

The module defines a function serving as front-end for various clustering algorithms, wrapping them to allow them to be used interchangably.

mdaencore.clustering.cluster.cluster(ensembles, method=<mdaencore.clustering.ClusteringMethod.AffinityPropagationNative object>, select='name CA', distance_matrix=None, allow_collapsed_result=True, ncores=1, **kwargs)[source]

Cluster frames from one or more ensembles, using one or more clustering methods. The function optionally takes pre-calculated distances matrices as an argument. Note that not all clustering procedure can work directly on distance matrices, so the distance matrices might be ignored for particular choices of method.

Parameters:
  • ensembles (MDAnalysis.Universe, or list, or list of list thereof) – The function takes either a single Universe object, a list of Universe objects or a list of lists of Universe objects. If given a single universe, it simply clusters the conformations in the trajectory. If given a list of ensembles, it will merge them and cluster them together, keeping track of the ensemble to which each of the conformations belong. Finally, if passed a list of list of ensembles, the function will just repeat the functionality just described - merging ensembles for each ensemble in the outer loop.

  • method (encore.ClusteringMethod or list thereof, optional) – A single or a list of instances of the Clustering classes from the clustering module. A separate analysis will be run for each method. Note that different parameters for the same clustering method can be explored by adding different instances of the same clustering class.

  • select (str, optional) – Atom selection string in the MDAnalysis format. Default is “name CA”

  • distance_matrix (encore.utils.TriangularMatrix or list thereof, optional) – Distance matrix used for clustering. If this parameter is not supplied the matrix will be calculated on the fly. If several distance matrices are supplied, an analysis will be done for each of them. The number of provided distance matrices should match the number of provided ensembles.

  • allow_collapsed_result (bool, optional) – Whether a return value of a list of one value should be collapsed into just the value.

  • ncores (int, optional) – Maximum number of cores to be used (default is 1).

Returns:

  • list of ClustersCollection objects (or potentially a single

  • ClusteringCollection object if allow_collapsed_result is set to True)

Example

Two ensembles are created as Universe object using a topology file and two trajectories. The topology- and trajectory files used are obtained from the MDAnalysis test suite for two different simulations of the protein AdK. Here, we cluster two ensembles

>>> from MDAnalysis import Universe
>>> import mdaencore as encore
>>> from MDAnalysis.tests.datafiles import PSF, DCD, DCD2
>>> ens1 = Universe(PSF, DCD)
>>> ens2 = Universe(PSF, DCD2)
>>> cluster_collection = encore.cluster([ens1,ens2])
>>> print cluster_collection

You can change the parameters of the clustering method by explicitly specifying the method

>>> cluster_collection =
        encore.cluster(
             [ens1,ens2],
             method=encore.AffinityPropagationNative(preference=-2.))

Here is an illustration using DBSCAN algorithm, instead of the default clustering method

>>> cluster_collection =
        encore.cluster(
             [ens1,ens2],
             method=encore.DBSCAN())

You can also combine multiple methods in one call

>>> cluster_collection =
        encore.cluster(
             [ens1,ens2],
             method=[encore.AffinityPropagationNative(preference=-1.),
                     encore.AffinityPropagationNative(preference=-2.)])

In addition to standard cluster membership information, the cluster_collection output keep track of the origin of each conformation, so you check how the different trajectories are represented in each cluster. Here, for brevity, we print just the members of the two first clusters

>>> print [cluster.metadata["ensemble_membership"]
             for cluster in cluster_collection][:2]
[array([1, 1, 1, 1, 2]), array([1, 1, 1, 1, 1])]

Cluster representation — mdaencore.clustering.ClusterCollection

The module contains the Cluster and ClusterCollection classes which are designed to store results from clustering algorithms.

class mdaencore.clustering.ClusterCollection.Cluster(elem_list=None, centroid=None, idn=None, metadata=None)[source]

Generic Cluster class for clusters with centroids.

Variables:
  • id (int) – Cluster ID number. Useful for the ClustersCollection class

  • metadata (iterable) – dict of lists or numpy.array, containing metadata for the cluster elements. The iterable must return the same number of elements as those that belong to the cluster.

  • size (int) – number of elements.

  • centroid (element object) – cluster centroid.

  • elements (numpy.array) – array containing the cluster elements.

class mdaencore.clustering.ClusterCollection.ClusterCollection(elements=None, metadata=None)[source]

Clusters collection class; this class represents the results of a full clustering run. It stores a group of clusters defined as encore.clustering.Cluster objects.

Variables:

clusters (list) – list of of Cluster objects which are part of the Cluster collection

get_centroids()[source]

Get the centroids of the clusters

Returns:

  • centroids (list of cluster element objects)

  • list of cluster centroids

get_ids()[source]

Get the ID numbers of the clusters

Returns:

  • ids (list of int)

  • list of cluster ids

clustering frontend — mdaencore.clustering.ClusteringMethod

The module defines classes for interfacing to various clustering algorithms. One has been implemented natively, and will always be available, while others are available only if scikit-learn is installed.

class mdaencore.clustering.ClusteringMethod.AffinityPropagation(damping=0.9, preference=-1.0, max_iter=500, convergence_iter=50, **kwargs)[source]

Interface to the Affinity propagation clustering procedure implemented in sklearn.

class mdaencore.clustering.ClusteringMethod.AffinityPropagationNative(damping=0.9, preference=-1.0, max_iter=500, convergence_iter=50, add_noise=True)[source]

Interface to the natively implemented Affinity propagation procedure.

class mdaencore.clustering.ClusteringMethod.ClusteringMethod[source]

Base class for any Clustering Method

class mdaencore.clustering.ClusteringMethod.DBSCAN(eps=0.5, min_samples=5, algorithm='auto', leaf_size=30, **kwargs)[source]

Interface to the DBSCAN clustering procedure implemented in sklearn.

class mdaencore.clustering.ClusteringMethod.KMeans(n_clusters, max_iter=300, n_init=10, init='k-means++', algorithm='auto', tol=0.0001, verbose=False, random_state=None, copy_x=True, **kwargs)[source]
accepts_distance_matrix = False

Interface to the KMeans clustering procedure implemented in sklearn.

mdaencore.clustering.ClusteringMethod.encode_centroid_info(clusters, cluster_centers_indices)[source]

Adjust cluster indices to include centroid information as described in documentation for ClusterCollection

Clustering algorithms

The following clustering algorithms are always available:

Cython wrapper for the C implementation of the Affinity Perturbation clustering algorithm.

mdaencore.clustering.affinityprop.AffinityPropagation(s, preference, lam, max_iterations, convergence, noise=1)

Affinity propagation clustering algorithm. This class is a Cython wrapper around the Affinity propagation algorithm, which is implement as a C library (see ap.c). The implemented algorithm is described in the paper:

Clustering by Passing Messages Between Data Points. Brendan J. Frey and Delbert Dueck, University of Toronto Science 315, 972–976, February 2007

Parameters:
  • s (encore.utils.TriangularMatrix object) – Triangular matrix containing the similarity values for each pair of clustering elements. Notice that the current implementation does not allow for asymmetric values (i.e. similarity(a,b) is assumed to be equal to similarity(b,a))

  • preference (numpy.array of floats or float) – Preference values, which the determine the number of clusters. If a single value is given, all the preference values are set to that. Otherwise, the list is used to set the preference values (one value per element, so the list must be of the same size as the number of elements)

  • lam (float) – Floating point value that defines how much damping is applied to the solution at each iteration. Must be ]0,1]

  • max_iterations (int) – Maximum number of iterations

  • convergence (int) – Number of iterations in which the cluster centers must remain the same in order to reach convergence

  • noise (int) – Whether to apply noise to the input s matrix, such there are no equal values. 1 is for yes, 0 is for no.

Returns:

elements – List of cluster-assigned elements, which can be used by encore.utils.ClustersCollection to generate Cluster objects. See these classes for more details.

Return type:

list of int or None