ENCORE ensemble similarity
17 Dec 2016

We are happy to announce that the ENCORE ensemble similarity library has been integrated into the next version of MDAnalysis as MDAnalysis.analysis.encore.
ENCORE implements a variety of techniques for calculating similarity measures between structural ensembles in the form of trajectories, as described in:
Tiberti M, Papaleo E, Bengtsen T, Boomsma W, Lindorff-Larsen K (2015), ENCORE: Software for Quantitative Ensemble Comparison. PLoS Comput Biol 11(10): e1004415. doi:10.1371/journal.pcbi.1004415.
The similarity measures are based on the same fundamental principle: estimating the probability density of the conformational states of proteins from the available ensemble data and comparing such densities using measures of distance between probability distributions, such as the Jensen-Shannon divergence. ENCORE implements three similarity measures: HES (Harmonic Ensemble Similarity), CES (Clustering Ensemble Similarity) and DRES (Dimensionality Reduction Ensemble Similarity). In HES the structures of the ensembles are treated as samples from a multivariate normal distribution, whose parameters are estimated from the available data. CES partitions the conformational space of all the ensembles into clusters and uses the relative occurrence of each ensemble in the clusters to estimate the probability density. DRES estimates the probability density with a kernel-density estimate computed on a dimensionally-reduced version of the conformational space.
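For reference, the Jensen-Shannon divergence between two distributions P and Q is the average of their Kullback-Leibler divergences from the mixture M = (P + Q)/2; it is symmetric and bounded by ln(2). The following is a minimal, illustrative numpy sketch for discrete distributions (the jensen_shannon function is hypothetical and only meant to show the idea; ENCORE performs this calculation internally on the estimated densities):

>>> import numpy as np
>>> def jensen_shannon(p, q):
...     # p and q: discrete probability distributions with strictly positive entries
...     m = 0.5 * (p + q)                              # mixture distribution
...     kl = lambda a, b: np.sum(a * np.log(a / b))    # Kullback-Leibler divergence
...     return 0.5 * kl(p, m) + 0.5 * kl(q, m)
...
>>> jensen_shannon(np.array([0.6, 0.4]), np.array([0.3, 0.7]))  # ~0.046 nats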
The ENCORE package provides the similarity measures themselves together with a number of other algorithms and features, each also available standalone. ENCORE implements:
- the three similarity measures, HES, CES and DRES, each accessible with a single function call
- trajectory convergence estimation based on CES and DRES, also accessible with a single function call (see the sketch after this list)
- covariance estimators (maximum likelihood and shrinkage), the Affinity Propagation clustering algorithm, and the Stochastic Proximity Embedding dimensionality reduction method
- plug-in interfaces that allow clustering and dimensionality reduction algorithms from the scikit-learn package to be used in place of our own implementations, making it easy to switch between algorithms when using the ces and dres functions
- a parallel multi-core RMSD matrix calculator
- tools to estimate the robustness of the methods via bootstrapping of the ensembles or similarity matrices, which is useful both to estimate the error associated with the finite size of the ensembles and to assess the robustness of the implemented algorithms (an example follows below)
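As a sketch of the convergence estimation (assuming the ces_convergence and dres_convergence functions with a window size in frames, as described in the ENCORE documentation, and a Universe u1 loaded as shown later in this post), one can compare windows of increasing length against the full trajectory:

>>> # similarity of growing 10-frame windows of u1 to the full trajectory;
>>> # values approaching zero indicate convergence
>>> ces_conv = encore.ces_convergence(u1, 10)
>>> dres_conv = encore.dres_convergence(u1, 10)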
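Error estimation via bootstrapping can be requested directly from the similarity functions. A minimal sketch, assuming the estimate_error and bootstrapping_samples keyword arguments documented for encore.hes(), which then returns averages and standard deviations over the bootstrap runs (u1 and u2 are Universe objects loaded as shown below):

>>> hes_avgs, hes_stds = encore.hes([u1, u2],
...                                 estimate_error=True,
...                                 bootstrapping_samples=50)  # per the documented API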
Details on implementation, use cases and expected performance can be found in the paper cited above (doi:10.1371/journal.pcbi.1004415).
The HES method is the fastest and least general of the three, as its accuracy depends on how well the probability distribution underlying each ensemble can be modeled as a simple multivariate normal, which is not necessarily guaranteed for simulation trajectories. CES and DRES do not rely on this assumption; however, they both require the calculation of a full RMSD matrix over all the ensembles to be compared, followed by clustering or dimensionality reduction, respectively, on the conformational space, and thus have higher computation time and memory requirements.
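Since CES and DRES both start from the same conformational RMSD matrix, when comparing the same ensembles with both measures it can pay off to compute the matrix once and reuse it. A minimal sketch, assuming the get_distance_matrix() and utils.merge_universes() helpers and the distance_matrix keyword argument described in the ENCORE documentation (u1 and u2 are the Universe objects loaded below):

>>> # compute the RMSD matrix over the concatenated ensembles once...
>>> rmsd_matrix = encore.get_distance_matrix(
...     encore.utils.merge_universes([u1, u2]))
>>> # ...and feed it to both measures
>>> ces_sim, ces_details = encore.ces([u1, u2], distance_matrix=rmsd_matrix)
>>> dres_sim, dres_details = encore.dres([u1, u2], distance_matrix=rmsd_matrix)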
Using the similarity measures is simply a matter of loading the trajectories or experimental ensembles that one would like to compare as MDAnalysis.Universe objects:
>>> from MDAnalysis import Universe
>>> import MDAnalysis.analysis.encore as encore
>>> from MDAnalysis.tests.datafiles import PSF, DCD, DCD2
>>> u1 = Universe(PSF, DCD)
>>> u2 = Universe(PSF, DCD2)
and running the similarity measures on them, for instance the Harmonic Ensemble Similarity measure (encore.hes()):
>>> hes_similarities, details = encore.hes([u1, u2])
>>> print hes_similarities
[[ 0. 38279683.9587939]
[ 38279683.9587939 0. ]]
Similarities are written in a square symmetric matrix having the same dimensions and ordering as the input list, with each element being the similarity value for a pair of the input ensembles. Other available measures are CES (encore.ces()) and DRES (encore.dres()).
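CES and DRES follow the same calling convention as HES; for example, with their default settings:

>>> ces_similarities, details = encore.ces([u1, u2])
>>> dres_similarities, details = encore.dres([u1, u2])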
The details variable contains extra information about the calculation that has been performed: with HES, it contains the parameters of the estimated probability distributions; with CES, it contains the output of clustering; with DRES, it contains the embedded space.
The clustering and dimensionality reduction functionality is also directly available through the cluster and reduce_dimensionality functions.
For instance, to cluster the conformations from the two universes defined above, we can write:
>>> cluster_collection = encore.cluster([u1,u2])
>>> print cluster_collection
0 (size:5,centroid:1): array([ 0, 1, 2, 3, 98])
1 (size:5,centroid:6): array([4, 5, 6, 7, 8])
2 (size:7,centroid:12): array([ 9, 10, 11, 12, 13, 14, 15])
…
Here each cluster element is a conformation belonging to one of the ensembles; the cluster_collection object keeps track, for each element, of both the standard cluster membership information and the ensemble it belongs to, making it possible to evaluate how the different trajectories are represented in each cluster.
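As a minimal sketch of how such an evaluation might look (assuming, as the printed output above suggests, that the ClusterCollection is iterable and that each Cluster exposes id, size and elements attributes, with elements indexing conformations of the concatenated ensembles in input order):

>>> n1 = len(u1.trajectory)  # elements with index < n1 come from u1
>>> for cluster in cluster_collection:
...     from_u1 = sum(1 for e in cluster.elements if e < n1)
...     print cluster.id, from_u1, cluster.size - from_u1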
By default ENCORE uses our own implementation of the Affinity Propagation algorithm, but the user can switch to any of the clustering algorithms available in scikit-learn, which are automatically made available in ENCORE when scikit-learn is installed.
For instance:
>>> cluster_collection = encore.cluster([u1, u2],
...                                     method=encore.DBSCAN())
In the same way, it is possible to use a dimensionality reduction algorithm other than the default Stochastic Proximity Embedding:
>>> coordinates, details = encore.reduce_dimensionality(
...     [u1, u2],
...     method=encore.PrincipalComponentAnalysis(dimension=2))
Similar options in encore.ces() and encore.dres() make it easy to change the algorithm used by these measures on the fly.
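A minimal sketch, assuming the clustering_method and dimensionality_reduction_method keyword arguments described in the ENCORE documentation:

>>> ces_sim, _ = encore.ces([u1, u2],
...                         clustering_method=encore.DBSCAN())
>>> dres_sim, _ = encore.dres([u1, u2],
...                           dimensionality_reduction_method=encore.PrincipalComponentAnalysis(dimension=2))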
For further details, see the documentation of the individual functions within ENCORE.