
ENCORE ensemble similarity

We are happy to announce that the ENCORE ensemble similarity library has been integrated into the upcoming version of MDAnalysis as MDAnalysis.analysis.encore.

ENCORE implements a variety of techniques for calculating similarity measures between structural ensembles in the form of trajectories, as described in:

Tiberti M, Papaleo E, Bengtsen T, Boomsma W, Lindorff-Larsen K (2015), ENCORE: Software for Quantitative Ensemble Comparison. PLoS Comput Biol 11(10): e1004415. doi:10.1371/journal.pcbi.1004415.

The similarity measures are all based on the same fundamental principle: estimating the probability density of protein conformational states from the available ensemble data and comparing such densities using measures of distance between probability distributions, such as the Jensen-Shannon divergence. ENCORE implements three similarity measures: HES (harmonic ensemble similarity), CES (clustering ensemble similarity) and DRES (dimensionality reduction ensemble similarity). In HES the structures of the ensembles are treated as samples from a multivariate normal distribution, whose parameters are estimated from the available data. CES partitions the conformational space of all the ensembles into clusters and uses the relative occurrence of each ensemble in the clusters to estimate the probability density. DRES builds a kernel-density estimate of the ensembles in a dimensionally-reduced version of the conformational space.
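
For reference, the Jensen-Shannon divergence between two discrete distributions P and Q is JSD(P, Q) = [KL(P‖M) + KL(Q‖M)]/2 with M = (P + Q)/2, and (using natural logarithms) ranges from 0 for identical distributions to ln(2) for non-overlapping ones. The following is a minimal NumPy sketch of this quantity, purely illustrative and not part of the ENCORE API:

import numpy as np

def jensen_shannon(p, q):
    # p, q: 1D arrays of probabilities, each summing to 1
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        # Kullback-Leibler divergence; zero-probability bins contribute nothing
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jensen_shannon([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions
print(jensen_shannon([1.0, 0.0], [0.0, 1.0]))  # ln(2) ~ 0.693: disjoint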

The ENCORE package bundles the similarity measures together with a number of other algorithms and features, which are also available standalone. ENCORE implements:

  • the three similarity measures, HES, CES and DRES, each accessible with a single function call
  • trajectory convergence estimation based on CES and DRES, also accessible with a single function call
  • covariance estimators (maximum likelihood and shrinkage), the Affinity Propagation clustering algorithm and the Stochastic Proximity Embedding dimensionality reduction method
  • plug-in interfaces that allow clustering and dimensionality reduction algorithms from the scikit-learn package to be used instead of our own implementations, making it easy to switch algorithms when using the ces and dres functions
  • a parallel multi-core RMSD matrix calculator
  • tools to estimate the robustness of the methods via bootstrapping of the ensembles or similarity matrices. This can be useful to estimate the error associated with the size of the ensembles used as well as to assess the robustness of the implemented algorithms.

Details on implementation, use-cases and expected performance can be found in doi:10.1371/journal.pcbi.1004415.

The HES method is the fastest but least general of the three, as its accuracy depends on how well the probability distribution underlying the ensembles can be modeled as a simple multivariate normal, which is not necessarily guaranteed for simulation trajectories. CES and DRES do not rely on this assumption; however, they both require the calculation of a full RMSD matrix over all the ensembles to be compared, followed by clustering or dimensionality reduction of the conformational space, respectively, and thus have higher requirements in terms of computation time and memory.

Using the similarity measures is simply a matter of loading the trajectories or experimental ensembles that one would like to compare as MDAnalysis.Universe objects:

>>> from MDAnalysis import Universe
>>> import MDAnalysis.analysis.encore as encore
>>> from MDAnalysis.tests.datafiles import PSF, DCD, DCD2
>>> u1 = Universe(PSF, DCD)
>>> u2 = Universe(PSF, DCD2)

and running the similarity measures on them, for instance the Harmonic Ensemble Similarity measure (encore.hes()):

>>> hes_similarities, details = encore.hes([u1, u2])
>>> print(hes_similarities)
[[        0.         38279683.9587939]
 [ 38279683.9587939         0.       ]]

Similarities are returned as a square symmetric matrix with the same dimensions and ordering as the input list, each element being the similarity value for a pair of the input ensembles. Other available measures are CES (encore.ces()) and DRES (encore.dres()). The details variable contains extra information about the calculation that has been performed: with HES, the parameters of the estimated probability distributions; with CES, the output of the clustering; with DRES, the embedded space.
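
The other two measures are invoked in exactly the same way; for instance (output omitted here, and DRES values will vary slightly between runs because its embedding is stochastic):

>>> ces_similarities, ces_details = encore.ces([u1, u2])
>>> dres_similarities, dres_details = encore.dres([u1, u2])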

The clustering and dimensionality reduction functionality is also directly available through the cluster and reduce_dimensionality functions.

For instance, to cluster the conformations from the two universes defined above, we can write:

>>> cluster_collection = encore.cluster([u1,u2])
>>> print(cluster_collection)
0 (size:5,centroid:1): array([ 0,  1,  2,  3, 98])
1 (size:5,centroid:6): array([4, 5, 6, 7, 8])
2 (size:7,centroid:12): array([ 9, 10, 11, 12, 13, 14, 15])

Here each cluster element is a conformation belonging to one of the ensembles; for each element, the cluster_collection object keeps track of both the standard cluster membership information and the ensemble the element belongs to, making it possible to evaluate how the different trajectories are represented in each cluster.
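
As a sketch of what this makes possible, the following counts how many conformations in each cluster come from each universe (assuming, as for the test trajectories above, that u1 contributes the first block of element indices; attribute names follow the ClusterCollection documentation):

>>> n_frames_u1 = len(u1.trajectory)
>>> for cluster in cluster_collection:
...     from_u1 = sum(element < n_frames_u1 for element in cluster.elements)
...     print(cluster.id, from_u1, cluster.size - from_u1)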

By default ENCORE uses our implementation of the Affinity Propagation algorithm, but the user can switch to any of the clustering algorithms available in scikit-learn, which are automatically made available in ENCORE when scikit-learn is installed.

For instance:

>>> cluster_collection = encore.cluster(
...     [ens1, ens2],
...     method=encore.DBSCAN())

In the same way, it is possible to use a dimensionality reduction algorithm other than the default Stochastic Proximity Embedding:

>>> coordinates, details = encore.reduce_dimensionality(
...     [ens1, ens2],
...     method=encore.PrincipalComponentAnalysis(dimension=2))

Similar options in encore.ces() and encore.dres() make it easy to change the algorithm used by these methods on the fly.
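
For instance (keyword names here follow the encore documentation for ces() and dres()):

>>> ces_sim, details = encore.ces(
...     [u1, u2],
...     clustering_method=encore.DBSCAN())
>>> dres_sim, details = encore.dres(
...     [u1, u2],
...     dimensionality_reduction_method=encore.PrincipalComponentAnalysis(dimension=2))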

For further details, see the documentation of the individual functions within ENCORE.

@mtiberti & @kain88-de

Support for MMTF has arrived in MDAnalysis!

Macromolecular Transmission Format support

The upcoming 0.16.0 release of MDAnalysis will have support for MMTF! MMTF is a new format designed to provide compact, efficient and fast browsing of the Protein Data Bank. Support for MMTF within MDAnalysis is offered through reading locally stored MMTF files, or through fetching the file directly from the PDB archive through the new MDAnalysis.fetch_mmtf function.

import MDAnalysis as mda

# Load a local file
u = mda.Universe('myfile.mmtf')

# Or download directly from the PDB by providing a PDB ID
u = mda.fetch_mmtf('3J3Q')

The performance of loading MMTF files is a large improvement over traditional ASCII PDB files, with the above system of approximately 2.4M atoms taking under 10 seconds to load. The compressed format and efficient algorithms for storing the data mean that downloading structures also requires much less bandwidth, making this possible even on slow connections.
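
If you want to check this on your own machine, here is a quick timing sketch (it times the download as well as the parsing, so the result also depends on your connection):

import time
import MDAnalysis as mda

start = time.time()
u = mda.fetch_mmtf('3J3Q')  # HIV-1 capsid, approximately 2.4M atoms
print("Loaded {} atoms in {:.1f} s".format(len(u.atoms), time.time() - start))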

MMTF files can contain many different models for a given structure, and these are made available through the .models attribute of an MDAnalysis Universe. This provides a list of AtomGroup objects, each representing a different model; the models may each have a different topology.

from __future__ import print_function
import MDAnalysis as mda

u = mda.fetch_mmtf('4P3R')

print("This file has {} models".format(len(u.models)))

# Iterate over all models
for model in u.models:
    # analyse each model, for example report its size
    print(len(model))

# Select atoms in a given model
ag = u.select_atoms('model 4 and name Ca')

Finally, full interoperability between different formats is provided in MDAnalysis, allowing MMTF files to be written to any of our supported formats. For example, to download an MMTF file and write it out to a Gromacs GRO file:

import MDAnalysis as mda

u = mda.fetch_mmtf('4AKE')

u.atoms.write('4ake.gro')

Trying this out today

These features will all be in the upcoming 0.16.0 release of MDAnalysis, but if you can’t wait for that, it is also possible to install the latest development version. For full instructions on how to install the development version, see our guide on installing MDAnalysis from source.
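
In short, one typical route looks like this (a hedged sketch assuming git and pip are available; the Python package lives in the package/ subdirectory of the repository, and the installation guide remains the authoritative reference):

git clone https://github.com/MDAnalysis/mdanalysis.git
pip install mdanalysis/package/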

@richardjgowers

Release 0.15.0

We have just released MDAnalysis version 0.15.0. This release contains new features as well as bug fixes. Highlights are listed below but for more details see the release notes.

We also had a lot of contributions from GSoC applicants. Thanks to our GSoC students @fiona-naughton and @jdetle as well as the other applicants @saxenauts, @endle, @abhinavgupta94 and @pedrishi.

Upgrade

You can upgrade with pip install --upgrade MDAnalysis

Noticeable Changes

Deprecations in anticipation of new topology system

The next release (0.16.0) will bring a very big change to the internal workings of MDAnalysis. The topology system (how atom, residue, segment attributes are stored and manipulated) has been completely redesigned, giving many advantages and resolving a number of longstanding issues. The new system also brings speed (up to 40x faster) and memory (up to 60% less usage) improvements, and makes the core of MDAnalysis easier to maintain for the future. You can read more about this at the original issue, or see a short summary of the new system on the wiki.

In preparation for this change, we have introduced deprecation warnings for all components of the existing topology system suggesting the corresponding usage under the new system. Please adjust existing code as you encounter these warnings.
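
Python often silences DeprecationWarning by default, so when porting code it can help to surface these warnings explicitly. A minimal sketch using only the standard library:

import warnings

# Show every deprecation warning; use "error" instead of "always" to turn
# them into exceptions and get a traceback to the offending call site.
warnings.simplefilter("always", DeprecationWarning)

import MDAnalysis as mda
# ... your existing analysis code; deprecated topology access now warns loudly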

Revamped Contact Analysis

Contact analysis has been completely rewritten in the new Contacts class. This class offers a standard native-contact analysis as well as the contact analysis developed by Best & Hummer. A Q1-Q2 analysis is now available directly as q1q2. We have also made Contacts extensible so that you can pass it your own cut-off functions; the q1q2 analysis is in fact only a wrapper around Contacts that makes use of this flexibility (see the sketch below). More information can be found in the documentation.
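
As a hedged sketch of that extension point (keyword and attribute names follow the 0.15 Contacts documentation and may differ in later releases; the selections and radius are illustrative only):

import MDAnalysis as mda
from MDAnalysis.analysis import contacts
from MDAnalysis.tests.datafiles import PSF, DCD

u = mda.Universe(PSF, DCD)

def step_cutoff(r, r0, **kwargs):
    # Custom cut-off function: r holds the current distances, r0 the
    # distances in the reference structure; return the fraction preserved.
    return (r < r0).sum() / float(r.size)

selA, selB = 'name CA and resid 1-30', 'name CA and resid 40-70'
refA, refB = u.select_atoms(selA), u.select_atoms(selB)

ca = contacts.Contacts(u, selection=(selA, selB), refgroup=(refA, refB),
                       radius=8.0, method=step_cutoff).run()
print(ca.timeseries)  # per-frame fraction of reference contacts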

The old ContactAnalysis1 and ContactAnalysis classes will be removed in the next release.

RMSD Calculation

The rmsd function no longer superimposes the given coordinates by default. The coordinates are now left unchanged unless you request otherwise with the new center and superposition keywords.
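
For example, using the test trajectory (rmsd lives in MDAnalysis.analysis.rms):

import MDAnalysis as mda
from MDAnalysis.analysis.rms import rmsd
from MDAnalysis.tests.datafiles import PSF, DCD

u = mda.Universe(PSF, DCD)
first = u.atoms.positions.copy()   # coordinates of the first frame
u.trajectory[-1]                   # jump to the last frame
last = u.atoms.positions

print(rmsd(first, last))           # new default: coordinates used as-is
print(rmsd(first, last, center=True, superposition=True))  # old behaviour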

PDB Format

Our own PDB parser has seen a lot of love over the last year. It has been the default for a long time now, and problems that occur with PDB files are fixed only in this parser. Because of that, we have removed the Biopython PDB parser; this means the permissive keyword argument for Universe is no longer used.

We have also spent time tuning the performance of the PDB trajectory reader. Reading long PDB trajectories is now significantly faster; the exact speed-up depends on the length of the trajectory. For 1000 frames we measured a speed-up of 10000%. See PR #849 and Issue #848 for more details.

Additionally we have added .ent files to the list of supported PDB file formats.
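
.ent is simply the extension the PDB archive uses for its entry files, so such files can now be loaded directly (the filename here is hypothetical):

import MDAnalysis as mda

u = mda.Universe('pdb4ake.ent')  # parsed with the regular PDB reader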

Other Changes

A list of all changes can be found in the CHANGELOG.