
Release 0.16.0

We have just released MDAnalysis version 0.16.0. This release contains new features as well as bug fixes. Highlights are listed below, but for more details see the release notes.

This is the biggest release we have ever had, with tons of new features. It includes a rewrite of the topology system, the work of our GSoC students Fiona Naughton (@fiona-naughton) and John Detlefs (@jdetle), a completely new ensemble analysis module (encore), and much more. In total, 28 people contributed 1904 commits to this release. We closed 162 issues and merged 199 pull requests.

Upgrade

You can upgrade with pip install --upgrade MDAnalysis. If you use the conda package manager, run conda update -c conda-forge mdanalysis.

Notable Changes

You can find a notebook with code examples of the changes here.

Rewrite of our internal topology representation

This is the change we are most excited about in this release. It brings performance enhancements, makes the code easier to maintain and extend, and allowed us to simplify the interface. We have previously written about the details of the new topology system.

With this change, a number of deprecated functions have also been removed. For all of the deprecated functions, replacements already exist, and you will get a warning with a suggested change. The easiest way to check whether your scripts will run without problems after the update is to include the following at the top of your script.

import warnings
# Always show warnings from MDAnalysis, including deprecation warnings
warnings.filterwarnings('always', module='MDAnalysis')

This will print a warning for every deprecated function you are using, together with a short code snippet showing how your code has to be changed. If you do not want to upgrade your existing scripts, we have posted a guide on how to use conda and Python environments to run different versions of MDAnalysis on the same computer.

Attach arbitrary time series to your trajectories

Our GSoC student @fiona-naughton has implemented an auxiliary reader to add arbitrary time series to a universe. The time series are kept in sync with the trajectory, so it is possible to iterate through the trajectory and access the auxiliary data corresponding to the current time step.

import MDAnalysis as mda
from MDAnalysisTests.datafiles import PDB_sub_sol, XTC_sub_sol, XVG_BZ2

# Create your universe as usual
universe = mda.Universe(PDB_sub_sol, XTC_sub_sol)
# Attach an auxiliary time series with the name `forces`.
# In this example, the XVG file contains the force applied to each atom.
universe.trajectory.add_auxiliary('forces', XVG_BZ2)
# Iterate through your trajectory; the time series is kept in sync
for time_step in universe.trajectory:
    print(time_step.aux.forces)
# The first element of each array is the time in picoseconds.
# The next elements are the other columns of the XVG file.

@fiona-naughton worked on offering several convenient ways to iterate through your data. Read the documentation or Fiona's blog posts to learn more about the feature.
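For example, iter_as_aux restricts iteration to the trajectory frames that have a matching auxiliary step. A minimal sketch based on the auxiliary documentation:

import MDAnalysis as mda
from MDAnalysisTests.datafiles import PDB_sub_sol, XTC_sub_sol, XVG_BZ2

universe = mda.Universe(PDB_sub_sol, XTC_sub_sol)
universe.trajectory.add_auxiliary('forces', XVG_BZ2)

# Only visit trajectory frames for which an auxiliary step exists
for time_step in universe.trajectory.iter_as_aux('forces'):
    print(time_step.aux.forces)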

This feature is still in its infancy and will be expanded in future releases. You can follow the conversation on the initial issue or on the pull request. So far, only the XVG format used by GROMACS and Grace is supported. Open an issue if you need support for other time series formats.

Do a dimension reduction with PCA and Diffusion Maps

@jdetle has implemented two new dimension reduction algorithms, Principal Component Analysis (PCA) and Diffusion Maps. Both can be found in the analysis submodule. As an example, let's look at the first two PCA dimensions of ADK from our test files.

import matplotlib.pyplot as plt
import MDAnalysis as mda
from MDAnalysis.analysis.pca import PCA
from MDAnalysisTests.datafiles import PSF, DCD

plt.style.use('ggplot')

u = mda.Universe(PSF, DCD)
ca = u.select_atoms('protein and name CA')
pca = PCA(u, select='protein and name CA', verbose=True).run()
reduced_data = pca.transform(ca, n_components=2)

f, ax = plt.subplots()
ax.plot(reduced_data[:, 0], reduced_data[:, 1], 'o')
ax.set(xlabel=r'PC$_1$ [$\AA$]', ylabel=r'PC$_2$ [$\AA$]', title='PCA of ADK')

PCA projection
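Diffusion Maps follow the same analysis-class pattern. Below is a minimal sketch assuming the MDAnalysis.analysis.diffusionmap.DiffusionMap API; the epsilon value here is an arbitrary choice:

import matplotlib.pyplot as plt
import MDAnalysis as mda
from MDAnalysis.analysis.diffusionmap import DiffusionMap
from MDAnalysisTests.datafiles import PSF, DCD

u = mda.Universe(PSF, DCD)
# epsilon sets the width of the diffusion kernel
dmap = DiffusionMap(u, select='protein and name CA', epsilon=2).run()

# inspect the eigenvalue spectrum to choose how many eigenvectors to keep
f, ax = plt.subplots()
ax.plot(dmap.eigenvalues[1:16], 'o-')
ax.set(xlabel='eigenvector index', ylabel='eigenvalue', title='Diffusion Map spectrum of ADK')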

Convenience functions to create a new analysis

A while back we introduced a new framework for analysis to unify the API of the different analysis methods we offer. With this release we are also adding a new class, AnalysisFromFunction, to make it easier to calculate observables from a simulation. Code like this, with a handwritten loop:

import numpy as np

result = []
for ts in u.trajectory:
    result.append(u.atoms.center_of_geometry())
result = np.asarray(result)

can now be converted into this:

from MDAnalysis.analysis.base import AnalysisFromFunction

cog = AnalysisFromFunction(lambda ag: ag.center_of_geometry(), u.atoms).run()
cog.results

This class also takes arguments to adjust the iteration (start, stop, step), and you can turn on progress output with verbose=True. You will also benefit from any future performance improvements in the analysis class without changing your code. If you have a specific observable that you want to calculate several times, you can also create a new analysis class with analysis_class, like this:

from MDAnalysis.analysis.base import analysis_class

def cog(ag):
    return ag.center_of_geometry()

COG = analysis_class(cog)

cog_results = COG(u.atoms, step=2, verbose=True).run()

Speed improvements in RMSD

Thanks to work from our NSF REU student @rbrtdlgd, our RMSD calculations are now about 40% faster. If you use the low-level qcprot algorithm yourself instead of our provided wrappers, you will have to change your code since the API has changed. For more details, see the release notes.
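If you only use the high-level wrappers, your code keeps working unchanged. A minimal sketch with MDAnalysis.analysis.rms.rmsd (the center and superposition keywords are assumed from the 0.16 release notes):

import MDAnalysis as mda
from MDAnalysis.analysis.rms import rmsd
from MDAnalysisTests.datafiles import PSF, DCD

u = mda.Universe(PSF, DCD)
ca = u.select_atoms('protein and name CA')

reference = ca.positions.copy()  # C-alpha coordinates of the first frame
u.trajectory[-1]                 # jump to the last frame

# optimally superimpose onto the reference before computing the RMSD
print(rmsd(ca.positions, reference, center=True, superposition=True))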

MemoryReader: Reading trajectories from memory

MDAnalysis typically reads trajectories from files on demand, so that it can efficiently deal with large trajectories - even those that do not fit in memory. However, in some cases, both for convenience and for efficiency, it can be advantageous to work with trajectories directly in memory. In this release, we have introduced a MemoryReader which makes this possible. This reader was originally implemented in the encore package.

The MemoryReader works with numpy arrays, using the same format as that used by, for instance, DCDReader.timeseries(). You can create a Universe directly from such an array:

import numpy as np
from MDAnalysis import Universe
from MDAnalysisTests.datafiles import DCD, PSF
from MDAnalysis.coordinates.memory import MemoryReader

# Create a Universe using a DCD reader
universe = Universe(PSF, DCD)

# Create a numpy array with random coordinates (100 frames) for the same topology
coordinates = np.random.uniform(size=(100, universe.atoms.n_atoms, 3)).cumsum(0)

# Create a new Universe directly from these coordinates
universe2 = Universe(PSF, coordinates, format=MemoryReader)

The MemoryReader works just like any other reader. In particular, you can iterate over it as usual, or use the .timeseries() method to retrieve a reference to the raw array:

coordinates_fac = universe2.trajectory.timeseries(format='fac')

Certain operations can be sped up by moving a trajectory to memory, and we have therefore added functionality to directly transfer any existing trajectory to a MemoryReader using Universe.transfer_to_memory:

universe = Universe(PSF, DCD)
# Switches to a MemoryReader representation
universe.transfer_to_memory() 

You can also do this directly upon construction of a Universe, by using the in_memory flag:

universe = Universe(PSF, DCD, in_memory=True)

Likewise, the AlignTraj class in the MDAnalysis.analysis.align module also has an in_memory flag, allowing it to do in-place alignments.
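For example, a minimal sketch of an in-place alignment (the selection string is an arbitrary choice):

import MDAnalysis as mda
from MDAnalysis.analysis.align import AlignTraj
from MDAnalysisTests.datafiles import PSF, DCD

u = mda.Universe(PSF, DCD)
ref = mda.Universe(PSF, DCD)  # use the first frame as the reference

# with in_memory=True the trajectory is aligned in place; no file is written
AlignTraj(u, ref, select='protein and name CA', in_memory=True).run()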

Others

Since the start of the year, we have also blogged about features of this release.

Minor Enhancements

  • No more deprecation warning spam when MDAnalysis is imported
  • analysis.align has a new AlignTraj class following the analysis class style
  • all new analysis classes now print additional information with verbose=True
  • RMSD has been ported to the new analysis class style

Other Changes

A list of all changes can be found in the CHANGELOG.

Managing software versioning using Conda environments

Research projects can take months or years to complete, but the precise versions of the software they use often have a shorter lifetime. New versions of software often include useful new features, but they may also introduce changes that break your existing work and cause compatibility issues with other pieces of software. So while for newer projects we might want the most up-to-date versions of software, for existing projects we want to be able to freeze the versions of the software in use. This means we need to install and manage multiple versions of software across our various research projects.

With the upcoming 0.16.0 release of MDAnalysis, alongside various improvements, we are also introducing some changes which could break existing code. In this post we explain how conda environments or Python virtual environments can be used to manage this transition, allowing you to finish existing projects with version 0.15.0 while also enjoying the benefits of version 0.16.0.

Conda Environments

Conda is a general package manager for scientific applications. It is mostly used for Python packages, but the system can be used with any program. The conda-forge community also provides a large collection of scientific software for Python, R, Perl, and others.

In this guide we will concentrate only on creating and managing environments with conda. For more general information on installing conda please refer to the official documentation.

Software is made available through different conda channels, each of which acts as a source of packages. When you install packages into a conda environment, these channels are searched. In this post we will be using the conda-forge channel, which you can add to your configuration like so:

conda config --add channels conda-forge

For each research project, it is advised that you create a new environment so that the software used in one project does not interfere with that of another. To create a new environment for your next project that uses MDAnalysis version 0.15.0, run:

conda create -n myproject mdanalysis=0.15.0 -y

This has created a new software environment called myproject, but has not affected anything currently! To have access to all the software installed within it, we must first activate it:

source activate myproject

To list your available environments:

conda env list

A nice feature of conda environments is that they are easy to share with colleagues or to transfer to other computers. This allows all collaborators on a project to use an identical set of software and makes your research projects reproducible. To store the state of the environment we created in a file called myproject-environment:

conda list --explicit --md5 -n myproject > myproject-environment

You can now copy this file to a colleague or onto another computer. The first 3 lines also contain instructions on how this file can be used with conda.

# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64

More information about conda environments can be found in the official documentation.

Python Virtual Environments

Like conda, virtual environments managed with virtualenv allow you to use different versions of Python and Python packages for your different projects. Unlike conda, however, virtualenv is not a general-purpose package manager; it leverages what is available on your system and lets you install Python packages using pip.

To use virtual environments you have to install the virtualenv package first. This can be done with either pip or the package manager of your system:

pip install virtualenv
# or on ubuntu
sudo apt install virtualenv
# or on fedora
sudo dnf install python-virtualenv

Virtual environments can be created per project directory:

cd myproject
# if you want to use the default python interpreter
virtualenv myproject-env
# or if you want to explicitly use python 2.7
virtualenv -p /usr/bin/python2.7 myproject-env
# or if you want to use python 3.5
virtualenv -p /usr/bin/python3.5 myproject-env
# or maybe you want to use pypy
virtualenv -p /usr/bin/pypy myproject-env

This will create a new folder, myproject-env, which contains the virtual environment and all the packages you install into it. To activate it in the current terminal, run:

source myproject-env/bin/activate

Now you can install packages via pip without affecting your global environment. Packages installed while the environment is activated will only be available in terminal sessions that have the environment activated. You can deactivate the virtual environment by running

deactivate

The virtualenvwrapper package makes virtual environments easier to use. It provides some very useful features:

  • it organizes the virtual environments in a single directory, avoiding having them scattered throughout the file system;
  • it defines commands that ease the creation, deletion, and copying of virtual environments;
  • it defines a command to activate a virtual environment using its name;
  • all commands defined by virtualenvwrapper have tab-completion for virtual environment names.

To use virtualenvwrapper you first need to install it outside of a virtual environment:

pip install virtualenvwrapper
# or on ubuntu
sudo apt install virtualenvwrapper
# or on fedora
sudo dnf install python-virtualenvwrapper

Then, you need to have it loaded in your terminal session. Add the following lines to ~/.bashrc; they will be executed every time you open a new terminal session:

# Decide where to store the virtual environments
export WORKON_HOME=~/Envs
# Make sure the directory exists
mkdir -p ${WORKON_HOME}
# Load virtualenvwrapper
source /usr/local/bin/virtualenvwrapper.sh

Open a new terminal or run source ~/.bashrc to update your session. You can now create a virtual environment with

mkvirtualenv my-project

Like the virtualenv command we saw earlier, mkvirtualenv lets you choose your Python interpreter with the -p option. Regardless of your current working directory, the virtual environment is created in ~/Envs/, and it is now loaded in your terminal session.

You can load your virtual environments by running workon my-project, and exit them by running deactivate.

Virtual environments, especially with virtualenvwrapper, can do much more. The Hitchhiker's Guide to Python has a good tutorial that gives a more in-depth explanation of virtual environments. The virtualenvwrapper documentation is also a good resource to read.

A shiny, new and faster topology system

With MDAnalysis 0.16.0 on the horizon, we wanted to showcase a major development. In fall 2015, we (@richardjgowers and @dotsdl) set to work on redesigning the topology system from scratch. This system determines how atom, residue, and segment information is internally represented and exposed to everything in the API (Universe, AtomGroup, etc.), and the old scheme had issues with data duplication, maintaining consistency between atom and residue attributes, and performance for large systems. We hoped to resolve all of these issues with our new design.

The starting point of this work was (the now infamous) issue 363, which floated the idea of holding all atom, residue, and segment attributes in arrays instead of lists of Atom, Residue, and Segment objects. This approach turned the way topology data such as atom names, resids, masses, etc. are stored in a Universe on its head, going from an array of structs (list of Atom objects with individual attributes) to a struct of arrays (an array for each attribute, one entry per Atom).

Now, over a year later, the finishing touches on this work are being prepared for release. This post is meant to serve as a brief overview of what has changed internally, what has changed externally, and what benefits this gives us looking forward.

Invisible changes to make working with MDAnalysis faster

Most of the changes are (or should be) invisible to the user, but they make some of the most fundamental operations in MDAnalysis quite a bit faster. Although this section is mostly of interest to developers, it is useful for all users to know which operations MDAnalysis can now do much faster than before (and why).

In the new system, each atom is a member of exactly one residue, and each residue is a member of exactly one segment. The new Topology object keeps an array giving the residue membership of each atom, and likewise an array giving the segment membership of each residue. Getting the resnames for a group of atoms is then achieved by using the indices of these atoms to fancy-index the Atoms->Residues array, and then using the result to fancy-index the Resnames array. For example, if the Topology has 5 atoms and 3 residues, with membership (Atoms->Residues) and Resnames arrays as below:

       Atoms->Residues           Resnames
 index ---------------     index --------
     0 0                       0 GLU
     1 2                       1 LYS
     2 1                       2 ALA
     3 1
     4 2

calling AtomGroup.resnames for an AtomGroup with atoms [2, 0, 1, 2] will yield (pseudocode):

"Atoms->Residues"[[2, 0, 1, 2]] --> [1, 0, 2, 1]
"Resnames"[[1, 0, 2, 1]]        --> ['LYS', 'GLU', 'ALA', 'LYS']

This scheme only works if each atom is a member of one and only one residue, and likewise if residues are members of one and only one segment. Furthermore, AtomGroups, ResidueGroups, and SegmentGroups are very thin, storing only the indices of their members as a numpy array. This gives a number of advantages:

  1. Performance. We get up to an 8x speedup over the old scheme when accessing attributes. Setting attributes can give up to a 40x speedup.
  2. Memory. We don’t store, for example, a resname for each atom, but instead store attributes at the level they make sense for.
  3. Consistency. Since attributes are stored in one place, we avoid cases where the topology is in an inconsistent state, e.g. two atoms in the same residue give a different resname.
  4. No staleness. Because e.g. ResidueGroups are only an array of indices, not a list of Residue objects generated upon creation of the group, changes to residue-level properties made through one ResidueGroup are always reflected consistently by every other one. Data is not duplicated anywhere in this scheme, and is all contained in the Topology object.
  5. Serialization. Topologies become serializable and changes to topologies can be easily saved and communicated around. This is an important step towards implementing parallel algorithms in MDAnalysis.

For further performance comparisons, check out this notebook.

External changes that may affect how you use MDAnalysis

Previously, every object except Atom subclassed from AtomGroup. This meant that calling .positions on any of these objects would give you the positions of the Atoms contained within that group.

Previous class structure:

Atom

AtomGroup  -> Residue
           -> ResidueGroup -> Segment
                           -> SegmentGroup

New class structure:

Group    -> AtomGroup
         -> ResidueGroup
         -> SegmentGroup

Atom
Residue
Segment

Now each object only contains information pertaining to that particular object. A Residue object only yields information about the residue; to get to the atoms, use Residue.atoms. Similarly, to get the atoms from a Segment or a SegmentGroup use Segment.atoms or SegmentGroup.atoms. As before, you can get all residues associated with a group with Group.residues (which returns a ResidueGroup) and all segments with Group.segments (a SegmentGroup). Bottom line: you should now always be explicit about what you want.

Why this was changed

Previously, having everything inherit from AtomGroup made it unclear at what level of the topology a given method or attribute operated. For example, does ResidueGroup.charges give the charges of the residues or of the atoms? It was also unclear what size a given output would be (see issue 411).

How to work with the new system

To access atom-level information from anything that isn't an AtomGroup, use the .atoms level accessor. For example, change all .positions calls on anything that isn't an AtomGroup to .atoms.positions, as in the sketch below.
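A short sketch of what this looks like in practice:

import MDAnalysis as mda
from MDAnalysisTests.datafiles import PSF, DCD

u = mda.Universe(PSF, DCD)
res = u.residues[0]

print(res.resname)                  # residue-level information on the Residue itself
print(res.atoms.positions.shape)    # atom-level information via the .atoms accessor
print(u.segments[0].atoms.n_atoms)  # likewise for a Segment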

Going forward: what does this mean for MDAnalysis as a project?

A major benefit of the new topology system is that information about the topology of a Universe is now completely encapsulated in the Topology object. This not only makes development and maintenance easier, but also opens the door to some exciting new possibilities as simulation systems grow larger. A single Topology object can now be cleanly shared by multiple Universe instances, each with their own trajectory reader(s). This could make common operations such as fitting a trajectory to a reference structure or doing parallel analysis of many trajectories more efficient for large systems. The Topology object can also be serialized more easily. This should enable parallelization on workers without shared memory (using libraries such as distributed) out-of-the-box.

Making these things work is an ongoing effort, but the MDAnalysis core developers are working to take advantage of all these possibilities. We look forward to the benefits this brings not only to the project, but also to all our users going forward. We hope you like what we've done here.

@dotsdl and @richardjgowers, MDAnalysis

Original draft of the new topology system by @richardjgowers and @dotsdl, November 2015 at ASU.