Blog

Google Summer of Code 2019

NumFOCUS Foundation Google Summer of Code 2018

MDAnalysis has been accepted as a sub-org of the NumFOCUS foundation, for Google Summer of Code 2019. If you are interested in working with us this summer as a student read the advice and links below and write to us on the mailing list.

We are looking forward to all applications from interested students (undergraduate and postgraduate).

The application window deadline is April 9, 2019 at 18:00 (UTC). As part of the application process you must familiarize yourself with Google Summer of Code 2019.

Apply as soon as possible; the application window opens on March 25, 2019 but you should start engaging with us sooner than that: read on.

Project Ideas

We have listed several possible projects for you to work on on our wiki.

Alternatively if you have your own idea about a potential project we’d love to work with you to develop this idea; please write to us on the developer list to discuss it there.

Information for Students

You must meet our own requirements if you want to be a student with MDAnalysis this year (read all the docs behind these links!). You must also meet the eligibility criteria.

The MDAnalysis community values diversity and is committed to providing a productive, harassment-free to every member. Our Code of Conduct explains the values that we as a community uphold. Every member (and every GSoC student) agrees to follow the Code of Conduct.

As a start to get familiar with MDAnalysis and open source development you should follow these steps:

Complete the Tutorial

We have a tutorial explaining the basics of MDAnalysis. You should go through the tutorial at least once to understand how MDAnalysis is used.

Introduce yourself to us

Introduce yourself on the mailing list. Tell us what you plan to work on during the summer or what you have already done with MDAnalysis.

Close an issue of MDAnalysis

You must have at least one commit in the development branch of MDAnalysis in order to be eligible, i.e.. you must demonstrate that you have been seriously engaged with the MDAnalysis project.

We have a list of easy bugs to work on in our issue tracker on GitHub. We also appreciate if you write more tests or update/improve our documentation. To start developing for MDAnalysis have a look at our guide for developers and write us on the mailing list if you have more questions about setting up a development environment.

@richardjgowers, @jbarnoud, @micaela-matta, @orbeckst

Release 0.19.0, 0.19.1, 0.19.2

A new version of MDAnalysis has been released! Or rather, three new releases in quick succession, so we just talk about all of them in a single post.

This version brings a multitude of fixes, deprecations, and new features including exciting additions from one of our two 2018 Google Summer of Code students (other cool new features from GSOC2018 will be unveiled in the next release…) and one NSF REU student as well as full Windows support. Some highlights are given below, whilst the release notes list all the changes in this version.

GSoC 2018: Capped Distance Searches

GSoC student Ayush Suhane (@ayushsuhane) worked on integrating faster distance search algorithms (such as the grid search in the new MDAnalysis.lib.nsgrid neighbor search library and periodic KDTrees) for limited distance searches. His work improved many distance search based analysis methods like the calculation of radial distribution functions. There are now two new low-level functions MDAnalysis.lib.distances.capped_distance and MDAnalysis.lib.distances.self_capped_distance, which find all pairwise distances between particles up to a given maximum distance. By specifying a maximum distance to the search, it is possible to optimize the search, leading to greatly improved performance especially in larger systems.

For example to find all contacts between oxygen and hydrogen up to 5.0 Å

from MDAnalysisTests.datafiles import PSF, DCD
import MDAnalysis as mda
from MDAnalysis.lib.distances import capped_distance

u = mda.Universe(PSF, DCD)

oxy = u.select_atoms('name O*')
hyd = u.select_atoms('name H*')

idx, dists = capped_distance(oxy.positions, hyd.positions, box=u.dimensions, max_cutoff=5.0)

Unlike distance_array, which returns a matrix of every pairwise distance, here a sparse representation is returned, where idx is a (n, 2) array of the indices of the atoms and dists is the distance between these atoms.

For full details on the implementation of this and the expected performance improvements see Ayush’s post on faster distance search algorithms.

Analysis improvements

NSF REU student Henry Mull implemented a new analysis module MDAnalysis.analysis.dihedrals which includes analysis classes for

  • fast dihedral angle calculation (useful for featurization and dimensionality reduction in dihedral space as well as conformational analysis),
  • Ramachandran analysis (protein backbone dihedrals) including functions and data to plot allowed and generously allowed regions,
  • Janin analysis (protein sidechain dihedrals)

Ramachandran plot

Irfan Alibay improved the density_from_Universe() function, which now allows the user to exactly specify the region in which a density should be calculated. In this way, it becomes easier to calculate densities on identical grids for different simulations so that these densities can be compared more easily.

Shujie Fan added site specific radial distribution function analysis. The InterRDF_s() function calculates the radial distribution function relative to a single or a few particles (a “site”). The function helps with the analysis of the coordination of ions and ligands in binding sites of proteins or other biomolecules, for instance, the distribution of oxygen ligand atoms around sodium ions. Importantly, many of these sites can be computed at the same time, which improves performance because the most time consuming part of almost all analysis tasks, the loading of the trajectory data from disk into memory, only has to be done once, and then multiple computations can be performed with the data in memory.

Windows support

Since 0.19.2, Windows is fully supported under Python 3.4+ (Python 2.7 is not officially supported because of technical difficulties, which we decided not to address because of limited developer time and the fact that Python 3 is now the recommended version of Python).

For Windows we recommend the conda installation. If you want to install with pip or from source then you will need the full Microsoft developer environment with Microsoft Visual C++ 14.0.

Improvements to file readers

  • For the LAMMPS dump reader one can now pass an atom_style keyword argument to specify what is on each line, for instance, atom_style = 'id type charge x y z'. This provides much greater flexibility to read different dump files.
  • The ChainReader with argument continuous=True can now correctly handle continuous XTC or TRR trajectories that were split into multiple files with Gromacs gmx mdrun -noappend; in particular it will handle overlapping frames in such a way that only the last generated frame is included and the trajectory contains strictly monotonously increasing time steps. This means, Gromacs folks will never have to manually concatenate their trajectories for use with MDAnalysis.
  • When slicing a trajectory as with traj = u.trajectory[start:stop:step] then this returns now a trajectory slice object (an iterable), which can be passed around and then iterated over: essentially, it’s a trajectory that knows that it should only deliver a subset of frames. Functions like len(traj) are also fast (O(1)), very similar to how Python 3’s range() object functions. See FrameIterators for more details.
  • The MemoryReader is now a full first-class citizen in MDAnalysis, which can keep coordinates and velocities, forces, box dimensions in memory. It is a versatile swiss-army knife, both for fast analysis and also visualization with nglview (e.g., fit trajectory in memory, then visualize).
  • new hybrid36 PDB-like format
  • The Gromacs TPR parser now reads SETTLE constraints.
  • The Amber top parser reads bonds/angles/dihedrals.

Miscellaneous performance improvements

  • guess_bonds (thanks to above methods)
  • geometry selections faster (thanks to above)
  • make_whole (C++ rewrite)
  • fragment finding (C++ rewrite)
  • improvements to AtomGroup internals

Deprecations

This release brings a few new deprecations as the package heads towards a final API:

  • start/stop/step are deprecated in the initialization of Analysis classes. These parameters should instead be given to the run() method of the class.
  • Almost all “save()”, “save_results()”, “save_table()” methods in analysis classes.
  • Deprecated use of core.flags, the deprecation messages for this give advice on how to replace this functionality.
  • Default filename directory of align.AlignTraj is deprecated and will change in 1.0 to the current directory.

Author statistics

Altogether this represents the work of 16 contributors from around the world, and featured the work of five new contributors:

Upgrading to MDAnalysis version 0.19.x

To get all these features run either conda update -c conda-forge mdanalysis or pip install --upgrade MDAnalysis.

— The MDAnalysis Team

GSOC 2018: Improvements in distance search methods

We are pleased to announce another successful year of Google Summer of Code with the NumFOCUS organization, thanks to Richard Gowers and Jonathan Barnoud for mentoring the GSoC students. This year one of the projects was to improve the performance of pairwise distance computations, which is used quite frequently in MDAnalysis in different forms.

MDAnalysis v0.19.0 and higher now include the new functions MDAnalysis.lib.distances.capped_distance and MDAnalysis.lib.distances.self_capped_distance which offer a much faster way to calculate all pairwise distances up to a certain maximum distance. By only considering distances up to a certain maximum, we can use various algorithms to optimise the number of pairwise comparisons that are performed. Behind the scenes, these functions are using one of three different algorithms: bruteforce which is a naive pairwise distance calculation algorithm, pkdtree which is a wrapper method around Scipy’s KD tree search algorithm and nsgrid which is an implementation of cell-list algorithm. This last algorithm uses the new MDAnalysis.lib.nsgrid module which was implemented with the help of Sebastien Buchoux.

For more information on these algorithms the reader is encouraged to read @ayushsuhane’s blog, which includes a comparison of these approaches and their performance in different conditions.

The GSoC project

One of the major bottleneck in various analysis routines in MDAnalysis (and typically in molecular dynamics studies) is the evaluation of pairwise distances among the particles.

The primary problem revolves around fixed radius neighbor search algorithms. MDAnalysis offers a suite of algorithms including brute force method, tree-based binary search algorithms to solve such problems. While these methods are suitable for a variety of analysis functions using pairwise distances in MDAnalysis, one of the question was whether one can improve the performance of distance calculations using other established neighbor search methods.

This question led to the inception of a Google Summer of Code project with NumFOCUS. Ayush Suhane completed the project and was able to demonstrate performance improvements for specific cases of distance selections, identification of bonds, and radial distribution function in the analysis module of MDAnalysis. More details on the commit history, PR’s and blog posts can be found in the final report submitted to GSoC. Real-time benchmarks for specific modules in MDAnalysis can be found here.

The new capped_distance() function

The major highlight of the project is the introduction of MDAnalysis.lib.distances.capped_distance which allows automatic selection of methods based on predefined set of rules to evaluate pairs of atoms in the neighborhood of any particle. It allows a user-friendly interface for the developers to quickly implement any new algorithm throughout MDAnalysis modules. To test any new algorithm, one must comply with the following API:

def newmethod_capped(reference, configuration, max_cutoff, min_cutoff=None, box=None, return_distance=True):
    """
        An Algorithm to evaluate pairs between reference and configuration atoms
        and corresponding distances
    """
    return pairs, distances

Once the method is defined, register the function name in _determine_method in MDAnalysis.lib.distances as:

from MDAnalysis.lib import distances
distances_methods['newmethod'] = newmethod_capped

That’s it. The new method is ready to be tested across functions which use capped_distance. For any specific application, it can be called as capped_distance(ref, conf, max_dist, method='newmethod') from the function.

Treatment of periodic boundary conditions

While implementing any new algorithm for Molecular dynamics trajectories, one additional requirement is to handle the periodic boundary conditions. A combination of versatile function augment_coordinates and undo_augment can be used with any algorithm to handle PBC. The main idea is to extend the box by generating duplicate particles in the vicinity of the box by augment_coordinates. These duplicates, as well as the original particles, can now be used with any algorithm to evaluate the nearest neighbors. After the operation, the duplicate particles can be reverted back to their original particle indices using undo_augment. These functions are available in MDAnalysis.lib._augment. We encourage the interested readers to try different algorithms using these functions. Hopefully, you can help us improve the performance further with your feedback. As a starting point, the skeleton to enable PBC would take the following form:

def newmethod_search(coords, centers, radius, box=None):
    aug, mapping = augment_coordinates(coords, box, radius)
    all_coords = no.concatenate([coords, aug])
    """
        Perform operations for distance evaluations
        with **all_coords** using the new algorithm 
        and obtain the result in indices
    """
    indices = undo_augment(indices, mapping, len(coords))
    return indices

Finally, this function can be tested with capped_distance to check the performance against already implemented algorithms in MDAnalysis.

Performance improvements

As a final note, we managed to get a speed improvement of

  • ~ 2-3 times in Radial Distribution Function computation,
  • ~ 10 times in identification of bonds, and
  • ~ 10 times in distance based selections for the already existing benchmarks in MDAnalysis.

The performance is also found to improve with larger datasets but is not reported in benchmarks. Any motivated reader is welcome to submit their feedbacks about the performance of the above-mentioned functions on their data, and/or a benchmark which we would be happy to showcase to the world.

This was a flavor of what work was done during GSoC’18. Apart from performance improvements, it is envisioned that this internal functionality will reduce the burden from the user to understand all the technical details of distance search algorithms and instead allow a user to focus on their analysis, as well as allow future developers to easily implement any new algorithm which can exceed the present performance benchmarks.

Ayush Suhane, Richard Gowers