Background¶
We want to perform Path Similarity Analysis (PSA) [Seyler2015] on a relatively small data set of 400 molecular dynamics trajectories. PSA will group similar trajectories with each other and enables a quantitative comparison. 200 trajectories were produced with the DIMS MD method [Perilla2011] [Beckstein2009], which is based on molecular dynamics, and 200 were produced with FRODA [Farrell2010] , which is based on a geometric rigidity decomposition. We want to understand if these methods produce similar or different transition pathways, and, perhaps how they differ.
The PSA distance metric¶
PSA requires the calculation of the distances between all paths \(P_i\), the distance matrix
The MDAnalysis.analysis.psa
module in MDAnalysis contains
implementations for various distance functions \(\delta(P,
Q)\). Here we are going to use the discrete Fréchet distance as
implemented in
MDAnalysis.analysis.psa.discrete_frechet()
. Another good choice
is the Hausdorff distance
(MDAnalysis.analysis.psa.hausdorff()
), and it is straightforward
to alter the example code in this tutorial (namely,
mdanalysis_psa_partial.py) to explore the use of a
different distance function.
Computational cost¶
Because the path metric \(\delta(P, Q)\) is symmetric, the distance metric is also symmetric. Furthermore, \(\delta(P, P) = 0\). Therefore, we need to perform in total
calculations. For \(N=400\), this means 79,800 individual trajectory comparisons.
Approach¶
Individual path comparisons are relatively compute intensive (they scale with \(\mathcal{O}(n_i n_j)\), the product of the number of time frames in each trajectory. But they are all independent and hence a simple map-reduce strategy can be easily implemented whereby individual jobs compute sub-matrices of the distance matrix (map) and the results are then combined into the complete solution (reduce).
Block distance matrix¶
We choose a block size \(w\) and decompose the distance matrix \(\mathsf{D} = [D_{ij}],\ 0 \leq i, j < N\) into block matrices \(\mathsf{D}^{\alpha}\) with \(0 \leq \alpha < N/w\) and
(with the appropriate adjustments for blocks at the edges that contain less than \(w\) entries).
MDAnalysis¶
In MDAnalysis each trajectory is loaded into a
Universe
, using a topology file
(which contains static information about the atoms, bonds etc) and a
trajectory file (which contains the changing coordinates for each
time step):
import MDAnalysis as mda
universe = mda.Universe(topology, trajectory)
Because we need to access all trajectory data for our analysis, we can
increase performance by first loading the whole trajectory into memory
1 with the
transfer_to_memory()
method
2:
universe.transfer_to_memory()
We will restrict the calculation of the path distances to a subset of
atoms, namely the “C-alpha” atoms that are distributed along the
protein backbone and report on the larger conformational changes. The
C-alpha atoms can be selected as a
AtomGroup
with the atom selection
language:
ca = u.select_atoms("name CA")
In order to calculate a pairwise distance between two trajectories we extract the coordinates of all CA atoms for all time steps into a numpy array:
P = u.trajectory.timeseries(asel=ca, format="fac")
With the coordinates for a second trajectory Q
we can then
calculate the discrete Fréchet distance using an recursive dynamic
programming algorithm [EiterManilla1994], implemented in
MDAnalysis.analysis.psa.discrete_frechet()
:
from MDAnalysis.analysis import psa
dF = psa.discrete_frechet(P, Q)
dF
contains the discrete Fréchet distance \(\delta_F(P, Q)\),
a real non-negative number. Because we base the Fréchet distance on
the root mean square distance (RMSD) between the CA coordinates for
two frames in \(P\) and \(Q\) as its point-wise metric (see,
for instance, [Seyler2015] for more details), \(\delta_F(P, Q)\)
has the interpretation of the RMSD between the two frames in the two
trajectories that best characterize the difference between the two
trajectories (they form a Fréchet pair).
Note
For macromolecular systems, we typically remove all translational
and rotational degrees of freedoms for all trajectories by
superimposing all trajectory frames on a single reference
structure [Seyler2015]. The superposition can be carried out in a
pre-processing step using, for instance,
MDAnalysis.analysis.align.AlignTraj
or as part of PSA with
MDAnalysis.analysis.psa.PSAnalysis
. The trajectories for
this tutorial were already superimposed appropriately (on the
“CORE” domain of AdK, as described in more detail in
[Seyler2015].)
Calculating a full Fréchet distance matrix \(D_{ij} = \delta(P_i, P_j)\) just requires more book-keeping in order to perform the above steps for the Cartesian product of all trajectories \(P_i \times P_j\).
Radical.pilot¶
We use radical.pilot
to generate one compute unit for each
block matrix computation. The pilot job distributes the individual
compute units, which includes staging of input trajectories and
retrieval of the output file (the block matrix), as well as running
the MDAnalysis script that performs the calculation of the block
matrix on a compute node.
Footnotes
- 1
Instead of using
transfer_to_memory()
one could also simply set thein_memory=True
keyword argument ofUniverse
as shown for in-memory representation of arbitrary trajectories. However, here we keep the two steps separate for conceptual clarity.- 2
Loading trajectory data into memory makes use of the new
MemoryReader
functionality inMDAnalysis.coordinates.memory
; this functionality has been available since the 0.16.0 release.