Tutorial
We will first test that MDAnalysis can perform PSA on the small test set. We will then look in detail at the radical.pilot script that implements the map-step. Finally, we conclude with the reduce-step and the analysis of the data.
Preliminary test
Provide topology and trajectory files to the PSA script mdanalysis_psa_partial.py as two lists in a JSON file (see mdanalysis_psa_partial.py for more details).
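For orientation, such an input file could be generated with a short Python sketch like the following (the file names are hypothetical placeholders, not the actual test data):

import json

# ten pathways that all share the same topology (names are made up)
topologies = 10 * ["adk4ake.psf"]
trajectories = ["pathway{0}.dcd".format(i) for i in range(10)]

# the script expects a JSON file containing the two parallel lists
with open("testcase.json", "w") as out:
    json.dump([topologies, trajectories], out)

Initially we just want to check that the script can process the data: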
(mdaenv) $ $RPDIR/mdanalysis_psa_partial.py --inputfile testcase.json -n 5
This means:
- use the test case (testcase.json)
- compare the first 5 trajectories against the remaining 5 trajectories
The -n (split) argument is important because we are going to use it to decompose the full distance matrix into sub-matrices. (If you just want to do all-vs-all comparisons, use the mdanalysis_psa.py script.)
You should see output like:
Loading paths from JSON file testcase.json
Processing 10 trajectories.
Splitting trajectories in two blocks of length 5 and 5
Calculating D (shape 5 x 5) with 25 entries
----------[ TIMING ]--------------------
load Universes 0.448 s
universe.transfer_to_memory() 0.178 s
PSA distance matrix 1.483 s
saving output 0.001 s
----------------------------------------
total time 2.111 s
----------------------------------------
This indicates that all MDAnalysis parts are working.
Similarly:
(mdaenv) $ $RPDIR/mdanalysis_psa_partial.py --inputfile testcase.json -n 7
should give something like:
Loading paths from JSON file testcase.json
Processing 10 trajectories.
Splitting trajectories in two blocks of length 7 and 3
Calculating D (shape 7 x 3) with 21 entries
----------[ TIMING ]--------------------
load Universes 0.422 s
universe.transfer_to_memory() 0.132 s
PSA distance matrix 1.436 s
saving output 0.001 s
----------------------------------------
total time 1.991 s
----------------------------------------
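The split semantics can be summarized in a few lines of Python (illustration only):

# with N trajectories and split n, the script compares the first n
# trajectories against the remaining N - n, giving an n x (N - n) matrix
N, n = 10, 7
print("D has shape {0} x {1} with {2} entries".format(n, N - n, n * (N - n)))
# -> D has shape 7 x 3 with 21 entries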
Supercomputer environment
We used TACC Stampede to run the calculations on 32 cores, but any cluster or even a multi-core workstation might be sufficient.
- Make sure all environment variables are set (especially the MongoDB URL, RADICAL_PILOT_DBURL) and that password-less ssh works.
- On Stampede: set the environment variable RADICAL_PILOT_PROJECT to your XSEDE allocation:
  export RADICAL_PILOT_PROJECT=TG-xxxxxx
- Activate the mdaenv environment.
- You should have the JSON files in the WORK directory.
- Copy the two scripts to the WORK directory (at the moment, this is a limitation of the scripts to keep them simple):
  cd WORK
  cp $RPDIR/{rp_psa.py,mdanalysis_psa_partial.py} .
Map-step: Radical.pilot script
RP script
The pilot script is rp/rp_psa.py (see rp_psa.py for details). In the following, important parts of the script are explained.
First, a session is created and a Pilot Manager is initialized:
session = rp.Session(name=session_name)

c = rp.Context('ssh')
session.add_context(c)

print "initialize pilot manager ..."
pmgr = rp.PilotManager(session=session)
Second, a ComputePilot is described and submitted to the Pilot Manager:
pdesc = rp.ComputePilotDescription()
pdesc.resource = "xsede.stampede"   # xsede.stampede or xsede.comet or ... check docs!
pdesc.runtime = 20                  # minutes
pdesc.cores = cores
pdesc.project = PROJECT             # project allocation, passed in via env var RADICAL_PILOT_PROJECT
pdesc.cleanup = False
pdesc.access_schema = 'ssh'

# submit the pilot
pilot = pmgr.submit_pilots(pdesc)
Next, a Unit Manager is created for that session and the Pilot is added to it:
umgr = rp.UnitManager(session=session, scheduler=rp.SCHED_DIRECT_SUBMISSION)

# add pilot to unit manager
umgr.add_pilots(pilot)
The script reads the topology and trajectory files from a JSON file:
with open(FILELIST) as inp:
    topologies, trajectories = json.load(inp)
The MDAnalysis script is staged to the pilot's staging area:
fname_stage = []
# stage all files to the staging area
src_url = 'file://%s/%s' % (os.getcwd(), SHARED_MDA_SCRIPT)
sd_pilot = {'source': src_url,
            'target': os.path.join(MY_STAGING_AREA, SHARED_MDA_SCRIPT),
            'action': rp.TRANSFER,
            }
fname_stage.append(sd_pilot)

# synchronously stage the data to the pilot
pilot.stage_in(fname_stage)
In a loop, a compute unit (CU) is set up for each block matrix; this is the map step. In particular, the trajectories are partitioned according to BLOCK_SIZE (the parameter \(w\)) and all necessary information is written to a JSON file that will be used as the input for the mdanalysis_psa_partial.py script:
cudesc_list = []
for i in range(0, len(trajectories), BLOCK_SIZE):
    for j in range(i, len(trajectories), BLOCK_SIZE):
        fshared = list()
        # link the shared MDAnalysis script from the staging area
        shared = {'source': os.path.join(MY_STAGING_AREA, SHARED_MDA_SCRIPT),
                  'target': SHARED_MDA_SCRIPT,
                  'action': rp.LINK}
        fshared.append(shared)

        # link the trajectories of the first block
        shared = [{'source': 'file://{0}'.format(trajectory),
                   'target': basename(trajectory),
                   'action': rp.LINK}
                  for trajectory in trajectories[i:i+BLOCK_SIZE]]
        fshared.extend(shared)
        if i != j:
            # link the trajectories of the second block
            shared = [{'source': 'file://{0}'.format(trajectory),
                       'target': basename(trajectory),
                       'action': rp.LINK}
                      for trajectory in trajectories[j:j+BLOCK_SIZE]]
            fshared.extend(shared)
        # always link all unique topology files
        shared = [{'source': 'file://{0}'.format(topology),
                   'target': basename(topology),
                   'action': rp.LINK}
                  for topology in set(topologies)]
        fshared.extend(shared)

        # block of topology / trajectory pairs
        #    block[:nsplit] + block[nsplit:]
        # The MDA script wants one long list of trajectories and the index
        # nsplit that indicates where to split the list to create the two
        # groups of trajectories that are compared against each other.
        block_top = topologies[i:i+BLOCK_SIZE] + topologies[j:j+BLOCK_SIZE]
        block_trj = trajectories[i:i+BLOCK_SIZE] + trajectories[j:j+BLOCK_SIZE]
        block = [block_top, block_trj]
        nsplit = len(trajectories[i:i+BLOCK_SIZE])
        imax = i + len(trajectories[i:i+BLOCK_SIZE])
        jmax = j + len(trajectories[j:j+BLOCK_SIZE])
        # remember i, imax and j, jmax because this CU calculates the
        # submatrix D[i:imax, j:jmax]
        block_json = "block-{0}-{1}__{2}-{3}.json".format(i, imax, j, jmax)
        block_matrixfile = 'subdistances_{0}-{1}__{2}-{3}.npy'.format(i, imax, j, jmax)
        manifest.append((block_matrixfile, block_json, (i, imax), (j, jmax)))

        # create the input file for the CU and share it
        with open(block_json, "w") as out:
            json.dump(block, out)
        shared = [{'source': 'file://{0}'.format(os.path.realpath(block_json)),
                   'target': basename(block_json),
                   'action': rp.LINK}]
        fshared.extend(shared)

        # define the compute unit that computes the distance submatrix
        # TODO: store the offsets WITH the returned matrix (pkl or array
        #       archive) instead of encoding them in the filename
        cudesc = rp.ComputeUnitDescription()
        cudesc.executable = "python"
        cudesc.pre_exec = ["module load python; source activate mdaenv"]  # only for Stampede with our conda env
        cudesc.input_staging = fshared
        cudesc.output_staging = {'source': block_matrixfile,
                                 'target': block_matrixfile,
                                 'action': rp.TRANSFER}
        cudesc.arguments = [SHARED_MDA_SCRIPT, '--nsplit', nsplit,
                            '--inputfile', block_json,
                            '--outfile', block_matrixfile]
        cudesc.cores = 1

        cudesc_list.append(cudesc)
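To see which sub-matrices the double loop generates, here is a small standalone sketch (not part of rp_psa.py) that enumerates the block index ranges:

def block_indices(N, w):
    # enumerate the upper-triangle blocks of an N x N matrix in tiles of w
    for i in range(0, N, w):
        for j in range(i, N, w):
            yield (i, min(i + w, N)), (j, min(j + w, N))

# for N=10 trajectories and w=5 this yields
# ((0, 5), (0, 5)), ((0, 5), (5, 10)), ((5, 10), (5, 10));
# the lower triangle is filled in by symmetry during the reduce step
print(list(block_indices(10, 5)))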
For the reduce step, all information about the block matrices (file names and indices into the distance matrix) is written to a JSON file (the “manifest”):
with open(MANIFEST_NAME, "w") as outfile:
    json.dump(manifest, outfile)
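Each manifest entry records the sub-matrix file name, the input JSON file, and the row and column index ranges. For illustration, a sketch that reads the manifest back (assuming the file name manifest.json used below):

import json

with open("manifest.json") as inp:
    manifest = json.load(inp)
for matrixfile, jsonfile, (i0, i1), (j0, j1) in manifest:
    # e.g. subdistances_0-5__5-10.npy covers D[0:5, 5:10]
    print("{0} -> D[{1}:{2}, {3}:{4}]".format(matrixfile, i0, i1, j0, j1))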
Finally, the CUs are submitted to execute on the compute resources and the script waits until they are all complete:
# submit the units, run, and wait until all are done
units = umgr.submit_units(cudesc_list)
umgr.wait_units()
Launch pilot jobs
Launch the pilot job from the login node of Stampede (or the cluster where you set up your radical.pilot environment):
python rp_psa.py trajectories.json 20 16 SPIDAL.001
The rp_psa.py radical.pilot script takes as input (see rp_psa.py for details):
- the JSON file with the trajectories (trajectories.json)
- the number of trajectories per block (20)
- the number of cores to request (16)
- a session name (an arbitrary string, “SPIDAL.001”); note that you need to provide a new session name every time you re-run the command, e.g. “SPIDAL.002”, “SPIDAL.003”, …
Reduce-step
Once the pilot job has completed and all block matrices have been computed, we combine all data (reduce) into the distance matrix \(D_{ij}\) and analyze it:
psa_reduce.py -t trajectories.json -m manifest.json \
-p psa_plot.pdf -o distance_matrix.npy
Combine block matrices
The manifest.json file contains all the information that is needed to re-assemble the distance matrix: for each output file it also stores the indices of the sub-matrix within the full distance matrix, so the full distance matrix can be built with:
def combine(traj, manifest):
    trj = json.load(open(traj))
    N = len(trj[0])
    D = np.zeros((N, N))

    mf = json.load(open(manifest))
    missing = []
    for subD, _, (i0, i1), (j0, j1) in mf:
        try:
            D[i0:i1, j0:j1] = np.load(subD)
            D[j0:j1, i0:i1] = D[i0:i1, j0:j1].T
        except IOError:
            missing.append(subD)
    if missing:
        print("WARNING: Missing data: output files\n{}\n"
              "could not be found".format(", ".join(missing)))
    return D
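A minimal usage sketch (assuming manifest.json and the subdistances_*.npy block files from the pilot job are in the current directory):

import numpy as np

D = combine("trajectories.json", "manifest.json")
np.save("distance_matrix.npy", D)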
The matrix is written to a numpy file distance_matrix.npy (and can be loaded with numpy.load()).
Analysis
The distance matrix can be clustered to reveal patterns in the trajectories. Here we use hierarchical clustering with Ward’s linkage criterion as implemented in scipy.cluster.hierarchy [Seyler2015]. The clustering and plotting code was taken from the cluster() and plot() methods of PSAnalysis. The plotting code is fairly complicated but the clustering is straightforward:
def cluster(distArray, method='ward', count_sort=False,
            distance_sort=False, no_plot=False, no_labels=True,
            color_threshold=4):
    """Cluster trajectories and optionally plot the dendrogram.

    :Arguments:
      *method*
        string, name of linkage criterion for clustering [``'ward'``]
      *no_plot*
        boolean, if ``True``, do not render the dendrogram [``False``]
      *no_labels*
        boolean, if ``True`` then do not label dendrogram [``True``]
      *color_threshold*
        For brevity, let t be the color_threshold. Colors all the
        descendent links below a cluster node k the same color if k is
        the first node below the cut threshold t. All links connecting
        nodes with distances greater than or equal to the threshold are
        colored blue. If t is less than or equal to zero, all nodes are
        colored blue. If color_threshold is None or 'default',
        corresponding with MATLAB(TM) behavior, the threshold is set to
        0.7*max(Z[:,2]). [``4``]

    :Returns:
      tuple ``(Z, dgram)`` with the linkage matrix and the dendrogram
      dictionary, whose ``'leaves'`` entry gives the row-wise order of
      the objects after clustering
    """
    matplotlib.rcParams['lines.linewidth'] = 0.5

    Z = scipy.cluster.hierarchy.linkage(distArray, method=method)
    dgram = scipy.cluster.hierarchy.dendrogram(
        Z, no_labels=no_labels, orientation='left',
        count_sort=count_sort, distance_sort=distance_sort,
        no_plot=no_plot, color_threshold=color_threshold)
    return Z, dgram
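For example, the function could be driven as follows; scipy.cluster.hierarchy.linkage() expects a condensed distance matrix, so the square matrix is first converted with scipy.spatial.distance.squareform() (a usage sketch, assuming the distance matrix was saved as distance_matrix.npy in the previous step):

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy
from scipy.spatial.distance import squareform

# load the assembled distance matrix and convert it to condensed form
D = np.load("distance_matrix.npy")
distArray = squareform(D, checks=False)

# cluster, draw the dendrogram, and save the plot
Z, dgram = cluster(distArray, color_threshold=4)
plt.savefig("psa_plot.pdf")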
In the resulting plot we indicate FRODA trajectories with a heavy double bar and DIMS trajectories with a thin single bar:

Symbol | Method
-------|-------
``||`` | FRODA
``|``  | DIMS
Figure PSA of AdK transitions clearly shows that DIMS and FRODA transitions are different from each other. Each method produces transitions that are characteristic of the method itself. In a comparison of many different transition path sampling methods it was also found that this pattern generally holds, which indicates that there is likely no “best” method yet [Seyler2015].