20 Nov 2022
Motivation
In molecular dynamics simulations, users frequently have to inspect energy-like terms such as potential or kinetic energy, temperature, or pressure. This is so common a task that even small inefficiencies add up. Currently, users have to create intermediate files from their MD simulation’s output files to obtain plottable data, and this quickly becomes cumbersome when multiple terms are to be inspected. Being able to read in the energy output files directly would make this more convenient.
Therefore, I wanted to add readers for energy-type files (output files containing information on potential and kinetic energy, temperature, pressure, and other such terms) from a number of MD engines to the auxiliary module of MDAnalysis in this project. This would make quality control of MD simulations much more convenient, and allow users to analyse the energy data directly from within their scripts or Jupyter notebooks, without switching windows or writing intermediate files.
In a first instance, I focussed on a reader for EDR files, which are energy files written by
GROMACS during simulations. EDR files are binary files which follow the XDR protocol. To read these
files, @jbarnoud had previously written the panedr Python package, which was the
foundation of my work this summer.
Adapting Panedr for use in MDAnalysis
The panedr package makes use of the xdrlib Python module to parse EDR files
and return the data in the form of a pandas DataFrame. My GSoC project started out
adapting this package for use in MDAnalysis. In particular, we wanted to avoid making
pandas a dependency in MDAnalysis. This necessitated some refactoring of panedr (PR #33),
which ultimately led to a restructuring of the code into two distinct packages: panedr
and pyedr (PRs #42 and #50). Both packages read EDR files, but one returns the
data as a pandas DataFrame, the other as a dictionary of NumPy arrays. Both also
expose a function to return a dictionary of units of the energy terms found in the file (PR #56).
Example:
import pyedr
file = "path/to/edr/file.edr"
energy_dictionary = pyedr.edr_to_dict(file)
unit_dictionary = pyedr.get_unit_dictionary(file)
EDRReader
With pyedr available, I started work on the implementation of an EDRReader in MDAnalysis (PR #3749).
Here, I benefited hugely from the existing AuxReader framework.
However, from the outset, it was clear that the auxiliary API would need to be changed to accommodate
the large number of terms found in EDR files.
The XVGReader was built under the assumption that auxiliary files would only ever
contain one time-dependent term, so the XVG files would contain a time value and a data value
per entry. Associating data with a trajectory via the add_auxiliary method thus only
required two arguments: a name under which to store the data in universe.trajectory.ts.aux,
and the file from which to read it.
This is not suitable for EDR files, as they can contain dozens of terms per time point,
only a few of which might be relevant for a given analysis. Therefore, the auxiliary API had to be changed as follows:
While the XVGReader still works as before, the new base class for adding auxiliary data expects a dictionary to be passed. The dictionary maps the names to be used in MDAnalysis to the names of the terms read from the EDR file. This is shown in the following minimal working example:
import MDAnalysis as mda
from MDAnalysisTests.datafiles import AUX_EDR, AUX_EDR_TPR, AUX_EDR_XTC
term_dict = {"temp": "Temperature", "epot": "Potential"}
aux = mda.auxiliary.EDR.EDRReader(AUX_EDR)
u = mda.Universe(AUX_EDR_TPR, AUX_EDR_XTC)
u.trajectory.add_auxiliary(term_dict, aux)
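The idea behind the dictionary can be illustrated without MDAnalysis. The following is only a minimal sketch and not the actual AuxReader implementation: the helper attach_terms and all sample values are hypothetical, and a plain dictionary of lists stands in for the contents of an EDR file.

```python
from types import SimpleNamespace

# Stand-in for the contents of an EDR file: each term has one value per step.
# All numbers are made up for illustration.
all_terms = {
    "Time": [0.0, 1.0, 2.0],
    "Temperature": [298.1, 300.4, 299.7],
    "Potential": [-524700.0, -524550.0, -524610.0],
    "Pressure": [1.2, 0.8, 1.0],
}


def attach_terms(term_dict, data):
    """Return one namespace per step that holds only the requested terms.

    term_dict maps the attribute name to use (e.g. "epot") to the name of
    the term in the data (e.g. "Potential"). Hypothetical helper.
    """
    missing = [name for name in term_dict.values() if name not in data]
    if missing:
        raise KeyError(f"terms not found in data: {missing}")
    n_steps = len(data["Time"])
    return [
        SimpleNamespace(
            **{alias: data[name][i] for alias, name in term_dict.items()}
        )
        for i in range(n_steps)
    ]


steps = attach_terms({"temp": "Temperature", "epot": "Potential"}, all_terms)
print(steps[1].temp, steps[1].epot)
```

Only the two requested terms end up on each step, under the short names chosen by the user; the pressure values are never attached.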
Aside from this API change, the EDRReader can do everything the XVGReader can. In addition to that, it
has some new functionality.
- Because EDR files can become reasonably large, a memory warning will be issued when more than a gigabyte of storage is used by the auxiliary data. This default value of 1 GB can be changed by passing a value as memory_limit when creating the EDRReader object.
- EDR files store data of a large number of different quantities, so it is important to know their units as well. The EDRReader therefore has a unit_dict attribute that contains this information. By default, units found in the EDR file will be converted to MDAnalysis base units on reading. This can be disabled by setting convert_units to False on creation of the reader.
- In addition to associating data with trajectories, the EDRReader can also return the NumPy arrays of selected data, which is useful for plotting, for example. This is done via the EDRReader’s get_data method.
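To make the unit handling more concrete, here is a minimal sketch of how a unit dictionary can drive a conversion to base units. It is not the EDRReader's code: the conversion table, the function convert_to_base_units and the sample data are assumptions made for illustration. The only physical fact it relies on is that MDAnalysis uses ångström for lengths whereas GROMACS writes nanometres.

```python
# Hypothetical table: unit as written by GROMACS -> (base unit, factor).
# Only lengths are rescaled here (1 nm = 10 Angstrom).
TO_BASE_UNITS = {
    "nm": ("A", 10.0),
    "ps": ("ps", 1.0),
    "kJ/mol": ("kJ/mol", 1.0),
}


def convert_to_base_units(data, unit_dict):
    """Return converted copies of data and unit_dict.

    Units that are missing from TO_BASE_UNITS are passed through unchanged.
    """
    new_data, new_units = {}, {}
    for term, values in data.items():
        unit = unit_dict[term]
        base_unit, factor = TO_BASE_UNITS.get(unit, (unit, 1.0))
        new_data[term] = [value * factor for value in values]
        new_units[term] = base_unit
    return new_data, new_units


# Made-up sample data with its accompanying unit dictionary.
data = {
    "Box-X": [3.0, 3.5],
    "Potential": [-524700.0, -524550.0],
    "Temperature": [298.0, 300.0],
}
units = {"Box-X": "nm", "Potential": "kJ/mol", "Temperature": "K"}

converted, converted_units = convert_to_base_units(data, units)
print(converted["Box-X"], converted_units["Box-X"])
```

Returning new dictionaries rather than modifying the inputs keeps the original values and units available, which mirrors the option of switching the conversion off.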
Additionally, the new auxiliary readers allow the selection of frames based on the
values of the auxiliary data. For example, it is possible to select only frames
with a potential energy below a certain threshold as follows:
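import numpy as np
# Reuses the `aux` EDRReader created in the previous example.
u = mda.Universe(AUX_EDR_TPR, AUX_EDR_XTC)
term_dict = {"epot": "Potential"}
u.trajectory.add_auxiliary(term_dict, aux)
# Collect the indices of all frames whose potential energy is below the threshold.
selected_frames = np.array([ts.frame for ts in u.trajectory if ts.aux.epot < -524600])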
Having selected these frames, it is possible to analyse only this subset of a trajectory:
protein = u.select_atoms("protein")
for ts in u.trajectory[selected_frames]:
    do_analysis(protein)
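The selection logic itself is an ordinary filter over per-frame values. The sketch below shows it in isolation, with a made-up list of energies standing in for ts.aux.epot; the helper mean_epot is hypothetical and merely plays the role of an analysis run on the selected subset.

```python
# Made-up potential energies, one per frame, standing in for ts.aux.epot.
epot_per_frame = [-524550.0, -524700.0, -524610.0, -524580.0, -524650.0]
threshold = -524600

# Keep the index of every frame whose energy lies below the threshold.
selected_frames = [
    frame for frame, epot in enumerate(epot_per_frame) if epot < threshold
]


def mean_epot(frames):
    """Average the potential energy over the given frame indices."""
    return sum(epot_per_frame[frame] for frame in frames) / len(frames)


print(selected_frames)
print(mean_epot(selected_frames))
```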
More details on the EDRReader’s functionality can be found in the MDAnalysis User Guide.
Outlook
Through this project, the AuxReader framework was expanded, and handling of EDR
files was made more convenient with pyedr and EDRReaders. I am continually making
improvements to these contributions, and will include an auxiliary reader for
NumPy arrays in the future. This NumPyReader will be very useful, because many
analysis methods in MDAnalysis return their results in the form of NumPy arrays.
Having the option of associating these results with trajectories will facilitate
further analyses, for example allowing the slicing of trajectories by RMSD to a reference
structure.
In general, the changes made to the auxiliary API should make it easier for
additional AuxReaders to be developed. Being able to easily associate any number of terms
to each time step is helpful for general readers (for example for parsing CSV data)
and for more specific readers (for parsing energy files generated by other MD engines, for example Amber or NAMD).
With the actual handling of the data already taken care of, the challenge in the implementation here would
lie in the correct parsing of the plain text files, and in proper testing and future proofing.
Lessons learned
Participating in the Summer of Code was a great opportunity for me. I learned a lot, from small things like individual code patterns to larger points concerning overall best practices, the value of test-driven development, and package management. This is thanks in large part
to the mentorship and advice I have received from @hmacdope, @ialibay, @orbeckst, and @fiona-naughton.
Thanks very much to you all, and to @jbarnoud.
– @bfedder
10 Nov 2022
We are happy to announce that MDAnalysis has been awarded a grant from the Chan Zuckerberg Initiative as part of the Essential Open Source Software for Science program: “EOSS5: Growing the MDAnalysis community sustainably: A dedicated project manager, teaching and outreach initiatives”.
The MDAnalysis organisation strongly believes that engaging in outreach, mentoring and teaching is key to its mission of being the leading software package for molecular simulation analysis in Python.
This 2-year grant will enable us to hire a full-time community, outreach, and project manager to extend our teaching and mentoring commitments and engage with the molecular simulation community across academia and industry.
Over the next two years, our key deliverables include:
- Increasing our participation in outreach activities
- Organising user group meetings (UGMs)
- Hosting a series of online teaching workshops (3 per year)
- Networking with other software projects within and outside the molecular simulation space
- Engaging with industrial partners towards opening additional funding streams for the project
A big thank you to all contributors, past and present, for making this possible.
See the job description and apply for the role of community, outreach and project manager
07 Nov 2022
This blog post outlines MDAnalysis’ proposal to change its license
to the GNU Lesser General Public License (LGPL v3+).
A summary of our reasons for proposing this license
change is provided, alongside upcoming actions
for community members and library
contributors.
⚠️ Disclaimer
The MDAnalysis core team members are not
lawyers. As such the information provided here does not, and is not
intended to, constitute legal advice. This blog post also does not
represent MDAnalysis’ full legal position on software licensing; it
simply aims to inform MDAnalysis developers and users on why
we believe the library should be relicensed.
Further information on open-source software licensing can be found
from sources such as the Open Source Initiative,
tl;drLegal and the Software Sustainability Institute.
Should you have any concerns about licensing, we always strongly
recommend getting legal advice before making any decisions on how
licensing changes may affect you.
Overview
We want to change the license of MDAnalysis from the GNU General
Public License v2 (or any later versions) (GPL v2+) to the less
restrictive GNU Lesser General Public License v3 (or any later versions)
(LGPL v3+) license. Both are open source licenses but it is
our view that the LGPL v3+ will give developers more freedom
in how they license any of their own codes that make use of MDAnalysis.
As detailed by the Open Source Definition, licenses are core to
the definition of open source. “Open source doesn’t just mean access
to the source code”. The license defines how code can be used, copied,
changed, and incorporated into other code.
License changes will affect how people interact with the MDAnalysis code
base going forward. We need the agreement of our contributors and
community members to change from GPL v2+ to LGPL v3+.
In this post we want to share our motivation, outline the relicensing
process, and invite comments / questions from the community.
Rationale for license change
Why is GPL v2+ no longer the best choice?
Since its initial release in 2008, MDAnalysis has grown from a small
Python package used by a handful of enthusiastic graduate students and
postdocs to a mature library that is used by thousands of researchers
in the molecular sciences. The MDAnalysis library was published under
an open source license from the start so that anyone could freely use
it, contribute to it, and build on it. We chose the GNU General Public License
version 2+ (GPL v2+) for this purpose. The GPL v2+ has a “copyleft”
clause that requires anyone using MDAnalysis in their own code to
also adopt a compatible version of the GPL for their code. This meant
that code contributors could feel assured that any time and work they
invested into MDAnalysis would not end up contributing to software
without open-source licensing.
However, the GPL v2+ has also created barriers to adoption of MDAnalysis.
Under many interpretations, ours included, its copyleft clause prevents developers
who use MDAnalysis from making their own code available under non-GPL
licenses. It is the MDAnalysis core team’s view that we do not want
to dictate how our developers and users should license their code, but we
do wish that work on the MDAnalysis library remains open and free.
Changing to a less restrictive license would benefit the MDAnalysis
community, increasing the number of codes which can use MDAnalysis,
and enabling users in corporate environments to use the library with more
certainty. The reduced licensing complexity also paves the way for our
proposed MDAKit ecosystem.
Why now the LGPL v3+?
We therefore propose to undergo the process of relicensing MDAnalysis
under the GNU Lesser General Public License v3 (or any later versions).
This open source license fulfills a number of important requirements for us:
- Downstream codes are able to freely import or link to MDAnalysis library components without impacting the license choice of the downstream code.
- Downstream codes are able to use and subclass any MDAnalysis components under its application programming interface (classes, methods, and data objects), without impacting the license choice of the downstream code.
- Codes that either copy or extend the MDAnalysis library should fall under the copyleft license requirements of the MDAnalysis library license.
Thus, it is our view that the LGPL v3+ license gives people the freedom
to choose any license for their own code that makes use of the MDAnalysis
library as a whole (namely import MDAnalysis or subclassing). This
includes closed / commercial licenses (although we encourage the use of
open source licenses). However, one would not be able to just take parts of
the MDAnalysis code and add it into another codebase unless this
other code is then also licensed under a compatible copyleft license
(e.g. GPLv3+/LGPLv3+).
We considered other popular licenses but none fulfilled the requirements
listed above.
How will the relicensing process work?
As of writing, MDAnalysis has over 160 contributors,
all of whom have contributed code under the terms of the GPL v2+
license. We also have a large user community that uses the library
for many wonderful scientific applications, including several
downstream libraries.
Ultimately, the final decision on relicensing rests with code
authors. However, we fully recognise that
this is a big change for the MDAnalysis user base and the wider
molecular sciences community. As always, we are fully invested in
ensuring that our actions reflect the needs of our community. We
therefore want to give everyone an opportunity to ask questions about
or comment on the relicensing effort as part of
this process.
Consultation period (7th November until 5th December 2022)
We will start the process with an open consultation period lasting
28 days from 7th November to 5th December 2022.
During this period we encourage members of the community,
both developers and users, to comment on and ask questions about the
proposed relicensing efforts. The aim is to ensure that relicensing is
indeed in the interest of the community. We will do our best to account
for any concerns raised before attempting to continue with the long and
time-consuming process of relicensing.
We wish to open this conversation on our public forums (mailing lists, Discord, Twitter). As legal
matters such as licensing can sometimes be sensitive in nature we have
also set up an email address ([email protected]) monitored
solely by the MDAnalysis Core Developers for any private queries that
you may have.
A summary of open discussions and frequently asked questions will be
made available on the MDAnalysis wiki.
Note: Whilst the consultation will only last 28 days, we will continue
to engage with conversations on this topic for the entire length of the
relicensing process.
After the consultation period, we will contact every code contributor to
the core MDAnalysis library with a request to agree to changing their
contribution’s license from the current “GPL v2 or any later version”
to “LGPL v3 or any later version”.
It is important that we hear back from as many contributors as possible.
If you have contributed to MDAnalysis in the past but have since changed
your git-linked contact details, we would kindly ask if you could email
[email protected] to let us know how best to contact you.
License change
We do not know how long relicensing will take, especially as contacting
historical contributors will likely be a very slow process. Nevertheless,
our aim is to change the license as quickly as possible. We will keep the
community regularly updated on our progress.
Acknowledgments
We are very grateful for the administrative and legal support from our
fiscal sponsor, NumFOCUS.
– The MDAnalysis Core Developers