Blog

GSoC 2022 - Adding Energy Readers to MDAnalysis

Motivation

In molecular dynamics simulations, users frequently have to inspect energy-like terms such as potential or kinetic energy, temperature, or pressure. This is so common a task that even small inefficiencies add up. Currently, users have to create intermediate files from their MD simulation’s output files to obtain plot-able data, and this quickly becomes cumbersome when multiple terms are to be inspected. Being able to read in the energy output files directly would make this more convenient.

Therefore, in this project I wanted to add readers for energy-type files (output files containing information on potential and kinetic energy, temperature, pressure, and other such terms) from a number of MD engines to the auxiliary module of MDAnalysis. This would make quality control of MD simulations much more convenient, and allow users to analyse the energy data directly from within their scripts or Jupyter notebooks, without switching windows or writing intermediate files.

In the first instance, I focussed on a reader for EDR files, which are energy files written by GROMACS during simulations. EDR files are binary files which follow the XDR protocol. To read these files, @jbarnoud had previously written the panedr Python package, which was the foundation of my work this summer.

Adapting Panedr for use in MDAnalysis

The panedr package makes use of the xdrlib Python module to parse EDR files and return the data in the form of a pandas DataFrame. My GSoC project started out adapting this package for use in MDAnalysis. In particular, we wanted to avoid making pandas a dependency in MDAnalysis. This necessitated some refactoring of panedr (PR #33), which ultimately led to a restructuring of the code into two distinct packages: panedr and pyedr (PRs #42 and #50). Both packages read EDR files, but panedr returns the data as a pandas DataFrame, whereas pyedr returns it as a dictionary of NumPy arrays. Both also expose a function to return a dictionary of units of the energy terms found in the file (PR #56).

Example:

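import pyedr
file = "path/to/edr/file.edr"
# Dictionary mapping each energy term to a NumPy array of its values
energy_dictionary = pyedr.edr_to_dict(file)
# Dictionary mapping each energy term to its unit
unit_dictionary = pyedr.get_unit_dictionary(file)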
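
Because the data arrive as plain dictionaries, they can be inspected without pandas. The following is a minimal sketch of that idea, not part of pyedr itself. It assumes that each key is a term name mapped to one value per stored step and that the time axis is stored under the key "Time"; the helper summarise_terms and the stand-in values are hypothetical and only mimic that shape.

```python
from statistics import mean, pstdev


def summarise_terms(energy_dictionary, unit_dictionary):
    """Return {term: (mean, standard deviation, unit)} for every term except Time."""
    summary = {}
    for term, values in energy_dictionary.items():
        if term == "Time":  # assumed name of the time axis
            continue
        unit = unit_dictionary.get(term, "unknown")
        summary[term] = (mean(values), pstdev(values), unit)
    return summary


# Stand-in data shaped like the dictionaries described above (values are made up).
energy_dictionary = {
    "Time": [0.0, 1.0, 2.0, 3.0],
    "Temperature": [299.0, 301.0, 300.0, 300.0],
    "Potential": [-524700.0, -524500.0, -524650.0, -524550.0],
}
unit_dictionary = {"Time": "ps", "Temperature": "K", "Potential": "kJ/mol"}

summary = summarise_terms(energy_dictionary, unit_dictionary)
for term, (avg, std, unit) in summary.items():
    print(f"{term}: {avg:.1f} +/- {std:.1f} {unit}")
```

With real data, the two dictionaries returned by pyedr would take the place of the stand-in ones.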

EDRReader

With pyedr available, I started work on the implementation of an EDRReader in MDAnalysis (PR #3749). Here, I benefited hugely from the existing AuxReader framework. However, from the outset, it was clear that the auxiliary API would need to be changed to accommodate the large number of terms found in EDR files.

The XVGReader was built under the assumption that auxiliary files would only ever contain one time-dependent term, so each entry in an XVG file would contain a time value and a data value. Associating data with a trajectory via the add_auxiliary method thus only required two arguments: a name under which to store the data in universe.trajectory.ts.aux, and the file from which to read it. This is not suitable for EDR files, as they can contain dozens of terms per time point, only a few of which might be relevant for a given analysis.

Therefore, the auxiliary API had to be changed as follows: while the XVGReader still works as before, the new base class for adding auxiliary data expects a dictionary to be passed. The dictionary maps the names to be used in MDAnalysis to the names read from the EDR file. This is shown in the following minimal working example:

import MDAnalysis as mda
from MDAnalysisTests.datafiles import AUX_EDR, AUX_EDR_TPR, AUX_EDR_XTC
term_dict = {"temp": "Temperature", "epot": "Potential"}
aux = mda.auxiliary.EDR.EDRReader(AUX_EDR)
u = mda.Universe(AUX_EDR_TPR, AUX_EDR_XTC)
u.trajectory.add_auxiliary(term_dict, aux)
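
The role of the dictionary can be illustrated without MDAnalysis. The sketch below is not the AuxReader implementation; it only shows the idea of picking a few terms out of the many stored for one time point and storing them under user-chosen names. The function select_terms and the values in step_data are hypothetical.

```python
def select_terms(term_dict, step_data):
    """Pick the requested terms from one step's data and store them under new names."""
    missing = [name for name in term_dict.values() if name not in step_data]
    if missing:
        raise KeyError(f"terms not found in file: {missing}")
    return {alias: step_data[name] for alias, name in term_dict.items()}


# One hypothetical time point from an energy file (values are made up).
step_data = {"Time": 10.0, "Temperature": 300.2, "Potential": -524600.5, "Pressure": 1.01}
term_dict = {"temp": "Temperature", "epot": "Potential"}

aux_values = select_terms(term_dict, step_data)
print(aux_values)
```

Terms that are not requested, such as the pressure here, are simply left out, which is what keeps the per-frame auxiliary data small.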

Aside from this API change, the EDRReader can do everything the XVGReader can. In addition to that, it has some new functionality.

  • Because EDR files can become reasonably large, a memory warning will be issued when more than a gigabyte of storage is used by the auxiliary data. This default value of 1 GB can be changed by passing a value as memory_limit when creating the EDRReader object.
  • EDR files store data of a large number of different quantities, so it is important to know their units as well. The EDRReader therefore has a unit_dict attribute that contains this information. By default, units found in the EDR file will be converted to MDAnalysis base units on reading. This can be disabled by setting convert_units to False on creation of the reader.
  • In addition to associating data with trajectories, the EDRReader can also return the NumPy arrays of selected data, which is useful for plotting, for example. This is done via the EDRReader’s get_data method.
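
To get a feeling for when the memory warning becomes relevant, the size of the auxiliary data can be estimated by hand. The following sketch is only an illustration and not the check used by the EDRReader. It assumes that every value is stored as an 8-byte floating point number and that one gigabyte means 10^9 bytes; both function names are hypothetical.

```python
import warnings


def estimate_aux_memory(n_terms, n_steps, bytes_per_value=8):
    """Bytes needed to hold n_terms arrays of n_steps values each."""
    return n_terms * n_steps * bytes_per_value


def warn_if_too_large(n_bytes, memory_limit=1_000_000_000):
    """Warn, but do not fail, when the estimate exceeds memory_limit (in bytes)."""
    if n_bytes > memory_limit:
        warnings.warn(
            f"auxiliary data uses {n_bytes / 1e9:.2f} GB, "
            f"above the limit of {memory_limit / 1e9:.2f} GB"
        )
        return True
    return False


# 50 energy terms stored for one million steps
n_bytes = estimate_aux_memory(n_terms=50, n_steps=1_000_000)
print(n_bytes)  # 400000000
exceeded = warn_if_too_large(n_bytes)
```

Under these assumptions, a file with 50 terms would need more than 2.5 million stored steps before the default limit is reached.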

Additionally, the new auxiliary readers allow the selection of frames based on the values of the auxiliary data. For example, it is possible to select only frames with a potential energy below a certain threshold as follows:

import numpy as np
u = mda.Universe(AUX_EDR_TPR, AUX_EDR_XTC)
term_dict = {"epot": "Potential"}
u.trajectory.add_auxiliary(term_dict, aux)
# Collect the indices of all frames with a potential energy below the threshold
selected_frames = np.array([ts.frame for ts in u.trajectory if ts.aux.epot < -524600])

Having selected these frames, it is possible to analyse only this subset of a trajectory:

protein = u.select_atoms("protein")
# do_analysis is a placeholder for any user-defined analysis function
for ts in u.trajectory[selected_frames]:
    do_analysis(protein)
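
The selection pattern itself can be tried out without any simulation files. The sketch below applies the same threshold to a made-up list of potential energies, one per frame; the helper frames_below is hypothetical and stands in for the loop over the trajectory.

```python
def frames_below(values, threshold):
    """Return the indices of all frames whose value lies below threshold."""
    return [frame for frame, value in enumerate(values) if value < threshold]


# Made-up potential energies in kJ/mol, one per frame.
epot = [-524550.0, -524700.0, -524610.0, -524590.0, -524800.0]

selected_frames = frames_below(epot, -524600)
print(selected_frames)  # [1, 2, 4]
```

The resulting list of indices plays the same role as selected_frames in the trajectory slicing shown above.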

More details on the EDRReader’s functionality can be found in the MDAnalysis User Guide.

Outlook

Through this project, the AuxReader framework was expanded, and handling of EDR files was made more convenient with pyedr and the EDRReader. I am continuing to improve these contributions, and will include an auxiliary reader for NumPy arrays in the future. This NumPyReader will be very useful, because many analysis methods in MDAnalysis return their results in the form of NumPy arrays. Having the option of associating these results with trajectories will facilitate further analyses, for example allowing the slicing of trajectories by RMSD to a reference structure.

In general, the changes made to the auxiliary API should make it easier for additional AuxReaders to be developed. Being able to easily associate any number of terms to each time step is helpful for general readers (for example for parsing CSV data) and for more specific readers (for parsing energy files generated by other MD engines, for example Amber or NAMD). With the actual handling of the data already taken care of, the challenge in the implementation here would lie in the correct parsing of the plain text files, and in proper testing and future proofing.

Lessons learned

Participating in the Summer of Code was a great opportunity for me. I learned a lot, from small things like individual code patterns to larger points concerning overall best practices, the value of test-driven development, and package management. This is thanks in large part to the mentorship and advice I have received from @hmacdope, @ialibay, @orbeckst, and @fiona-naughton. Thanks very much to you all, and to @jbarnoud.

@bfedder


MDAnalysis CZI EOSS 5 Grant Outreach and Project Manager

We are happy to announce that MDAnalysis has been awarded a grant from the Chan Zuckerberg Initiative as part of the Essential Open Source Software for Science program: “EOSS5: Growing the MDAnalysis community sustainably: A dedicated project manager, teaching and outreach initiatives”.

The MDAnalysis organisation strongly believes that engaging in outreach, mentoring and teaching is key to its mission of being the leading software package for molecular simulation analysis in Python.

This 2-year grant will enable us to hire a full-time community, outreach, and project manager to extend our teaching and mentoring commitments and engage with the molecular simulation community across academia and industry.

Over the next two years, our key deliverables include:

  • Increasing our participation in outreach activities
  • Organising user group meetings (UGMs)
  • Hosting a series of online teaching workshops (3 per year)
  • Networking with other software projects within and outside the molecular simulation space
  • Engaging with industrial partners towards opening additional funding streams for the project

A big thank you to all contributors, past and present, for making this possible.

See the job description and apply for the role of community, outreach and project manager

Relicensing MDAnalysis

This blog post outlines MDAnalysis’ proposal to change its license to the GNU Lesser General Public License (LGPL v3+).

We provide a summary of our reasons for proposing this license change, alongside upcoming actions for community members and library contributors.

⚠️ Disclaimer: The MDAnalysis core team members are not lawyers. As such, the information provided here does not, and is not intended to, constitute legal advice. This blog post also does not represent MDAnalysis’ full legal position on software licensing; it simply aims to inform MDAnalysis developers and users on why we believe the library should be relicensed.

Further information on open-source software licensing can be found from sources such as the Open Source Initiative, tl;drLegal and the Software Sustainability Institute.

Should you have any concerns about licensing, we always strongly recommend getting legal advice before making any decisions on how licensing changes may affect you.

Overview

We want to change the license of MDAnalysis from the GNU General Public License v2 (or any later versions) (GPL v2+) to the less restrictive GNU Lesser General Public License v3 (or any later versions) (LGPL v3+) license. Both are open source licenses but it is our view that the LGPL v3+ will give developers more freedom in how they license any of their own codes that make use of MDAnalysis.

As detailed by the Open Source Definition, licenses are core to the definition of open source. “Open source doesn’t just mean access to the source code”. The license defines how code can be used, copied, changed, and incorporated into other code.

License changes will affect how people interact with the MDAnalysis code base going forward. We need the agreement of our contributors and community members to change from GPL v2+ to LGPL v3+.

In this post we want to share our motivation, outline the relicensing process, and invite comments / questions from the community.

Rationale for license change

Why is GPL v2+ no longer the best choice?

Since its initial release in 2008, MDAnalysis has grown from a small Python package used by a handful of enthusiastic graduate students and postdocs to a mature library that is used by thousands of researchers in the molecular sciences. The MDAnalysis library was published under an open source license from the start so that anyone could freely use it, contribute to it, and build on it. We chose the GNU General Public License version 2+ (GPL v2+) for this purpose. The GPL v2+ has a “copyleft” clause that requires anyone using MDAnalysis in their own code to also adopt a compatible version of the GPL for their code. This meant that code contributors could be confident that any time and work that they invested into MDAnalysis would not end up contributing to software without open-source licensing.

However, the GPL v2+ has also created barriers to adoption of MDAnalysis. Under many interpretations, ours included, its copyleft clause prevents developers who use MDAnalysis from making their own code available under non-GPL licenses. It is the MDAnalysis core team’s view that we do not want to dictate how our developers and users should license their code, but we do wish that work on the MDAnalysis library remains open and free.

Changing to a less restrictive license would benefit the MDAnalysis community, increasing the number of codes which can use MDAnalysis, and enabling users in corporate environments to use the library with more certainty. The reduced licensing complexity also paves the way for our proposed MDAKit ecosystem.

Why now the LGPL v3+?

We therefore propose to undergo the process of relicensing MDAnalysis under the GNU Lesser General Public License v3 (or any later versions). This open source license fulfills a number of important requirements for us:

  1. Downstream codes are able to freely import or link to MDAnalysis library components without impacting the license choice of the downstream code.

  2. Downstream codes are able to use and subclass any MDAnalysis components under its application programming interface (classes, methods, and data objects), without impacting the license choice of the downstream code.

  3. Codes that either copy or extend the MDAnalysis library should fall under the copyleft license requirements of the MDAnalysis library license.

Thus, it is our view that the LGPL v3+ license gives people the freedom to choose any license for their own code that makes use of the MDAnalysis library as a whole (namely import MDAnalysis or subclassing). This includes closed / commercial licenses (although we encourage the use of open source licenses). However, one would not be able to just take parts of the MDAnalysis code and add it into another codebase unless this other code is then also licensed under a compatible copyleft license (e.g. GPLv3+/LGPLv3+).

We considered other popular licenses but none fulfilled the requirements listed above.

How will the relicensing process work?

As of writing, MDAnalysis has over 160 contributors, all of whom have contributed code under the terms of the GPL v2+ license. We also have a large user community that uses the library for many wonderful scientific applications, including several downstream libraries.

Ultimately, the final decision on relicensing rests with code authors. However, we fully recognise that this is a big change for the MDAnalysis user base and the wider molecular sciences community. As always, we are fully invested in ensuring that our actions reflect the needs of our community. We therefore want to give everyone an opportunity to ask questions about or comment on the relicensing effort as part of this process.

Consultation period (7th November until 5th December 2022)

We will start the process with an open consultation period lasting 28 days from 7th November to 5th December 2022.

During this period we encourage members of the community, both developers and users, to comment on and ask questions about the proposed relicensing efforts. The aim is to ensure that relicensing is indeed in the interest of the community. We will do our best to account for any concerns raised before attempting to continue with the long and time-consuming process of relicensing.

We wish to open this conversation on our public forums (mailing lists, Discord, Twitter). As legal matters such as licensing can sometimes be sensitive in nature, we have also set up an email address ([email protected]) monitored solely by the MDAnalysis Core Developers for any private queries that you may have.

A summary of open discussions and frequently asked questions will be made available on the MDAnalysis wiki.

Note: Whilst the consultation will only last 28 days, we will continue to engage with conversations on this topic for the entire length of the relicensing process.

Contacting contributors (6th December onwards)

After the consultation period, we will contact every code contributor to the core MDAnalysis library with a request to agree to changing their contribution’s license from the current “GPL v2 or any later version” to “LGPL v3 or any later version”.

It is important that we hear back from as many contributors as possible. If you have contributed to MDAnalysis in the past but have since changed your git-linked contact details, we would kindly ask if you could email [email protected] to let us know how best to contact you.

License change

We do not know how long relicensing will take, especially as contacting historical contributors will likely be a very slow process. Nevertheless, our aim is to change the license as quickly as possible. We will keep the community regularly updated on our progress.

Acknowledgments

We are very grateful for the administrative and legal support from our fiscal sponsor, NumFOCUS.

– The MDAnalysis Core Developers