Blog

GSoC / Outreachy 2022 wrap-up

This year MDAnalysis participated in two programs with wonderful contributors from around the globe. We continued our participation in Google Summer of Code for another year and also participated in Outreachy for the first time. Huge thanks to these organisations for supporting MDAnalysis and the students.

Google Summer of Code / Outreachy 2022

Google Summer of Code

Our two outstanding GSoC contributors Aya Mohamed Alaa (@aya9aladdin), and Bjarne Feddersen (@BFedder) successfully completed their GSOC projects.

You can read more about their projects and packages in their blog posts:

— GSOC Mentors @jbarnoud @hmacdope @ojeda-e @IAlibay @fiona-naughton @orbeckst @lilyminium @richardjgowers

Google Summer of Code / Outreachy 2022

Outreachy

Uma Kadam (@umak1106) did the very first MDAnalysis Outreachy Internship with her project Improve MDAnalysis by implementing type hinting.

— Outreachy Mentors @jbarnoud @micaela-matta @richardjgowers

Summary

We have immensely enjoyed working with Aya, Bjarne and Uma and look forward to seeing how they continue to contribute to the MDAnalysis community, and wish them luck in their future endeavours.

Outreachy Report - 2022 Improve MDAnalysis by implementing type hinting

About Me

I am Uma Kadam , a Computer Science and Engineering undergraduate at Indian Institute Of Information Technology Guwahati. My interests primarily lie in exploring ML & AI which is evident from my research internship experience on NLP . Participating in numerous hackathons, some of which I won, allowed me to explore new technologies and domains in computer science. My involvement with Outreachy provided me with a first-hand introduction to Open Source and set me on the path to a successful career in technology.

Linkedin LinkedIn   GitHub GitHub

Outreachy

Outreachy is a 12+ week internship program where contributors work with an open-source organization under the guidance of experienced mentors. By providing opportunities to work with participating organizations, Outreachy supports people from underrepresented groups in technological sector. Providing mentorship to build technical skills and establishing an inclusive community that has no room for systematic bias or discrimination, it aims to help minority members pave the way into the tech industry.

Motivation

The object-oriented Python library MDAnalysis analyzes trajectory data derived from molecular dynamics (MD) simulations in many popular formats. Python’s dynamic nature enables us to develop with speed, flexibility, and ease of use but if you are not careful, you may trade short-term expedience for long-term lack of maintainability because of the dynamic nature of Python. Type hints are implemented with the help of typing module.

While type hints and type annotations do hint towards or indicate the appropriate types they do not enforce them. By utilizing typecheckers such as mypy, I made sure that the code is performing what it should regarding the types passed around between functions, and the annotated function signatures only boosted the readability of the code as well as improved communication within it.

Introducing type annotations and type hints provided a multitude of benefits like:

  • The need to document the type in the docstring got eliminated.

  • Datatypes got clearly defined in the code, removing any potential datatype ambiguity.

  • It was beneficial in catching errors, improving linting and providing a more cleaner architecture for MDanalysis .

  • It was helpful in making the code more organized and speeding up the debugging process .

  • It provided us with optional static typing to leverage the best of both static and dynamic typing.

Contributions made during Outreachy Project

  • Addition of Mypy requirements to CI pipeline: #3705

    Addition of mypy to github actions workflow provided us the ability to run a mypy check on every new pull request made and to raise appropriate warnings whenever the type hints provided were incorrect or erroneous. Added customizations like running mypy checks only on more prioritized modules and so on . The type checking in CI was made to be blocking wherever required.

  • Providing Type Hints for lib module and annotations for init file: #3823 #3729

    The first PR deals with task of providing type annotations for the init file.

    The second PR deals with the task of providing type annotations and type hints for the lib module and usage of mypy type checker to ensure all the functions and variables are correctly type hinted. Usage of typing module and numpy.typing module for providing type hints. In the lib module I provided type hints for NeighbourSearch, PeriodicKDTree, Mathematical_helper_functions and so on.

  • Providing type hints for core module: #3719 Docs

    Dealt with the issue caused by circular imports and provided type hints for the core module.

  • Providing type hints for visualization module, auxiliary module, topology module, analysis module and converters module

    #3781 #3746 #3782 #3744 #3752 #3784

    Usage of np.ndarray and other type hints from the typing module for type hinting and annotating the streamlines and streamlines_3D files in the visualization module.Provide type hints, as well as use mypy to efficiently perform type checks for the different parsers present in the Topology module, and see if any unit tests have been broken as a result.Providing type hints for converters, auxiliary and other modules.

The annotations are not all merged yet, and they do not cover the full module. However, spending efforts on type annotations gave us an idea of the challenges it represents for MDAnalysis and will lead to better code in the future. Already, we identified corner cases that could have led to bugs.

Example :

    def search(self, atoms, radius, level='A'):

With the help of type hints this is tranformed to:

    def search(self, atoms: AtomGroup, radius: float, level: str = 'A') -> Optional[Union[AtomGroup, ResidueGroup, SegmentGroup]]:

Here the type hints used for the input parameters of the function search inform us that the type of Atoms is Atomgroup, radius is a float value and level is a string which takes default value ‘A’.

The return type of the function can be None or AtomGroup or ResidueGroup or SegmentGroup. Here the Optional keyword suggests that the type can be None or the type enclosed within its brackets and the Union keyword suggests that the type could be anything from the types conatined within its brackets.

Conclusion :

In my opinion, Outreachy provided an incredible learning opportunity for a newcomer to open source development like me. There is no doubt that the internship was an exceptional experience and I would like to extend my sincere gratitude to Outreachy, my mentors, and the MDAnalysis community for that experience. During my time working with the MDAnalysis library, I have learned a great deal not only from the technical side of working with the library, but also from experiencing the rich set of policies and guidelines that are created by the MDAnalysis community, and how they shape MDAnalysis’ inner workings.

GSoC Report: Adding a new guessing methodology for molecular systems properties in MDAnalysis

Project motivation

In MDAnalysis, molecular topologies come from various file formats and forcefields, each of which has its own rules regarding formatting and molecular properties representation. Not all molecular properties are read from files, either because they are simply not provided in the file or because they can be inferred from the nature of the molecular system or forcefield definitions (e.g., for atomic systems masses can be inferred from atom type, while in the Martini forcefield there are only three definitions of masses for its known beads). MDAnalysis had different guessing methods that can infer missing molecular properties for the Universe’s topology (eg. guessing Atom type, mass, and bond). But those methods suffered from the main issue of being a general-prpose methods, that doesn’t cover any kind of special definitions for the various forcefields and file formats, which causes the guessing output to be inaccurate and not reliable for some topologies; as not all topologies speak the same language. That’s why we need to develop tailored context-aware guesser classes to serve different molecular dynamics worlds, in addition, we need to establish a way to pass this context to the Universe, so the Universe uses the appropriate guesser class in guessing topology attributes.

So, my GSoC project was about developing a new guessing API, that will make the guessing process more convenient and context-specific, plus developing the first two context-specific guessers, which are PDB Guesser and Martini forcefield Guesser.

1- Developing the new guesser API

The first thing to start with was the development of the new guesser API and the building of the DefaultGuesser class (which will represent a general-purpose guesser that has the current general-purpose guesser methods). But before that, we must first remove all the usage of the old guessing methods inside the MDAnalysis library to prepare for the new stage of having context-aware guessers and the guessing API. But this can lead to unexpected wrong behavior if we did without being careful. Guessing mainly takes place inside 16 topology parsers (types and mass guessing specifically), and not all guessing inside those parsers happens the same way. So, to not break the current behavior, we must be careful with removing all the guessing from those parsers and yet keep the same default output while developing the new API.

N.B.: it was important to have all the work on developing both the guesser API and the DefaultGuesser plus changing parsers behavior to be done inside one pull request here: #3753; because all thoses changes have a direct effect on each others, for example, the optmization of the new guesser API must be measured by how far the topology output changed after removing all the guessings processes from parsers. This was the most challenging thing about the project; as all these required changes are interconnected and must be done simultaneously, to see how it reflects and interacts with each other, and accordingly we can achieve optimization. So, working on this PR was not sequential as mentioned below, it was rather a continuous back-and-forth update on all touched parts of the library.

- Removing types and mass guessing from Topology Parsers (commit 0cbc497)

First, I navigated through all parsers to see where and how types and mass guessing take place. The below table shows the behavior of types and masses guessing inside parsers.

Guessing inside Parsers

As we see, not all parsers use the same attribute in guessing masses. The DefaultGuesser guesses masses by first looking for the elements attribute and if not available, it looks for the types attribute, this behavior preserves the default behavior of all parsers except the ones that guess masses from names. Parsers that guess masses from types are FHIAIMSParser, TXYZParser, and XYZParser. For both FHIAIMSParser and XYZParser element attributes are provided, and luckily it is just a copy of the names attribute, so the default behavior of these parsers is not affected. For TXYZParser, I added a feature of publishing Elements attribute to its topology output in case all read names are valid elements (merged PR #3836).

- Developing the guess_topologyAttributes API

As described in issue # 3704, the new guesser API will have the tasks of guessing and adding new topology attributes to the Universe. The user only passes the context (eg. “default”, “pdb”, “martini”) and the attributes of interest to be guessed, either through the to_guess or force_guess parameters and the API does the rest. Since I was writing my proposal, I was keen to develop the guesser API with this high level of abstraction and make guessing attributes easy and straightforward to the user without bothering about which method should be called and which attribute(s) is used as a reference in guessing other attributes, so all this work is handled inside the API and the guessers classes. The guess_topologyAttributes() make the following processes:

  1. gets the appropriate Guesser class that matches the passed context.

  2. checks if the Guesser class support guessing the attributes passed to theto_guess and/or force_guess parameters.

  3. check if any attribute passed to the to_guess parameter already exists in the topology attributes of the Universe. If so, warn the user that only empty values will be filled for this attribute, if any exists and in case the user wishes to override all the attribute values, he must pass it to the force_guess parameter instead of the to_guess one.

  4. guessing attributes is handled by the guess_attrs method(), which is declared inside the parent guesser class GuesserBase. It manages partial and complete guessing of attributes and calls the appropriate guesser method for each attribute.

  5. each attribute guesser method searches for the reference attribute to begin guessing from it, and if not found in the Universe, it calls the guess_topologyAttributes to try guessing this reference attribute.

  6. after guessing the attribute, the API adds it to the Universe with the help of add_topologyAttr or _add_topology_objects Universe’s methods

Example of using the guess_topologyAttributes at Universe initiation:

# to guess bonds for a [Universe]:
 
import MDAnalysis as mda
from MDAnalysisTests.datafiles import two_water_gro
 
u = mda.Universe(two_water_gro, context='default', to_guess=['bonds'])

Example of using the guess_topologyAttributes directly:

# guess masses and types attribute::
 
u.guess_TopologyAttributes(context='default', to_guess=['masses', 'types'])

Example of passing empty to_guess list to guess_topologyAttributes so no guessing takes place at universe creation:

# silencing masses and types guessing at universe creation::
 
import MDAnalysis as mda
from MDAnalysisTests.datafiles import two_water_gro
 
u = mda.Universe(two_water_gro, to_guess=())

More explanation is found in the user guide here: guess_topologyAttributes, Guessing

- DefaultGuesser class

The DefaultGuesser class holds the same old guessing methods but with modifying them to be compatible with the new guess_topologyAttrsibutes API. Moreover, I added a new feature to type guessing method so that it now can guess types from masses if names are not available (commit 348f62d)

- Testing the old parser behavior is preserved

The to_guess parameters of the Universe have a default value of ("types", "masses") to maintain the default behavior of the parsers. To make sure of not breaking old behavior, I added three tests in the parser’s base.py module to check three things:

  1. types and masses are guessed as expected in all Universe’s created with the parsers after removing guessing from them.

  2. The values of the guessed types with the guess_topologyAttributes API are the same as those from the old behavior.

  3. The values of the guessed masses with the guess_topologyAttributes API are the same as those from the old behavior. Once those three tests are passed, then it’s safe to say that we are not breaking the default behavior of the code (commit af84927).

I also discovered a bug in Topology’s methods gussed_attributes() and read_attributes() while working on developing the guesser API and fixed it (merged PR # 3779)

2- Working on PDBGuesser

Currently, I’m working on developing the PDBGuesser issue #3856. The generic guessing methods can’t deal optimally with PDB files, which makes guessing processes slow and not reliable for pdb-generated topologies. So, if we had a PDB-aware guesser, this process could improve significantly, especially that PDB has a huge archive called the chemical component dictionary (CCD), which describes every single residue that exists in the PDB database (its atom names, atom elements, bonds, bond orders, charges, aromaticity, etc.), plus that PDB has a well-formatted structure, that makes it easy to infer topology properties from it. I’m working on developing guesser methods for elements, masses, bonds, and aromaticity.

a. Elements guessing

PDB has a well-defined format for names, from which we can get the atomic symbol easily. Atom names are found in columns 13-16. The first two characters represent the atom symbol, and if the symbol consists of one character, then the first character is blank. At the third character comes the remoteness indicator code for amino acid residues ['A', 'B', 'G', 'D', 'E', 'Z', 'H']. Then the last character is a branching factor if needed.

The above rules are the standard rules but there are some exceptions to them:

  • If the first character is blank and the second character is not recognized as an atomic symbol, we check if the third character contains “H”, “C”, “N”, “O”, “P” or “S”, then it is considered the atomic symbol.

  • If the first character is a digit, , , or *, then the second character is the atomic symbol.

  • If the first character in ‘H’ and the residues are a standard amino acid, nucleic acid, or known hetero groups (found in pdb_tables.py), then the atom element is ‘H’.

  • If the first two characters are not recognized as an atomic symbol and the first character is ‘H’, then the element is H.

Based on these rules, I developed the guess_types method for PDBGuesser pr #3866.

b. Masses

Mass guessing is the same as generic mass guessing methods, I just added a more detailed message about how many successful guessings happened and how many failed, in addition to which atom type/element the guesser failed to guess mass to.

PDBGuesser is still under discussion, so some of the current implementations may be updated in the future.

Future work

I’m currently working on the PDBGuesser and plan to implement the MartiniGuesser after the GSoC period. Completing those context-aware guessers is not just important for having more tailored guessers, but also crucial for testing how the new guesser methodology is flexible and can work fine with different types of Guessers classes that will be implemented in the future.

Lessons learned

GSoC was my first software internship, and I feel lucky that I got the chance to participate in such a wonderful program under the MDAnalysis organization, where everyone is supportive and friendly. I have learned about contributing to the open-source community, and the value of being a part of it. I also learned more about software engineering principles and how to better estimate the time and effort needed for your work. Additionally, I learned how to set priorities in developing new features, and the importance of test-driven software development. Finally, I learned how to effectively communicate and represent my work, and the importance of clean code and well-documented steps.

I’m happy with my experience with MDAnalysis and looking forward to increasing my contribution to the library, especially in the guessing and topology parts which I gained lots of experience at.

Acknowledgements

I’d like to thank all my mentors for their effort and valuable lessons they gave to me through the program period, and I’m specially gratful for @jbarnoud (Jonathan Barnoud) for his endless guidance and patience through every step in the project.

For more information about the details of my journey with GSoC throught the summer you can check my personal blog post Here.

@aya9aladdin