Creating parser plugins¶
The role of a parser is to map a structured or unstructured file into a well-defined, standardized, and (ideally) FAIR data schema. In our case, we are interested in mapping simulation data into the nomad-simulations schema. When the sections defined in nomad-simulations are not enough, any user can extend and tune the schema to their specific needs, see Extending the schema.
In NOMAD, when a user creates an upload and pushes some files, a processing occurs. This processing involves several steps, of which matching and parsing are the most important and relevant ones:
- Matching the file(s) to the relevant parser class, called <ParserName>Parser. Only a specific file is matched with the parser class, which receives the name of the mainfile.
- Calling the parse() function of the <ParserName>Parser class.
- The parse() function populates the archive from the mainfile and any other auxiliary files which might be relevant. This archive contains all of the relevant metadata in a NOMAD entry.
You can find more information about processing in the NOMAD documentation, see Explanation - Processing. Schematically, this processing can be visualized as:
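As a rough illustration, the match-then-parse flow can also be sketched in plain Python. The names used here (`ToyParser`, `process_upload`) are hypothetical stand-ins and not part of the NOMAD API; the real processing involves more steps than shown.

```python
import re


class ToyParser:
    """Hypothetical stand-in for a <ParserName>Parser class (not NOMAD code)."""

    def __init__(self, name_re: str):
        self.name_re = re.compile(name_re)

    def is_mainfile(self, filename: str) -> bool:
        # Matching step: decide whether this file is a mainfile for this parser
        return self.name_re.fullmatch(filename) is not None

    def parse(self, mainfile: str, archive: dict) -> None:
        # Parsing step: populate the archive from the mainfile
        archive['mainfile'] = mainfile


def process_upload(files: list[str], parsers: list[ToyParser]) -> list[dict]:
    """Create one archive (entry) per file that matches a parser."""
    archives = []
    for filename in files:
        for parser in parsers:
            if parser.is_mainfile(filename):
                archive: dict = {}
                parser.parse(filename, archive)
                archives.append(archive)
                break  # the file is handled by the first matching parser
    return archives


entries = process_upload(['glycine.log', 'notes.txt'], [ToyParser(r'.*\.log')])
print(entries)
```

Only the file whose name matches the parser produces an entry; the other file is left unparsed.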
On this page, we are going to learn how to create a parser, its structure, and its basic functionalities. NOMAD parsers are plugins, and thus can be defined in their own repositories and developed independently of the main software. If you administer a NOMAD Oasis, you can read more about connecting plugins in the NOMAD documentation, see NOMAD Oasis - Install plugins.
Starting a plugin project¶
To create your own parser plugin, visit the NOMAD plugin template and click the “Use this template” button (you need a GitHub account to do so):
You can decide where to host the parser plugin. Once this is done, you can clone the generated repository on your local machine. For the purpose of this example, we will use JosePizarro3/example-plugin:
All the steps we are going to do can be found in the branch pyscf-new-template.
Go to the example-plugin directory and create a virtual environment with Python 3.9, 3.10, or 3.11, and activate it:
Install uv (a very fast installer of Python packages) and cruft, and follow the instructions in the README.md to generate the parser:
pip install --upgrade pip
pip install uv cruft
cruft create https://github.com/FAIRmat-NFDI/cookiecutter-nomad-plugin
You will be prompted with some questions and information regarding the plugin. Make sure to include both the parser and schema_package options (the latter will be used in the Extending the Schema part):
[1/13] full_name (John Doe): <whatever-name>
[2/13] email (john.doe@physik.hu-berlin.de): <whatever-email>
[3/13] github_username (Github organization or profile name, default: foo): <whatever-github-name>
[4/13] plugin_name (foobar): nomad-parser-pyscf
[5/13] module_name (recommended: press enter to use the default module name)
(nomad_parser_pyscf):
[6/13] short_description (Nomad example template): NOMAD parser plugin for PySCF simulations output in a log text file.
[7/13] version (0.1.0):
[8/13] Select license
1 - MIT
2 - BSD-3
3 - GNU GPL v3.0+
4 - Apache Software License 2.0
Choose from [1/2/3/4] (1): <whatever-license>
[9/13] include_schema_package [y/n] (y): y
[10/13] include_normalizer [y/n] (y): n
[11/13] include_parser [y/n] (y): y
[12/13] include_app [y/n] (y): n
[13/13] include_example_uploads [y/n] (y): n
You can use the script under the generated folder to move all files one level up:
The structure of the plugin is (without including the docs and mkdocs.yml files):
example-plugin/
├── README.md
├── LICENSE
├── pyproject.toml
├── src/
│   └── nomad_parser_pyscf/
│       ├── __init__.py
│       ├── parsers/
│       │   ├── __init__.py
│       │   └── parser.py
│       └── schema_packages/
│           ├── __init__.py
│           └── schema_package.py
├── tests/
│   ├── conftest.py
│   ├── data/
│   │   ├── example.out
│   │   └── test.archive.yaml
│   ├── parsers/
│   │   └── test_parser.py
│   └── schema_packages/
│       └── test_schema_package.py
└── ... (other files)
You can read more about plugins in the NOMAD documentation, see How to get started with plugins. The entry point for a parser plugin is defined in the src/nomad_parser_pyscf/parsers/__init__.py file.
Matching the files to a parser¶
We are going to consider an example file generated by the software PySCF (click here for download), and use it to parse data into the nomad-simulations schema. If you want to learn more about extending the nomad-simulations schema when the information is not defined, see Extending the Schema.
In the template, go to the entry point in src/nomad_parser_pyscf/parsers/__init__.py and change the content to:
from nomad.config.models.plugins import ParserEntryPoint


class PySCFEntryPoint(ParserEntryPoint):
    def load(self):
        # Import inside load() so the parser module is only imported when needed
        from nomad_parser_pyscf.parsers.parser import PySCFParser

        return PySCFParser(**self.dict())


parser_entry_point = PySCFEntryPoint(
    name='PySCFParser',
    description='Parser for PySCF output written in a log text file.',
    # Raw strings avoid invalid escape sequence warnings for `\.`
    mainfile_name_re=r'.*\.log.*',
    mainfile_contents_re=r'PySCF version [\d\.]*',
)
Matching other mainfiles
In the ParserEntryPoint class, there are several attributes that can be defined to match a file. These can be the mainfile name, a regular expression (regex) at the beginning of the mainfile, binary file headers, etc. For your specific use case, you need to identify which file is the mainfile and use it to match your parser. You can read more about the options in the NOMAD documentation page Reference - Configuration - ParserEntryPoint.
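To get a feeling for how the two regular expressions of the entry point behave, they can be tried out with Python's standard `re` module. This is only a sketch: the helper `looks_like_mainfile` is hypothetical, and it assumes that the name regex has to match the whole file path while the contents regex is searched for in the first part of the file. Check the ParserEntryPoint reference for the exact matching rules.

```python
import re

# Same patterns as in the entry point above
MAINFILE_NAME_RE = re.compile(r'.*\.log.*')
MAINFILE_CONTENTS_RE = re.compile(r'PySCF version [\d\.]*')


def looks_like_mainfile(path: str, head: str) -> bool:
    """Hypothetical helper: True if both the name and the contents match."""
    name_ok = MAINFILE_NAME_RE.fullmatch(path) is not None
    contents_ok = MAINFILE_CONTENTS_RE.search(head) is not None
    return name_ok and contents_ok


# Made-up file header, for illustration only
sample_head = 'System: example\nPySCF version 2.2.1\n'

print(looks_like_mainfile('tests/data/glycine.log', sample_head))  # True
print(looks_like_mainfile('tests/data/glycine.out', sample_head))  # False
print(looks_like_mainfile('tests/data/other.log', 'no version here'))  # False
```

Requiring both conditions keeps the parser from claiming unrelated `.log` files that happen to sit in the same upload.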
Note that we deleted the configuration parameter field (as it is not relevant for the purpose of this example), changed some naming of files (src/nomad_parser_pyscf/parsers/parser.py for parser.py), and slightly changed the content of the parser module:
# all the other imports are here

configuration = config.get_plugin_entry_point(
    'nomad_parser_pyscf.parsers:parser_entry_point'
)


class PySCFParser(MatchingParser):
    def parse(
        self,
        mainfile: str,
        archive: 'EntryArchive',
        logger: 'BoundLogger',
        child_archives: dict[str, 'EntryArchive'] = None,
    ) -> None:
        print('Hello world!')
Note that we have also changed the class imported in tests/parsers/test_parser.py from NewParser to PySCFParser.
Finally, as we are going to use the nomad-simulations package, we need to add it to our dependencies in pyproject.toml:
In order to test the parser, we can install the parser plugin in editable mode (with the flag -e):
uv pip install -e '.[dev]' --index-url https://gitlab.mpcdf.mpg.de/api/v4/projects/2187/packages/pypi/simple
Installing nomad-lab
Until there is an official PyPI release of NOMAD with the plugins functionality, make sure to include NOMAD's internal package registry (via --index-url in the above command). Alternatively, you can add the following lines to the pyproject.toml:
With these changes, you can parse the glycine.log file and show the archive in the terminal (note we put the glycine.log file in the tests/data/ sub-folder):
If everything went well, you should see the message printed by the PySCFParser().parse() function:
In Python (in a module, Jupyter notebook, or in the interactive terminal), you can test the parsing by importing PySCFParser and specifying the path to the file. Note we also need to instantiate an empty EntryArchive in order to pass it as an input of the parse() function:
from nomad.datamodel import EntryArchive
from nomad_parser_pyscf.parsers.parser import PySCFParser
archive = EntryArchive()
PySCFParser().parse(mainfile='<path-to-mainfile>/tests/data/glycine.log', archive=archive, logger=None)
# Other operations / analyses here
Mapping into nomad-simulations¶
Now that you know the basics of instantiating and populating the nomad-simulations schema (see Assignments in the NOMAD-Simulations) and the basics of setting up a parser, let's combine both concepts and populate the nomad-simulations schema from the output mainfile. The implementation of this plugin will allow you to manage data in a controlled way. You can further use it for analysis tools or include these functionalities in your research workflows. You can read more details in the NOMAD documentation page Plugins - How to write a parser.
Depending on the format of the mainfile, extracting the data will differ, and you will need to implement the parsing slightly differently. You might also need to manipulate the data in order to adapt it to the definitions of nomad-simulations (shapes, types, etc.):
- Unstructured text: You need to specify the regular expression (regex) to match the text in the file in order to use it in the parser.py module. This is the case of our example, so you can read below how to implement it.
- Structured formats (HDF5, JSON, XML): You can use Python libraries to extract the data (e.g., h5py for HDF5 files) and map it into the nomad-simulations schema. In the specific case of XML, you can use the XMLParser.
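For a structured format, the extraction step is mostly a matter of loading the file with a library and picking out the relevant keys. The sketch below uses the standard `json` module. The JSON layout and the helper `extract_program_info` are hypothetical and only illustrate the idea of mapping file contents onto the quantities of a schema section.

```python
import json

# Hypothetical JSON output of a simulation code (not a real PySCF format)
raw = '{"code": {"name": "PySCF", "version": "2.2.1"}, "n_atoms": 10}'


def extract_program_info(text: str) -> dict:
    """Return the program name and version, using None for missing keys."""
    data = json.loads(text)
    code = data.get('code', {})
    return {'name': code.get('name'), 'version': code.get('version')}


program_info = extract_program_info(raw)
print(program_info)  # {'name': 'PySCF', 'version': '2.2.1'}
```

Returning `None` for missing keys, instead of raising, lets the parser fill in whatever is available and leave the rest of the schema unset.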
In our example, we have an unstructured text file, glycine.log. For simplicity, we will start with the Program information: name and version. For unstructured text, we can use the NOMAD implementation class, TextParser. We need to:
1. Define a class inheriting from TextParser.
2. Overwrite its method init_quantities() and define self._quantities as a list of nomad.parsing.file_parser.Quantity. Note this is a different class than the one used for defining a schema, and we will refer to it as ParsedQuantity from now on. ParsedQuantity() has a key: value structure and can take several arguments, the most important ones being:
    - The first argument is the key string to identify the parsed quantity.
    - The second argument is the regex associated with that key to match the text.
    - repeats is a boolean that controls whether the regex is matched multiple times. If repeats=True and the regex matches, it returns a list. Otherwise it returns nothing or a singular value.
3. Instantiate the class of point 1 in the parse() function and define the path to its mainfile.
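The `key`, regex, and `repeats` arguments can be mimicked with the standard `re` module to see what kind of value to expect. The function `toy_get` below is a hypothetical stand-in and not how `TextParser` is implemented (the real class may, for example, also convert the matched strings to numbers), and the sample text is made up.

```python
import re
from typing import Optional, Union

# Made-up log excerpt, for illustration only
text = """PySCF version 2.2.1
cycle= 1 E= -282.1
cycle= 2 E= -282.9
"""


def toy_get(regex: str, text: str, repeats: bool = False) -> Optional[Union[str, list]]:
    """Return the first captured group, or all of them if repeats=True."""
    if repeats:
        matches = re.findall(regex, text)
        return matches or None
    match = re.search(regex, text)
    return match.group(1) if match else None


version = toy_get(r'PySCF *version *([\d\.]+)', text)
energies = toy_get(r'E= *(-?[\d\.]+)', text, repeats=True)
missing = toy_get(r'Total time: *([\d\.]+)', text)

print(version)   # 2.2.1
print(energies)  # ['-282.1', '-282.9']
print(missing)   # None
```

The parentheses in each regex mark the group whose content is kept as the value; everything outside the group only serves to locate it.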
We can implement these steps in src/nomad_parser_pyscf/parsers/parser.py. First, we add these lines before PySCFParser(MatchingParser):
from typing import (
    TYPE_CHECKING,
)

if TYPE_CHECKING:
    from nomad.datamodel.datamodel import (
        EntryArchive,
    )
    from structlog.stdlib import (
        BoundLogger,
    )

from nomad.config import config
from nomad.parsing.parser import MatchingParser
from nomad.parsing.file_parser import Quantity as ParsedQuantity, TextParser

configuration = config.get_plugin_entry_point(
    'nomad_parser_pyscf.parsers:parser_entry_point'
)


class LogParser(TextParser):
    def init_quantities(self):
        self._quantities = [
            ParsedQuantity(
                'program_version', r'PySCF *version *([\d\.]+)', repeats=False
            )
        ]
Note we defined LogParser() and its self._quantities to match the version of the PySCF run. The program name will be set to 'PySCF' during parsing anyway.
If the regex matches, then this should return the value appearing after the string 'PySCF version', i.e., '2.2.1'. We can test this by:
class PySCFParser(MatchingParser):
    def parse(
        self,
        mainfile: str,
        archive: 'EntryArchive',
        logger: 'BoundLogger',
        child_archives: dict[str, 'EntryArchive'] = None,
    ) -> None:
        log_parser = LogParser(mainfile=mainfile, logger=logger)
        print(log_parser.get('program_version'))
If we now run nomad parse tests/data/glycine.log, we obtain:
Now, we can instantiate Simulation and Program sections, and add them to the archive. This can be done by:
# other imports here
from nomad_simulations.schema_packages.general import Simulation, Program

# `configuration` and `LogParser` defined here


class PySCFParser(MatchingParser):
    def parse(
        self,
        mainfile: str,
        archive: 'EntryArchive',
        logger: 'BoundLogger',
        child_archives: dict[str, 'EntryArchive'] = None,
    ) -> None:
        log_parser = LogParser(mainfile=mainfile, logger=logger)

        simulation = Simulation()
        program = Program(
            name='PySCF', version=log_parser.get('program_version')
        )
        simulation.program = program

        # Add the `Simulation` activity to the `archive`
        archive.data = simulation
Now, this populated schema can be output by adding a print statement with the path to the section or quantity, after archive.data is assigned in the parser:
Alternatively, you can pass --show-archive when running the nomad parse command to print the archive in the terminal:
And we obtain: