Creating parser plugins¶
The role of a parser is to map a structured or unstructured file into a well-defined, standardized, and (ideally) FAIR data schema. In our case, we are interested in mapping simulation data into the nomad-simulations schema. When the sections defined in nomad-simulations are not enough, any user can extend and tune the schema to their specific needs, see Extending the schema.
In NOMAD, when a user creates an upload and pushes some files, the files are processed. This processing involves several steps, of which matching and parsing are the most relevant:
- Matching the file(s) to the relevant parser class, called <ParserName>Parser. Only one specific file is matched with the parser class; this file is called the mainfile.
- Calling the parse() function of the <ParserName>Parser class.
- The parse() function populates the archive from the mainfile and any other auxiliary files which might be relevant. This archive contains all of the relevant metadata in a NOMAD entry.
You can find more information about processing in the NOMAD documentation, see Explanation - Processing. Schematically, this processing can be visualized as:
On this page, we are going to learn how to create a parser, its structure, and its basic functionalities. NOMAD parsers are plugins, and thus can be defined in their own repositories and developed independently of the main software. If you administer a NOMAD Oasis, you can read more about connecting plugins in the NOMAD documentation, see NOMAD Oasis - Install plugins.
Starting a plugin project¶
To create your own parser plugin, visit the NOMAD plugin template and click the “Use this template” button (you need a GitHub account to do so):
You can decide where to host the parser plugin. Once this is done, clone the generated repository to your local machine. For the purpose of this example, we will use JosePizarro3/example-plugin:
All the steps we are going to follow can be found in the branch pyscf-new-template.
Go to the example-plugin directory, create a virtual environment with Python 3.9, 3.10, or 3.11, and activate it:
Install uv (a very fast installer of Python packages) and cruft, and follow the instructions in the README.md to generate the parser:
pip install --upgrade pip
pip install uv cruft
cruft create https://github.com/FAIRmat-NFDI/cookiecutter-nomad-plugin
You will be prompted with some questions and information regarding the plugin. Make sure to include both the parser and schema_package options (the latter will be used in the Extending the Schema part):
[1/13] full_name (John Doe): <whatever-name>
[2/13] email (john.doe@physik.hu-berlin.de): <whatever-email>
[3/13] github_username (Github organization or profile name, default: foo): <whatever-github-name>
[4/13] plugin_name (foobar): nomad-parser-pyscf
[5/13] module_name (recommended: press enter to use the default module name)
(nomad_parser_pyscf):
[6/13] short_description (Nomad example template): NOMAD parser plugin for PySCF simulations output in a log text file.
[7/13] version (0.1.0):
[8/13] Select license
1 - MIT
2 - BSD-3
3 - GNU GPL v3.0+
4 - Apache Software License 2.0
Choose from [1/2/3/4] (1): <whatever-license>
[9/13] include_schema_package [y/n] (y): y
[10/13] include_normalizer [y/n] (y): n
[11/13] include_parser [y/n] (y): y
[12/13] include_app [y/n] (y): n
[13/13] include_example_uploads [y/n] (y): n
You can use the script under the generated folder to move all files one level up:
The structure of the plugin is (without including the docs and mkdocs.yml files):
example-plugin/
├── README.md
├── LICENSE
├── pyproject.toml
├── src/
│   └── nomad_parser_pyscf/
│       ├── __init__.py
│       ├── parsers/
│       │   ├── __init__.py
│       │   └── parser.py
│       └── schema_packages/
│           ├── __init__.py
│           └── schema_package.py
├── tests/
│   ├── conftest.py
│   ├── data/
│   │   ├── example.out
│   │   └── test.archive.yaml
│   ├── parsers/
│   │   └── test_parser.py
│   └── schema_packages/
│       └── test_schema_package.py
└── ... (other files)
You can read more about plugins in the NOMAD documentation, see How to get started with plugins. The entry point for a parser plugin is defined in the src/nomad_parser_pyscf/parsers/__init__.py file.
Matching the files to a parser¶
We are going to consider an example file generated by the software PySCF (click here for download), and use it to parse data into the nomad-simulations schema. If you want to learn more about extending the nomad-simulations schema when the information is not defined, see Extending the Schema.
In the template, go to the entry point in src/nomad_parser_pyscf/parsers/__init__.py and change the content to:
from nomad.config.models.plugins import ParserEntryPoint


class PySCFEntryPoint(ParserEntryPoint):
    def load(self):
        # Import lazily so the parser is only loaded when it is needed
        from nomad_parser_pyscf.parsers.parser import PySCFParser

        return PySCFParser(**self.dict())


parser_entry_point = PySCFEntryPoint(
    name='PySCFParser',
    description='Parser for PySCF output written in a log text file.',
    # Raw strings avoid invalid-escape warnings for `\.` and `\d`
    mainfile_name_re=r'.*\.log.*',
    mainfile_contents_re=r'PySCF version [\d\.]*',
)
Matching other mainfiles
In the ParserEntryPoint class, there are several attributes that can be defined to match a file. These can be the mainfile name, a regular expression (regex) matched at the beginning of the mainfile, binary file headers, etc. For your specific use case, you need to identify which file is the mainfile and use it to match your parser. You can read more about the options in the NOMAD documentation page Reference - Configuration - ParserEntryPoint.
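To get a feeling for what the two regexes in the entry point accept, they can be tried out with the standard re module. The following is only a sketch: the helper looks_like_pyscf_mainfile is hypothetical, and the choice of a full match on the file name and a search in the first lines of the file is an assumption made for illustration, not a description of how NOMAD performs the matching internally.

```python
import re

# Patterns copied from the entry point definition.
MAINFILE_NAME_RE = re.compile(r'.*\.log.*')
MAINFILE_CONTENTS_RE = re.compile(r'PySCF version [\d\.]*')


def looks_like_pyscf_mainfile(filename: str, head: str) -> bool:
    """Hypothetical helper: True if both the name and the contents match.

    `head` stands for the first lines of the file (an assumption).
    """
    name_ok = MAINFILE_NAME_RE.fullmatch(filename) is not None
    contents_ok = MAINFILE_CONTENTS_RE.search(head) is not None
    return name_ok and contents_ok


print(looks_like_pyscf_mainfile('glycine.log', 'PySCF version 2.2.1\n'))  # True
print(looks_like_pyscf_mainfile('glycine.out', 'PySCF version 2.2.1\n'))  # False
print(looks_like_pyscf_mainfile('other.log', 'Some other code\n'))        # False
```

Requiring both conditions keeps log files from other codes, which often share the .log extension, from being picked up by this parser.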
Note that we deleted the configuration parameter field (as it is not relevant for the purpose of this example) and renamed some files, so that the parser module is now src/nomad_parser_pyscf/parsers/parser.py. We also slightly changed its content:
# all the other imports are here

configuration = config.get_plugin_entry_point(
    'nomad_parser_pyscf.parsers:parser_entry_point'
)


class PySCFParser(MatchingParser):
    def parse(
        self,
        mainfile: str,
        archive: 'EntryArchive',
        logger: 'BoundLogger',
        child_archives: dict[str, 'EntryArchive'] = None,
    ) -> None:
        print('Hello world!')
Note that we have also changed the class imported in tests/parsers/test_parser.py from NewParser to PySCFParser.
Finally, as we are going to use the nomad-simulations package, we need to add it to our dependencies in pyproject.toml:
In order to test the parser, we can install the parser plugin in editable mode (with the flag -e):
uv pip install -e '.[dev]' --index-url https://gitlab.mpcdf.mpg.de/api/v4/projects/2187/packages/pypi/simple
Installing nomad-lab
Until we have an official PyPI release of NOMAD with the plugins functionality, make sure to include NOMAD's internal package registry (via --index-url in the above command). Alternatively, you can add the following lines to the pyproject.toml:
With these changes, you can parse the glycine.log file and show the archive in the terminal (note we put the glycine.log file in the tests/data/ sub-folder):
If everything went well, you should see the message printed by the PySCFParser().parse() function:
In Python (in a module, Jupyter notebook, or in the interactive terminal), you can test the parsing by importing PySCFParser and specifying the path to the file. Note we also need to instantiate an empty EntryArchive in order to pass it as an input of the parse() function:
from nomad.datamodel import EntryArchive
from nomad_parser_pyscf.parsers.parser import PySCFParser
archive = EntryArchive()
PySCFParser().parse(mainfile='<path-to-mainfile>/tests/data/glycine.log', archive=archive, logger=None)
# Other operations / analyses here
Mapping into nomad-simulations¶
Now that you know the basics of instantiating and populating the nomad-simulations schema (see Assignments in the NOMAD-Simulations) and the basics of setting up a parser, let's combine both concepts and populate the nomad-simulations schema from the output mainfile. The implementation of this plugin will allow you to manage data in a controlled way. You can further use it for analysis tools or include these functionalities in your research workflows. You can read more details in the NOMAD documentation page Plugins - How to write a parser.
Depending on the format of the mainfile, extracting the data will be different and you will need to implement the parsing slightly differently. You might also need to manipulate the data in order to adapt it to the definitions of nomad-simulations (shapes, types, etc.):
- Unstructured text: You need to specify the regular expression (regex) to match the text in the file in order to use it in the parser.py module. This is the case of our example, so you can read below how to implement it.
- Structured formats (HDF5, JSON, XML): You can use Python libraries to extract the data (e.g., h5py for HDF5 files) and map it into the nomad-simulations schema. In the specific case of XML, you can use the XMLParser.
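For structured formats, the extraction step usually reduces to loading the file with a library and picking out the relevant keys. The following is a minimal sketch using the standard json module. The JSON layout and the helper read_program_info are invented for illustration (PySCF does not write such a file), and the result is a plain dictionary rather than a nomad-simulations section.

```python
import json

# Hypothetical structured output; the key names are assumptions for this sketch.
raw = '{"program": {"name": "PySCF", "version": "2.2.1"}}'


def read_program_info(text: str) -> dict:
    """Extract the program name and version from a JSON string.

    Missing keys are returned as None instead of raising an error.
    """
    data = json.loads(text)
    program = data.get('program', {})
    return {'name': program.get('name'), 'version': program.get('version')}


info = read_program_info(raw)
print(info)
```

The resulting values could then be passed to the corresponding schema section in the same way as values extracted from unstructured text.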
In our example, we have an unstructured text file, glycine.log. For simplicity, we will start with the Program information: name and version. For unstructured text, we can use the NOMAD implementation class, TextParser. We need to:
1. Define a class inheriting from TextParser.
2. Overwrite its method init_quantities() and define self._quantities as a list of nomad.parsing.file_parser.Quantity. Note this is a different class than the one used for defining a schema, and we will refer to it as ParsedQuantity from now on. ParsedQuantity() has a key: value structure and can take several arguments, the most important ones being:
    - The first argument is the key string to identify the parsed quantity.
    - The second argument is the regex associated with that key to match the text.
    - repeats is a boolean that controls whether the regex is matched multiple times. If repeats=True and the regex matches, it returns a list. Otherwise it returns nothing or a single value.
3. Instantiate the class of point 1 in the parse() function and define the path to its mainfile.
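The effect of repeats can be illustrated with the standard re module alone. The following sketch only emulates the behavior described in the list above; the function extract, the sample text, and the pattern are all hypothetical, and nothing else that the NOMAD class does is reproduced here.

```python
import re
from typing import Optional, Union


def extract(
    pattern: str, text: str, repeats: bool = False
) -> Optional[Union[str, list[str]]]:
    """Emulate the described `repeats` semantics with plain `re`.

    repeats=True  -> list with every match, or None if nothing matches
    repeats=False -> first match only, or None if nothing matches
    """
    if repeats:
        matches = re.findall(pattern, text)
        return matches or None
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Invented sample text with a value that appears more than once.
sample = 'cycle 1 E = -1.0\ncycle 2 E = -1.5\n'
energy_re = r'E = ([-\d\.]+)'

print(extract(energy_re, sample, repeats=True))   # ['-1.0', '-1.5']
print(extract(energy_re, sample, repeats=False))  # -1.0
print(extract(r'missing (\d+)', sample))          # None
```

A quantity that appears once per file, such as the program version, fits repeats=False, while quantities printed at every step of a calculation are candidates for repeats=True.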
We can implement these steps in src/nomad_parser_pyscf/parsers/parser.py. First, we add these lines before PySCFParser(MatchingParser):
from typing import (
    TYPE_CHECKING,
)

if TYPE_CHECKING:
    from nomad.datamodel.datamodel import (
        EntryArchive,
    )
    from structlog.stdlib import (
        BoundLogger,
    )

from nomad.config import config
from nomad.parsing.parser import MatchingParser
from nomad.parsing.file_parser import Quantity as ParsedQuantity, TextParser

configuration = config.get_plugin_entry_point(
    'nomad_parser_pyscf.parsers:parser_entry_point'
)


class LogParser(TextParser):
    def init_quantities(self):
        self._quantities = [
            ParsedQuantity(
                'program_version', r'PySCF *version *([\d\.]+)', repeats=False
            )
        ]
Note we defined LogParser() and its self._quantities to match the version of the PySCF run. The name will be set to 'PySCF' during parsing in any case.
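Before wiring the pattern into the parser, it can be checked in isolation with the standard re module. This is a small sketch: the sample line is an assumption about how the version appears in the log file, and only the regex itself is taken from LogParser.

```python
import re

# Regex copied from the `program_version` quantity.
VERSION_RE = r'PySCF *version *([\d\.]+)'

# Assumed sample line; the real log file may contain more text around it.
sample_line = 'PySCF version 2.2.1'

match = re.search(VERSION_RE, sample_line)
version = match.group(1) if match else None
print(version)  # 2.2.1
```

The parentheses in the regex define the capture group, so only the version number is returned and not the preceding text.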
If the regex matches, then this should return the value appearing after the string 'PySCF version', i.e., '2.2.1'. We can test this by:
class PySCFParser(MatchingParser):
    def parse(
        self,
        mainfile: str,
        archive: 'EntryArchive',
        logger: 'BoundLogger',
        child_archives: dict[str, 'EntryArchive'] = None,
    ) -> None:
        log_parser = LogParser(mainfile=mainfile, logger=logger)
        print(log_parser.get('program_version'))
If we now run nomad parse tests/data/glycine.log, we obtain:
Now, we can instantiate the Simulation and Program sections, and add them to the archive. This can be done by:
# other imports here
from nomad_simulations.schema_packages.general import Simulation, Program

# `configuration` and `LogParser` defined here


class PySCFParser(MatchingParser):
    def parse(
        self,
        mainfile: str,
        archive: 'EntryArchive',
        logger: 'BoundLogger',
        child_archives: dict[str, 'EntryArchive'] = None,
    ) -> None:
        log_parser = LogParser(mainfile=mainfile, logger=logger)

        simulation = Simulation()
        program = Program(
            name='PySCF', version=log_parser.get('program_version')
        )
        simulation.program = program

        # Add the `Simulation` activity to the `archive`
        archive.data = simulation
Now, this populated schema can be output by adding a print statement with the path to the section or quantity after archive.data is assigned in the parser, or by passing the flag --show-archive when running the nomad parse command to print the archive in the terminal:
And we obtain: