Parser structure for computation¶
Overview of metadata organization for computation¶
NOMAD stores all processed data in a well defined, structured, and machine readable format, known as the archive
.
The schema that defines the organization of (meta)data within the archive is known as the MetaInfo
.
More information can be found in the NOMAD docs: An Introduction to Schemas and Structured Data in NOMAD.
The following diagram is an overarching visualization of the most important archive sections for computational data:
archive
├── run
│ ├── method
│ │ ├── atom_parameters
│ │ ├── dft
│ │ ├── forcefield
│ │ └── ...
│ ├── system
│ │ ├── atoms
│ │ │ ├── positions
│ │ │ ├── lattice_vectors
│ │ │ └── ...
│ │ └── ...
│ └── calculation
│ ├── energy
│ ├── forces
│ └── ...
└── workflow
├── method
├── inputs
├── tasks
├── outputs
└── results
The most important section of the archive for computational data is the run
section, which is
divided into three main subsections: method
, system
, and calculation
. method
stores
information about the computational model used to perform the calculation.
system
stores
attributes of the atoms involved in the calculation, e.g., atom types, positions, lattice vectors, etc.
calculation
stores the output of the calculation, e.g., energy, forces, etc.
The workflow
section of the archive then stores information about the series of tasks performed
to accumulate the (meta)data in the run section. The relevant input parameters for the workflow are
stored in method
, while the results
section stores output from the workflow beyond observables
of single configurations.
I think this better summarizes the general rule.
For example, any ensemble-averaged quantity from a molecular dynamics
simulation would be stored under workflow/results
. Then, the inputs
, outputs
, and tasks
sections define the specifics of the workflow.
For some standard workflows, e.g., geometry optimization and molecular dynamics, the NOMAD normalizers will automatically populate these specifics. The parser must only create the appropriate workflow section.
For non-standard workflows, the parser (or more appropriately the corresponding normalizer) must populate these sections accordingly. More information about the structure of the workflow section, as well as instructions on how to upload custom workflows to link individual Entries in NOMAD, can be found HERE
Recommended parser layout¶
The following represents the recommended core structure for a computational parser,
typically implemented within <parserproject>/parser.py
.
Imports¶
The imports typically include not only the necessary generic python modules, but also the required MetaInfo classes from nomad and additional nomad utilities:
<license>
import os
import numpy as np
from nomad.datamodel.metainfo.simulation.system import System, Atoms
from nomad.atomutils import get_molecules_from_bond_list
from nomad.units import ureg
For example, above the classes System
and Atoms
are imported from the nomad MetaInfo definitions in order to appropriately build and populate the NOMAD archive.
The atomutils
NOMAD module contains many useful functions for processing computational data.
Finally, the UnitRegistry
module of pint
(imported directly from NOMAD in this case), provides support to defining and converting units.
Parser Classes¶
class <Parsername><Mainfiletype>Parser(FileParser):
def __init__(self):
super().__init__(None)
@property
def file<mainfiletype>(self):
if self._file_handler is None:
try:
self._file_handler = <openfilefunction>(self.mainfile, 'rb')
except Exception:
self.logger.error('Error reading <mainfiletype> file.')
return self._file_handler
class <Parsername>Parser:
def __init__(self):
self.<mainfiletype>_parser = <Parsername><Mainfiletype>Parser()
The main class for your parser, <Parsername>Parser
, will contain the bulk of the parsing routine,
described further below. It may be useful to create a distinct class, <Parsername><Mainfiletype>Parser
,
for dealing with various filetypes that may be parsed throughout the entirety of the routine.
However, in the simplest case of a single file type parsed, the entire parser can of course be
implemented within a single class.
In the following, we will walk through the layout of the <Parsername>Parser
class. First,
every parser class should have a "main" function called parse()
, which will be called by NOMAD
when the appropriate mainfile is found:
def parse(self, filepath, archive, logger):
self.filepath = os.path.abspath(filepath)
self.archive = archive
self.logger = logging.getLogger(__name__) if logger is None else logger
self._maindir = os.path.dirname(self.filepath)
self._<parsername>_files = os.listdir(self._maindir) # get the list of files in the same directory as the mainfile
self._basename = os.path.basename(filepath).rsplit('.', 1)[0]
self.init_parser()
if self.<mainfileparser>_parser is None:
return
sec_run = self.archive.m_create(Run)
sec_run.program = Program(name='<PARSERNAME>', version='unknown')
self.parse_method()
self.parse_system()
self.parse_calculation()
self.parse_workflow()
Then, the individual functions to populate that various MetaInfo sections can be defined:
def parse_calculation(self):
sec_run = self.archive.run[-1]
sec_calc = sec_run.m_create(Calculation)
# populate calculation metainfo
# ...
def parse_system(self, frame):
sec_run = self.archive.run[-1]
sec_system = sec_run.m_create(System)
sec_atoms = sec_system.m_create(Atoms)
# populate system metainfo
# ...
def parse_method(self, frame):
sec_method = self.archive.run[-1].m_create(Method)
sec_force_field = sec_method.m_create(ForceField)
sec_model = sec_force_field.m_create(Model)
# populate method metainfo
# ...
def parse_workflow(self):
sec_workflow = self.archive.m_create(Workflow) # for old workflow, should update for workflow2
# populate workflow metainfo
# ...
For more information, see Examples - populating the NOMAD archive .