Simulation base section¶
In NOMAD, all the simulation metadata is defined in the Simulation section. You can find its Python schema definition in src/nomad_simulations/general.py. This section will appear under the data section for the archive metadata structure of each entry.
The Simulation section inherits from a base section BaseSimulation. NOMAD defines a set of base sections derived from the Basic Formal Ontology (BFO). We use them to define BaseSimulation as an Activity. The UML diagram is:
BaseSimulation contains the general information about the Program used, as well as general times of the simulation, e.g., the datetime at which it started (datetime) and ended (datetime_end). Simulation contains further information about the specific input and output sections (see below). The detailed UML diagram of quantities and functions defined for Simulation is thus:
Notation for the section attributes in the UML diagram
We include the information for each attribute / quantity after its definition. The notation is:
<name-of-quantity>: <type-of-quantity>, <units-of-quantity>
Thus, cpu1_start: np.float64, s means that there is a quantity named 'cpu1_start' of type numpy.float64 and whose units are 's' (seconds).
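As an illustration of how this notation reads, the following minimal sketch splits one diagram line into its name, type, and units. The helper `parse_quantity_notation` and the `QuantityNotation` tuple are hypothetical names introduced here, not part of NOMAD, and treating the units as optional is an assumption.

```python
import re
from typing import NamedTuple, Optional


class QuantityNotation(NamedTuple):
    """One attribute line of the UML diagram (hypothetical helper type)."""

    name: str
    type_name: str
    unit: Optional[str]  # None when no units are given (assumed to be allowed)


# Matches "<name>: <type>" optionally followed by ", <units>"
_NOTATION = re.compile(
    r'^\s*(?P<name>\w+)\s*:\s*(?P<type>[\w.]+)\s*(?:,\s*(?P<unit>.+?))?\s*$'
)


def parse_quantity_notation(line: str) -> QuantityNotation:
    """Split a diagram line such as 'cpu1_start: np.float64, s' into its parts."""
    match = _NOTATION.match(line)
    if match is None:
        raise ValueError(f'not in "<name>: <type>, <units>" form: {line!r}')
    return QuantityNotation(match['name'], match['type'], match['unit'])


print(parse_quantity_notation('cpu1_start: np.float64, s'))
```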
We also indicate the existence of sub-sections by bolding the name, i.e.:
<name-of-sub-section>: <sub-section-definition>
E.g., there is a sub-section under Simulation named 'model_method' whose section definition can be found in the ModelMethod section. We will represent this sub-section containment in more complex UML diagrams in the future using the containment arrow (see below for an example using Program).
Simulation also inherits from EntryData (double inheritance) in order to populate the data section in the NOMAD archive. All of the base sections discussed here are subject to the public normalize function in NOMAD. The private function set_system_branch_depth() is related to the ModelSystem base section.
Main sub-sections in Simulation¶
The Simulation base section is composed of 4 main sub-sections:
- Program: contains all the program information, e.g., name of the program, version, etc.
- ModelSystem: contains all the system information about geometrical positions of atoms, their states, simulation cells, symmetry information, etc.
- ModelMethod: contains all the methodological information, and it is divided in two main aspects: the mathematical model or approximation used in the simulation (e.g., DFT, GW, ForceFields, etc.) and the numerical settings used to compute the properties (e.g., meshes, self-consistent parameters, basis sets settings, etc.).
- Outputs: contains all the output properties, as well as references to the ModelSystem used to obtain such properties. It might also contain information which will populate ModelSystem (e.g., atomic occupations, atomic moments, crystal field energies, etc.).
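The containment of these four sub-sections can be pictured with a small stand-in data structure. The sketch below uses plain Python dataclasses, not the real NOMAD section classes. The field names `label` and `model_system_ref` are hypothetical, and treating `model_system`, `model_method`, and `outputs` as repeating sub-sections is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Program:
    # Stand-in for the Program sub-section: name and version of the code
    name: Optional[str] = None
    version: Optional[str] = None


@dataclass
class ModelSystem:
    # Stand-in for the system information (positions, cell, symmetry, ...)
    label: Optional[str] = None  # hypothetical field, for illustration only


@dataclass
class ModelMethod:
    # Stand-in for the mathematical model and numerical settings
    label: Optional[str] = None  # hypothetical field, for illustration only


@dataclass
class Outputs:
    # Outputs keep a reference to the ModelSystem they were computed for
    model_system_ref: Optional[ModelSystem] = None  # hypothetical field name


@dataclass
class Simulation:
    program: Optional[Program] = None
    model_system: List[ModelSystem] = field(default_factory=list)
    model_method: List[ModelMethod] = field(default_factory=list)
    outputs: List[Outputs] = field(default_factory=list)


simulation = Simulation(program=Program(name='SUPERCODE', version='1.0'))
system = ModelSystem(label='bulk')
simulation.model_system.append(system)
simulation.model_method.append(ModelMethod(label='DFT'))
simulation.outputs.append(Outputs(model_system_ref=system))
```

Here the output section points back to the same `ModelSystem` object that is stored under `model_system`, mirroring the reference described in the list above.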
Self-consistent steps, SinglePoint entries, and more complex workflows.
The minimal unit for storing data in the NOMAD archive is an entry. In the context of simulation data, an entry may contain data from a calculation on an individual system configuration (e.g., a single-point DFT calculation) using only the above-mentioned sections of the Simulation section. Information from the self-consistent iterations used to converge properties for this configuration is also contained within these sections.
More complex calculations that involve multiple configurations require the definition of a workflow section within the archive. Depending on the situation, the information from individual workflow steps may be stored within a single entry or across multiple entries. For example, for efficiency, the data from workflows involving a large number of configurations, e.g., molecular dynamics trajectories, are stored within a single entry. Other standard workflows store the single-point data in separate entries, e.g., a GW calculation is composed of a DFT SinglePoint entry and a GW SinglePoint entry. Higher-level workflows, which simply connect a series of standard or custom workflows, are typically stored as a separate entry. You can check the NOMAD simulations workflow schema for more information.
The following schematic shows a simplified representation of the Simulation section (note that the arrows here are a simple way of visually defining inputs and outputs):
Program¶
The Program base section contains all the information about the program / software / code used to perform the simulation. We consider it to be a (Continuant) Entity, contained within BaseSimulation as a sub-section. The detailed UML diagram is:
When writing a parser, we recommend starting by instantiating the Program section and populating its quantities, in order to get acquainted with the NOMAD parsing infrastructure.
For example, imagine we have a file which we want to parse with the following information:
We can parse the program name and version by matching the text with regular expressions (see, e.g., the Wikipedia page on regular expressions, also called regex):
from nomad.parsing.file_parser import TextParser, Quantity
from nomad_simulations import Simulation, Program


class SUPERCODEParser:
    """
    Class responsible for populating the NOMAD `archive` from the files given by a
    SUPERCODE simulation.
    """

    def parse(self, filepath, archive, logger):
        # Define the quantities to extract from the mainfile via regex matching
        output_parser = TextParser(
            quantities=[
                Quantity('program_version', r'version *([\d\.]+) *', repeats=False)
            ]
        )
        output_parser.mainfile = filepath

        # Instantiate the sections and populate the program information
        simulation = Simulation()
        simulation.program = Program(
            name='SUPERCODE',
            version=output_parser.get('program_version'),
        )
        # set `Simulation` as the `archive.data` section
        # (`archive.data` is a single, non-repeating sub-section)
        archive.data = simulation
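To see what the version pattern above captures, it can be tried on its own with Python's standard `re` module. This is a minimal sketch: the sample line is hypothetical (it is not taken from a real SUPERCODE output), and it assumes the pattern is applied with search semantics anywhere in the text.

```python
import re
from typing import Optional

# Same pattern as passed to `Quantity` in the parser above
VERSION_PATTERN = re.compile(r'version *([\d\.]+) *')


def extract_version(text: str) -> Optional[str]:
    """Return the first version string found in `text`, or None if absent."""
    match = VERSION_PATTERN.search(text)
    return match.group(1) if match else None


# Hypothetical output line, for illustration only
sample = 'SUPERCODE version 7.2.1 \n'
print(extract_version(sample))
```

Note that the character class `[\d\.]+` accepts any run of digits and dots, so a trailing dot directly after the number would also be captured. A stricter pattern may be preferable for real output files.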
Homogenization and the role of Workflows¶
Workflows are a device for annotating the structure of a set of entries in a standardized way. A community can define a workflow schema, i.e., its standout sections, without any knowledge of the underlying entries. As such, workflows define a homogenized data format with rich semantics that acts as the starting point from which to explore the dataset. The actual workflow entry instance, meanwhile, handles the coordination between tasks and the underlying data. This mapping may be a mixture of (workflow) entries and sections.
Below are a few examples of actual mappings.
Note that these examples already presuppose a certain structure on the side of the referenced archive.data sections.
In reality, a workflow should be capable of hosting multiple underlying structures.
Serial Updates to ModelSystem¶
Geometry optimizations or molecular dynamics simulations, for example, typically store their data in a single entry.
Tasks trace the updates to the system, i.e. calculation or time steps, respectively.
The actual modifications made in these steps are stored in the model_system and outputs sections.
entry_x#workflow2.task[0].input -> entry_y1#data.model_system[0]
entry_x#workflow2.task[0].output -> entry_y2#data.outputs[0]
...
entry_x#workflow2.task[n].input -> entry_y1#data.model_system[n]
entry_x#workflow2.task[n].output -> entry_y2#data.outputs[n]
Note that entry_x, entry_y1, entry_y2 will typically refer to the same entry, though this isn't a hard requirement.
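The serial mapping above can be generated programmatically. The sketch below only builds the reference strings in the schematic notation used here. The function `serial_task_references` and its arguments are hypothetical, and the strings are not guaranteed to match NOMAD's actual reference syntax.

```python
from typing import Dict, List


def serial_task_references(
    workflow_entry: str,
    system_entry: str,
    outputs_entry: str,
    n_steps: int,
) -> List[Dict[str, str]]:
    """Pair task i with model_system[i] as input and outputs[i] as output."""
    return [
        {
            'task': f'{workflow_entry}#workflow2.task[{i}]',
            'input': f'{system_entry}#data.model_system[{i}]',
            'output': f'{outputs_entry}#data.outputs[{i}]',
        }
        for i in range(n_steps)
    ]


# All three names may refer to the same entry; they are kept separate here
for ref in serial_task_references('entry_x', 'entry_y1', 'entry_y2', 3):
    print(f"{ref['task']}.input -> {ref['input']}")
    print(f"{ref['task']}.output -> {ref['output']}")
```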
Including SCF steps¶
Under construction
This section will be updated once its schema-side is settled on.
Single-Point SCF Workflow¶
In the case of a single-point calculation, the emphasis clearly lies on the relaxation of the electronic structure.
Each outputs section can then be dedicated entirely to following the SCF cycle.
Multi-step Electronic Workflows¶
There are two main design choices here:
- the methodology and computed outputs are split along major subroutines.
- they are kept in a single entry. This is especially useful for legacy cases, where some subroutines were originally not distinguished.
In the second option, for any workflows that are not simply serial, there is no canonical way of ordering outputs.
This burden remains with workflow2.tasks.
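One way to picture how the tasks could carry this ordering is a dependency graph over task names, from which an execution order of the stored outputs is derived. The sketch below is purely illustrative: the function `order_outputs`, the task names, and the dictionaries describing dependencies are hypothetical and do not reflect the actual workflow2 schema. It relies on `graphlib` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def order_outputs(
    task_outputs: Dict[str, int],
    depends_on: Dict[str, Set[str]],
) -> List[int]:
    """Return the indices of the outputs sections in a valid execution order.

    `task_outputs` maps a task name to the index of the outputs section it
    produced; `depends_on` maps a task name to the tasks it requires first.
    Every task named in `depends_on` is assumed to appear in `task_outputs`.
    """
    graph = {task: depends_on.get(task, set()) for task in task_outputs}
    return [task_outputs[task] for task in TopologicalSorter(graph).static_order()]


# The outputs were stored in arbitrary order; the tasks record that GW follows DFT
print(order_outputs({'gw': 0, 'dft': 1}, {'gw': {'dft'}}))
```

For branching workflows, several orders can be valid, so only the dependency constraints are guaranteed, not a unique sequence.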