Parser architecture¶
Software architecture¶
Mapping between vendor-specific representations and community-agreed data models is a central challenge in research data management. pynxtools-xps solves this for X-ray photoelectron spectroscopy through a three-layer pipeline: parsing, metadata normalization, and template mapping.
The implementation is organized in three explicit layers, each with a clearly bounded responsibility:
- Parsing — vendor-specific
_XPSParsersubclasses extract spectra from raw files into flat dictionaries. - Normalization — per-vendor
_MetadataContextinstances canonicalize key names, values, and units through a deterministic pipeline. - Templating — JSON config files map canonical keys to NeXus paths in
NXxpsorNXmpes, whichXPSReaderuses to write the HDF5 output.
This separation keeps all vendor-specific knowledge confined to
parsers/<vendor>/ subpackages and makes the normalization logic independently testable.
Layer 1 — File parsing¶
Each supported format has a dedicated parser in
src/pynxtools_xps/parsers/<vendor>/parser.py.
Parser hierarchy¶
All parsers inherit from
_Parser,
the abstract base class that defines the extension, version, and structure validation contract.
Two concrete base classes specialize it:
_XPSParser— primary data parsers. Subclasses implementmatches_file()(structural validation) and_parse()(data extraction)._XPSMetadataParser— supplementary parsers for auxiliary files (for example, CasaXPS quantification exports). They inject additional metadata into already-parsed data viaupdate_main_file_data().
Output format¶
Every _XPSParser populates self._data, a list[dict[str, Any]]. Each dictionary
represents one XPS spectrum and contains all raw key-value pairs extracted from the file.
Key names and value formats are vendor-specific at this stage.
Version awareness¶
Parsers can declare which file versions they support via two class variables:
requires_version = True # reject files that carry no version string
supported_versions = (
((1, 1), (4, 0)), # [1.1, 4.0)
((4, 1), (4, 101)), # [4.1, 4.101)
)
Version strings extracted from file headers are tokenized into comparable VersionTuple
objects by
normalize_version
before the range check.
Intervals are half-open: the lower bound is inclusive, the upper bound exclusive.
Layer 2 — Metadata normalization¶
Raw key-value pairs differ in naming, unit encoding, and value representation across
vendors.
_MetadataContext
is a stateless normalization engine that converts a (key, value) pair into a canonical (key, value, unit) triple through a fixed pipeline.
Normalization pipeline¶
Each step in _MetadataContext.format(key, value) runs in a fixed order:
| Step | What it does |
|---|---|
1. normalize_key |
Look up key_map; fall back to PascalCase → snake_case conversion |
2. parse_value_and_unit |
Split inline value+unit strings, e.g. "5.0 eV" → ("5.0", "eV") |
3. resolve_unit_from_key |
Extract unit embedded in key, e.g. "energy [eV]" → key "energy", unit "eV" |
4. get_default_unit |
Assign unit from default_units if none found yet |
5. map_unit |
Normalize abbreviations via unit_map, e.g. "s-1" → "1/s", "norm" → None |
6. map_value |
Apply converter functions from value_map, e.g. _convert_energy_scan_mode |
7. _format_value |
Coerce numeric strings to int or float |
Per-vendor contexts¶
Each vendor subpackage defines a module-level _context instance in
parsers/<vendor>/metadata.py:
_context = _MetadataContext(
key_map=_KEY_MAP, # vendor key → canonical name
value_map=_VALUE_MAP, # canonical key → converter function
unit_map=_UNIT_MAP, # vendor unit string → standard unit
default_units=_DEFAULT_UNITS, # canonical key → unit when not stated explicitly
)
The shared converter functions (such as _convert_measurement_method and
_convert_energy_scan_mode) live in
mapping.py
and are reused across all vendor contexts.
Layer 3 — Template mapping¶
After normalization, the canonical key-value pairs are written into a NeXus template using
the parser's JSON config file in
src/pynxtools_xps/config/.
The config maps flat dict keys to paths in the target application definition:
Learn more about these config files in the pynxtools documentation: pynxtools > Learn > ... > The MultiFormatReader.
XPSReader (the pynxtools reader plugin), which is a subclass of the pynxtools MultiFormatReader, uses this config to fill the template, merges in ELN-provided metadata for any required fields absent from the raw data, and writes the final .nxs HDF5 file via the pynxtools dataconverter.
The supported application definitions are:
NXxps— the primary target, specialized for X-ray photoelectron spectroscopyNXmpes— the multi-technique photoemission superset
Typed intermediate representation¶
Intermediate data objects used during parsing inherit from
_XPSDataclass,
a base class that enforces type annotations at assignment time.
It coerces compatible values (for example, str → int) and raises TypeError for
values that cannot be converted, keeping parsing logic free of ad-hoc type guards.