Parser architecture
Software architecture
Mapping between vendor-specific representations and community-agreed data models is a central challenge in research data management. pynxtools-xps solves this for X-ray photoelectron spectroscopy through a three-layer pipeline: parsing, metadata normalization, and template mapping.
The implementation is organized in three explicit layers, each with a clearly bounded responsibility:
- Parsing: vendor-specific _XPSParser subclasses extract spectra from raw files into typed ParsedSpectrum objects.
- Normalization: per-vendor _MetadataContext instances canonicalize key names, values, and units through a deterministic pipeline.
- Templating: JSON config files map canonical keys to NeXus paths in NXxps or NXmpes, which XPSReader uses to write the HDF5 output.
This separation keeps all vendor-specific knowledge confined to
parsers/<vendor>/ subpackages and makes the normalization logic independently testable.
Layer 1: File parsing
Each supported format has a dedicated parser in
src/pynxtools_xps/parsers/<vendor>/parser.py.
Parser hierarchy
All parsers inherit from _Parser, the abstract base class that defines the extension, version, and structure validation contract.
It declares two @abstractmethod hooks that every subclass must implement:
matches_file() (positive format identification) and _parse() (data extraction).
Two concrete base classes specialize it:
- _XPSParser: primary data parsers.
- _XPSMetadataParser: supplementary parsers for auxiliary files (for example, CasaXPS quantification exports). They inject additional metadata into already-parsed data via update_main_file_data().
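The contract described above can be sketched with Python's abc module. This is a minimal illustration, not the pynxtools-xps implementation: ParserBase, ToyTextParser, parse_file(), the method signatures, and the "TOYXPS" file format are all hypothetical stand-ins. Only the two hook names, matches_file() and _parse(), come from the text.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class ParserBase(ABC):
    """Hypothetical stand-in for the abstract base: both hooks are mandatory."""

    def __init__(self) -> None:
        self._data: dict[str, object] = {}

    @abstractmethod
    def matches_file(self, path: Path) -> bool:
        """Return True only if the file is positively identified as this format."""

    @abstractmethod
    def _parse(self, path: Path) -> None:
        """Extract data from the file into self._data."""

    def parse_file(self, path: Path) -> dict[str, object]:
        # Hypothetical driver: identify the format first, then extract.
        if not self.matches_file(path):
            raise ValueError(f"{path.name} is not handled by {type(self).__name__}")
        self._parse(path)
        return self._data


class ToyTextParser(ParserBase):
    """Hypothetical vendor parser for files whose first line is 'TOYXPS'."""

    def matches_file(self, path: Path) -> bool:
        with path.open(encoding="utf-8") as handle:
            return handle.readline().strip() == "TOYXPS"

    def _parse(self, path: Path) -> None:
        lines = path.read_text(encoding="utf-8").splitlines()[1:]
        self._data["Sample__Survey"] = [float(v) for v in lines if v.strip()]


# Usage: write a tiny toy file and parse it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "survey.toy"
    sample.write_text("TOYXPS\n1.0\n2.5\n", encoding="utf-8")
    parsed = ToyTextParser().parse_file(sample)
```

Because both hooks are abstract, instantiating ParserBase directly raises TypeError, so a subclass that forgets one of them fails early rather than at parse time.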
Output format
Every _XPSParser populates self._data, a dict[str, ParsedSpectrum]. Keys are NeXus entry names (e.g. "SampleName__Survey"). Each ParsedSpectrum holds three fields:
- data: channel-averaged scan data as an xr.DataArray with required dimensions ("cycle", "scan") followed by one or more physical axes (typically "energy"). Use n_cycles=1 for formats without an explicit loop structure.
- raw: optional per-channel data with required dimensions ("cycle", "scan", "channel") plus the same physical axes as data.
- metadata: flat dict[str, Any] of canonical key-value pairs for @attrs: lookups in config files.
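The dimension rules for data and raw can be expressed as a small check on dimension names. The sketch below works on plain tuples rather than xr.DataArray objects. The helper names check_dims and check_spectrum are hypothetical, and it is an assumption that validation looks anything like this in the library.

```python
REQUIRED_DATA_DIMS = ("cycle", "scan")
REQUIRED_RAW_DIMS = ("cycle", "scan", "channel")


def check_dims(dims, required):
    """Validate the leading dimensions and return the trailing physical axes."""
    dims = tuple(dims)
    if dims[: len(required)] != required:
        raise ValueError(f"dims must start with {required}, got {dims}")
    axes = dims[len(required):]
    if not axes:
        raise ValueError("at least one physical axis (e.g. 'energy') is required")
    return axes


def check_spectrum(data_dims, raw_dims=None):
    """Check data dims and, if present, that raw shares the same physical axes."""
    axes = check_dims(data_dims, REQUIRED_DATA_DIMS)
    if raw_dims is not None and check_dims(raw_dims, REQUIRED_RAW_DIMS) != axes:
        raise ValueError("raw must carry the same physical axes as data")
    return axes


axes = check_spectrum(
    ("cycle", "scan", "energy"),
    ("cycle", "scan", "channel", "energy"),
)
```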
ParsedSpectrum exposes pre-built aggregations:
| Method | Returns |
|---|---|
| average() | Mean across all cycles and scans, shape (*axes,) |
| errors() | Standard deviation across all cycles and scans, shape (*axes,) |
| scan_average() | Mean across scans within each cycle, shape (cycle, *axes) |
| cycle_average() | Scan average, then mean across cycles, shape (*axes,) |
These are consumed by XPSReader to fill the NXprocess groups defined in the config
(PROCESS[scan_averaging], PROCESS[cycle_averaging], etc.).
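To make the four aggregations concrete, the sketch below reproduces them on a nested list indexed as data[cycle][scan][energy]. It uses only the standard library instead of xr.DataArray, the functions are free-standing illustrations rather than the ParsedSpectrum methods, and the use of the population standard deviation in errors() is an assumption.

```python
from statistics import fmean, pstdev

# Toy counts: 2 cycles x 2 scans x 3 energy points.
data = [
    [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    [[5.0, 6.0, 7.0], [7.0, 8.0, 9.0]],
]


def scan_average(data):
    """Mean across scans within each cycle: shape (cycle, energy)."""
    return [[fmean(column) for column in zip(*cycle)] for cycle in data]


def cycle_average(data):
    """Scan average, then mean across cycles: shape (energy,)."""
    return [fmean(column) for column in zip(*scan_average(data))]


def average(data):
    """Mean across all cycles and scans at once: shape (energy,)."""
    all_scans = [scan for cycle in data for scan in cycle]
    return [fmean(column) for column in zip(*all_scans)]


def errors(data):
    """Standard deviation across all cycles and scans: shape (energy,).

    Assumption: population standard deviation (divide by N).
    """
    all_scans = [scan for cycle in data for scan in cycle]
    return [pstdev(column) for column in zip(*all_scans)]
```

When every cycle holds the same number of scans, as here, average() and cycle_average() give identical results. They differ only when cycles contain unequal numbers of scans, because cycle_average() then weights each cycle equally rather than each scan.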
Version awareness
Parsers that need to constrain which file versions they accept declare supported_versions, a collection of version intervals.
Intervals are half-open: the lower bound is inclusive, the upper bound exclusive.
A None upper bound means unbounded (>= lower).
When supported_versions is empty (the default), all files are accepted regardless
of whether they carry a version string. When non-empty, files without a version are
implicitly rejected — a declared range implies a version is required.
Version strings extracted from file headers are tokenized into comparable VersionTuple
objects by
normalize_version
before the range check.
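The acceptance rules above can be sketched as follows. This is an illustration under stated assumptions, not the library code: toy_normalize_version and is_supported are hypothetical names, versions are assumed to tokenize into tuples of integers, and intervals are assumed to be (lower, upper) pairs of such tuples.

```python
import re


def toy_normalize_version(text):
    """Tokenize a version string such as 'v4.63.1' into a comparable int tuple."""
    return tuple(int(token) for token in re.findall(r"\d+", text))


def is_supported(version, supported_versions):
    """Apply the half-open interval rules to an optional version string."""
    if not supported_versions:
        return True  # empty declaration: accept everything, versioned or not
    if version is None:
        return False  # a declared range implies a version is required
    parsed = toy_normalize_version(version)
    return any(
        lower <= parsed and (upper is None or parsed < upper)
        for lower, upper in supported_versions
    )


# Hypothetical declaration: accept [1.0, 2.0) and everything from 3.5 upward.
ranges = [((1, 0), (2, 0)), ((3, 5), None)]
```

Comparing tuples rather than strings avoids lexicographic surprises: as strings, "10.0" sorts before "9.0", while (10, 0) correctly compares greater than (9, 0).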
Layer 2: Metadata normalization
Raw key-value pairs differ in naming, unit encoding, and value representation across vendors. _MetadataContext is a stateless normalization engine that converts a (key, value) pair into a canonical (key, value, unit) triple through a fixed pipeline.
Normalization pipeline
Each step in _MetadataContext.format(key, value) runs in a fixed order:
| Step | What it does |
|---|---|
| 1. normalize_key | Look up key_map; fall back to PascalCase → snake_case conversion |
| 2. parse_value_and_unit | Split inline value+unit strings, e.g. "5.0 eV" → ("5.0", "eV") |
| 3. resolve_unit_from_key | Extract unit embedded in key, e.g. "energy [eV]" → key "energy", unit "eV" |
| 4. get_default_unit | Assign unit from default_units if none found yet |
| 5. map_unit | Normalize abbreviations via unit_map, e.g. "s-1" → "1/s", "norm" → None |
| 6. map_value | Apply converter functions from value_map, e.g. _convert_energy_scan_mode |
| 7. _format_value | Coerce numeric strings to int or float |
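The seven steps can be followed end to end in a compact sketch. ToyMetadataContext below is a hypothetical stand-in that mirrors only the step order from the table. The regular expressions, the handling of a trailing "[unit]" in step 1, the precedence of an inline unit over a key-embedded one, and all map contents are assumptions made for illustration.

```python
import re

_VALUE_UNIT = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s+(\S+)\s*$")
_KEY_UNIT = re.compile(r"^(.*?)\s*\[([^\]]+)\]$")


def to_snake_case(name):
    """PascalCase or spaced names -> snake_case."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).replace(" ", "_").lower()


def coerce_number(value):
    """Turn numeric strings into int or float; leave everything else untouched."""
    if isinstance(value, str):
        for cast in (int, float):
            try:
                return cast(value)
            except ValueError:
                continue
    return value


class ToyMetadataContext:
    """Hypothetical stand-in: stateless apart from its lookup tables."""

    def __init__(self, key_map, value_map, unit_map, default_units):
        self.key_map, self.value_map = key_map, value_map
        self.unit_map, self.default_units = unit_map, default_units

    def format(self, key, value):
        unit = None
        # 1. normalize_key (assumption: a trailing "[unit]" is kept for step 3)
        if key in self.key_map:
            key = self.key_map[key]
        else:
            name, bracket, rest = key.partition(" [")
            key = to_snake_case(name) + bracket + rest
        # 2. parse_value_and_unit
        if isinstance(value, str) and (match := _VALUE_UNIT.match(value)):
            value, unit = match.groups()
        # 3. resolve_unit_from_key (assumption: an inline unit takes precedence)
        if match := _KEY_UNIT.match(key):
            key, unit = match.group(1), unit or match.group(2)
        # 4. get_default_unit
        if unit is None:
            unit = self.default_units.get(key)
        # 5. map_unit
        unit = self.unit_map.get(unit, unit)
        # 6. map_value
        if key in self.value_map:
            value = self.value_map[key](value)
        # 7. _format_value
        return key, coerce_number(value), unit


# Hypothetical vendor tables.
ctx = ToyMetadataContext(
    key_map={"Epass": "pass_energy"},
    value_map={"energy_scan_mode": lambda v: {"FAT": "fixed_analyzer_transmission"}.get(v, v)},
    unit_map={"s-1": "1/s", "norm": None},
    default_units={"dwell_time": "s"},
)

results = [
    ctx.format("Epass", "5.0 eV"),
    ctx.format("Count Rate [s-1]", "1200"),
    ctx.format("DwellTime", "0.1"),
    ctx.format("EnergyScanMode", "FAT"),
]
```

Because the context holds only lookup tables and format() keeps no state between calls, each pair can be normalized and tested in isolation.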
Per-vendor contexts
Each vendor subpackage defines a module-level _context instance in
parsers/<vendor>/metadata.py:
```python
_context = _MetadataContext(
    key_map=_KEY_MAP,              # vendor key → canonical name
    value_map=_VALUE_MAP,          # canonical key → converter function
    unit_map=_UNIT_MAP,            # vendor unit string → standard unit
    default_units=_DEFAULT_UNITS,  # canonical key → unit when not stated explicitly
)
```
The shared converter functions (such as _convert_measurement_method and
_convert_energy_scan_mode) live in
mapping.py
and are reused across all vendor contexts.
Layer 3: Template mapping
After normalization, the canonical key-value pairs are written into a NeXus template using
the parser's JSON config file in
src/pynxtools_xps/config/.
The config maps flat dict keys to paths in the target application definition:
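A minimal sketch of such a mapping, written as a Python dict rather than a JSON file. The paths and metadata keys are invented for illustration and are not actual NXxps paths. fill_template is a hypothetical helper; in practice this resolution is performed by the pynxtools MultiFormatReader, whose rules are richer than shown. Only the @attrs: prefix is taken from the text above.

```python
# Hypothetical config fragment: target path -> literal value or "@attrs:<key>" rule.
toy_config = {
    "/ENTRY/instrument/analyzer/pass_energy": "@attrs:pass_energy",
    "/ENTRY/instrument/analyzer/pass_energy/@units": "@attrs:pass_energy/@units",
    "/ENTRY/instrument/analyzer/scan_mode": "@attrs:energy_scan_mode",
    "/ENTRY/definition": "NXxps",
}

# Flat canonical metadata, as produced by the normalization layer.
metadata = {
    "pass_energy": 5.0,
    "pass_energy/@units": "eV",
}


def fill_template(config, metadata):
    """Resolve "@attrs:" rules against flat metadata; copy literals unchanged."""
    prefix = "@attrs:"
    template = {}
    for path, rule in config.items():
        if isinstance(rule, str) and rule.startswith(prefix):
            key = rule[len(prefix):]
            if key in metadata:  # unresolved rules leave the path unset
                template[path] = metadata[key]
        else:
            template[path] = rule
    return template


template = fill_template(toy_config, metadata)
```

In this sketch a rule whose key is missing from the metadata leaves its path unfilled, which is where ELN-provided values would step in for required fields.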
Learn more about these config files in the pynxtools documentation: pynxtools > Learn > ... > The MultiFormatReader.
XPSReader, the pynxtools reader plugin, is a subclass of the pynxtools MultiFormatReader. It uses this config to fill the template, merges in ELN-provided metadata for any required fields absent from the raw data, and writes the final .nxs HDF5 file via the pynxtools dataconverter.
The supported application definitions are:
- NXxps: the primary target, specialized for X-ray photoelectron spectroscopy
- NXmpes: the multi-technique photoemission superset
Typed intermediate representation
Two typed structures enforce correctness during and after parsing:
_XPSDataclass: vendor-internal data models.
Dataclasses for each logical record in a format (header, spectrum region, …) inherit from _XPSDataclass, which enforces type annotations at assignment time. It coerces compatible values (for example, str → int) and raises TypeError for values that cannot be converted, keeping parsing logic free of ad-hoc type guards. These models are internal to each vendor subpackage (in data_model.py).
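One way to obtain this behavior is to intercept attribute assignment and coerce against the class annotations. The sketch below is an assumption about the general technique, not the _XPSDataclass source: CoercingModel and ToyRegion are hypothetical names, and only plain types such as str, int, and float are handled.

```python
from dataclasses import dataclass
from typing import get_type_hints


class CoercingModel:
    """Toy base class: coerce values to the annotated type on every assignment."""

    def __setattr__(self, name, value):
        expected = get_type_hints(type(self)).get(name)
        # Only plain classes are handled here; generics are passed through.
        if isinstance(expected, type) and not isinstance(value, expected):
            try:
                value = expected(value)
            except (TypeError, ValueError) as exc:
                raise TypeError(
                    f"{name!r} expects {expected.__name__}, got {value!r}"
                ) from exc
        super().__setattr__(name, value)


@dataclass
class ToyRegion(CoercingModel):
    """Hypothetical record for one spectrum region."""

    name: str = ""
    n_scans: int = 0
    pass_energy: float = 0.0


# The generated __init__ assigns through __setattr__, so coercion applies there too.
region = ToyRegion(name="Survey", n_scans="12", pass_energy="5.0")
```

With this in place, parsing code can assign raw header strings directly and rely on the model to convert or reject them, instead of wrapping each field in its own conversion.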
ParsedSpectrum: the public output of every _XPSParser.
After a parser's _parse() runs, results are exposed as dict[str, ParsedSpectrum], accessible via the data property. See the Output format section under Layer 1 above for the full field specification.