Parser architecture¶

Software architecture¶

Mapping between vendor-specific representations and community-agreed data models is a central challenge in research data management. pynxtools-xps solves this for X-ray photoelectron spectroscopy through a three-layer pipeline: parsing, metadata normalization, and template mapping.

The implementation is organized in three explicit layers, each with a clearly bounded responsibility:

Parsing — vendor-specific _XPSParser subclasses extract spectra from raw files into typed ParsedSpectrum objects.
Normalization — per-vendor _MetadataContext instances canonicalize key names, values, and units through a deterministic pipeline.
Templating — JSON config files map canonical keys to NeXus paths in NXxps or NXmpes, which XPSReader uses to write the HDF5 output.

This separation keeps all vendor-specific knowledge confined to parsers/<vendor>/ subpackages and makes the normalization logic independently testable.

flowchart LR A[Vendor file\n.sle / .vms / .xy / …] --> B[Parser\n_XPSParser] B --> C[ParsedSpectrum objects\ndict keyed by entry name] C --> D[_MetadataContext\nnormalize key / value / unit] D --> E[Canonical key-value pairs] E --> F[JSON config\nNeXus paths] F --> G[`NXxps` / `NXmpes`\n.nxs HDF5 file]

Layer 1 — File parsing¶

Each supported format has a dedicated parser in src/pynxtools_xps/parsers/<vendor>/parser.py.

Parser hierarchy¶

All parsers inherit from _Parser, the abstract base class that defines the extension, version, and structure validation contract. It declares two @abstractmethod hooks that every subclass must implement: matches_file() (positive format identification) and _parse() (data extraction). Two concrete base classes specialize it:

_XPSParser — primary data parsers.
_XPSMetadataParser — supplementary parsers for auxiliary files (for example, CasaXPS quantification exports). They inject additional metadata into already-parsed data via update_main_file_data().

Output format¶

Every _XPSParser populates self._data, a dict[str, ParsedSpectrum]. Keys are NeXus entry names (e.g. "SampleName__Survey"). Each ParsedSpectrum holds three fields:

data — channel-averaged scan data as an xr.DataArray with required dimensions ("cycle", "scan") followed by one or more physical axes (typically "energy"). Use n_cycles=1 for formats without an explicit loop structure.
raw — optional per-channel data with required dimensions ("cycle", "scan", "channel") plus the same physical axes as data.
metadata — flat dict[str, Any] of canonical key-value pairs for @attrs: lookups in config files.

ParsedSpectrum exposes pre-built aggregations:

Method	Returns
`average()`	Mean across all cycles and scans — shape `(*axes,)`
`errors()`	Standard deviation across all cycles and scans — shape `(*axes,)`
`scan_average()`	Mean across scans within each cycle — shape `(cycle, *axes)`
`cycle_average()`	Scan average then mean across cycles — shape `(*axes,)`

These are consumed by XPSReader to fill the NXprocess groups defined in the config (PROCESS[scan_averaging], PROCESS[cycle_averaging], etc.).

Version awareness¶

Parsers that need to constrain which file versions they accept declare supported_versions:

supported_versions = (
    ((1, 1), (4, 0)),   # [1.1, 4.0)
    ((4, 1), (4, 101)), # [4.1, 4.101)
)

Intervals are half-open: the lower bound is inclusive, the upper bound exclusive. A None upper bound means unbounded (>= lower).

When supported_versions is empty (the default), all files are accepted regardless of whether they carry a version string. When non-empty, files without a version are implicitly rejected — a declared range implies a version is required.

Version strings extracted from file headers are tokenized into comparable VersionTuple objects by normalize_version before the range check.

Layer 2 — Metadata normalization¶

Raw key-value pairs differ in naming, unit encoding, and value representation across vendors. _MetadataContext is a stateless normalization engine that converts a (key, value) pair into a canonical (key, value, unit) triple through a fixed pipeline.

Normalization pipeline¶

Each step in _MetadataContext.format(key, value) runs in a fixed order:

Step	What it does
1. `normalize_key`	Look up `key_map`; fall back to PascalCase → snake_case conversion
2. `parse_value_and_unit`	Split inline value+unit strings, e.g. `"5.0 eV"` → `("5.0", "eV")`
3. `resolve_unit_from_key`	Extract unit embedded in key, e.g. `"energy [eV]"` → key `"energy"`, unit `"eV"`
4. `get_default_unit`	Assign unit from `default_units` if none found yet
5. `map_unit`	Normalize abbreviations via `unit_map`, e.g. `"s-1"` → `"1/s"`, `"norm"` → `None`
6. `map_value`	Apply converter functions from `value_map`, e.g. `_convert_energy_scan_mode`
7. `_format_value`	Coerce numeric strings to `int` or `float`

Per-vendor contexts¶

Each vendor subpackage defines a module-level _context instance in parsers/<vendor>/metadata.py:

_context = _MetadataContext(
    key_map=_KEY_MAP,           # vendor key → canonical name
    value_map=_VALUE_MAP,       # canonical key → converter function
    unit_map=_UNIT_MAP,         # vendor unit string → standard unit
    default_units=_DEFAULT_UNITS,  # canonical key → unit when not stated explicitly
)

The shared converter functions (such as _convert_measurement_method and _convert_energy_scan_mode) live in mapping.py and are reused across all vendor contexts.

Layer 3 — Template mapping¶

After normalization, the canonical key-value pairs are written into a NeXus template using the parser's JSON config file in src/pynxtools_xps/config/. The config maps flat dict keys to paths in the target application definition:

"/ENTRY[entry]/instrument/analyser/pass_energy": "pass_energy"

Learn more about these config files in the pynxtools documentation: pynxtools > Learn > ... > The MultiFormatReader.

XPSReader (the pynxtools reader plugin), which is a subclass of the pynxtools MultiFormatReader, uses this config to fill the template, merges in ELN-provided metadata for any required fields absent from the raw data, and writes the final .nxs HDF5 file via the pynxtools dataconverter.

The supported application definitions are:

NXxps — the primary target, specialized for X-ray photoelectron spectroscopy
NXmpes — the multi-technique photoemission superset

Typed intermediate representation¶

Two typed structures enforce correctness during and after parsing:

_XPSDataclass — vendor-internal data models. Dataclasses for each logical record in a format (header, spectrum region, …) inherit from _XPSDataclass, which enforces type annotations at assignment time. It coerces compatible values (for example, str → int) and raises TypeError for values that cannot be converted, keeping parsing logic free of ad-hoc type guards. These models are internal to each vendor subpackage (in data_model.py).

ParsedSpectrum — the public output of every _XPSParser. After a parser's _parse() runs, results are exposed as dict[str, ParsedSpectrum] accessible via the data property. See Layer 1 — Output format above for the full field specification.