Parser architecture

Software architecture

Mapping between vendor-specific representations and community-agreed data models is a central challenge in research data management. pynxtools-xps solves this for X-ray photoelectron spectroscopy through a three-layer pipeline: parsing, metadata normalization, and template mapping.

The implementation is organized in three explicit layers, each with a clearly bounded responsibility:

  1. Parsing — vendor-specific _XPSParser subclasses extract spectra from raw files into typed ParsedSpectrum objects.
  2. Normalization — per-vendor _MetadataContext instances canonicalize key names, values, and units through a deterministic pipeline.
  3. Templating — JSON config files map canonical keys to NeXus paths in NXxps or NXmpes, which XPSReader uses to write the HDF5 output.

This separation keeps all vendor-specific knowledge confined to parsers/<vendor>/ subpackages and makes the normalization logic independently testable.

flowchart LR
    A[Vendor file\n.sle / .vms / .xy / …] --> B[Parser\n_XPSParser]
    B --> C[ParsedSpectrum objects\ndict keyed by entry name]
    C --> D[_MetadataContext\nnormalize key / value / unit]
    D --> E[Canonical key-value pairs]
    E --> F[JSON config\nNeXus paths]
    F --> G[`NXxps` / `NXmpes`\n.nxs HDF5 file]

Layer 1 — File parsing

Each supported format has a dedicated parser in src/pynxtools_xps/parsers/<vendor>/parser.py.

Parser hierarchy

All parsers inherit from _Parser, the abstract base class that defines the extension, version, and structure validation contract. It declares two @abstractmethod hooks that every subclass must implement: matches_file() (positive format identification) and _parse() (data extraction). Two concrete base classes specialize it:

  • _XPSParser — primary data parsers.
  • _XPSMetadataParser — supplementary parsers for auxiliary files (for example, CasaXPS quantification exports). They inject additional metadata into already-parsed data via update_main_file_data().
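The contract described above can be illustrated with a minimal sketch. The names BaseParser, ToyXYParser, parse_file, and the "# toy-xy" header line are hypothetical stand-ins invented for this example; only the two hooks matches_file() and _parse() come from the description above, and their real signatures in pynxtools-xps may differ.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class BaseParser(ABC):
    """Hypothetical stand-in for _Parser: declares the two abstract hooks."""

    def __init__(self) -> None:
        self._data: dict[str, object] = {}

    @abstractmethod
    def matches_file(self, path: Path) -> bool:
        """Return True only if the file is positively identified."""

    @abstractmethod
    def _parse(self, path: Path) -> None:
        """Extract data from the file into self._data."""

    def parse_file(self, path: Path) -> dict[str, object]:
        # Hypothetical driver: identify first, then extract.
        if not self.matches_file(path):
            raise ValueError(f"{path.name} is not a supported file")
        self._parse(path)
        return self._data


class ToyXYParser(BaseParser):
    """Toy format: a magic header line followed by 'energy intensity' rows."""

    MAGIC = "# toy-xy"

    def matches_file(self, path: Path) -> bool:
        with path.open(encoding="utf-8") as handle:
            return handle.readline().startswith(self.MAGIC)

    def _parse(self, path: Path) -> None:
        rows = [
            tuple(float(token) for token in line.split())
            for line in path.read_text(encoding="utf-8").splitlines()
            if line.strip() and not line.startswith("#")
        ]
        self._data["Sample__Survey"] = rows


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "survey.xy"
    path.write_text("# toy-xy\n1486.0 10\n1485.5 12\n", encoding="utf-8")
    spectra = ToyXYParser().parse_file(path)
```

Because both hooks are abstract, instantiating BaseParser directly raises TypeError, so a subclass that forgets either hook fails at construction time rather than partway through parsing.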

Output format

Every _XPSParser populates self._data, a dict[str, ParsedSpectrum]. Keys are NeXus entry names (e.g. "SampleName__Survey"). Each ParsedSpectrum holds three fields:

  • data — channel-averaged scan data as an xr.DataArray with required dimensions ("cycle", "scan") followed by one or more physical axes (typically "energy"). Use n_cycles=1 for formats without an explicit loop structure.
  • raw — optional per-channel data with required dimensions ("cycle", "scan", "channel") plus the same physical axes as data.
  • metadata — flat dict[str, Any] of canonical key-value pairs for @attrs: lookups in config files.

ParsedSpectrum exposes pre-built aggregations:

| Method | Returns |
|---|---|
| average() | Mean across all cycles and scans; shape (*axes,) |
| errors() | Standard deviation across all cycles and scans; shape (*axes,) |
| scan_average() | Mean across scans within each cycle; shape (cycle, *axes) |
| cycle_average() | Scan average, then mean across cycles; shape (*axes,) |

These are consumed by XPSReader to fill the NXprocess groups defined in the config (PROCESS[scan_averaging], PROCESS[cycle_averaging], etc.).
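To make the shapes concrete, here is a dependency-free sketch of the three mean-based aggregations. Plain nested lists stand in for the xr.DataArray, and the function bodies are illustrative assumptions rather than the library's implementation. errors() is left out because the text does not say which standard-deviation convention it uses.

```python
from statistics import fmean

# data[cycle][scan][energy_index]: 2 cycles x 2 scans x 3 energy points
data = [
    [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    [[5.0, 6.0, 7.0], [7.0, 8.0, 9.0]],
]


def scan_average(data):
    """Mean across scans within each cycle; result shape (cycle, energy)."""
    return [[fmean(column) for column in zip(*cycle)] for cycle in data]


def cycle_average(data):
    """Scan average, then mean across cycles; result shape (energy,)."""
    return [fmean(column) for column in zip(*scan_average(data))]


def average(data):
    """Mean across all cycles and scans at once; result shape (energy,)."""
    scans = [scan for cycle in data for scan in cycle]
    return [fmean(column) for column in zip(*scans)]


per_cycle = scan_average(data)   # [[2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]
two_step = cycle_average(data)   # [4.0, 5.0, 6.0]
overall = average(data)          # [4.0, 5.0, 6.0]
```

In this sketch the two-step cycle average and the overall average coincide because every cycle contains the same number of scans. With uneven scan counts they would generally differ.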

Version awareness

Parsers that need to constrain which file versions they accept declare supported_versions:

supported_versions = (
    ((1, 1), (4, 0)),   # [1.1, 4.0)
    ((4, 1), (4, 101)), # [4.1, 4.101)
)

Intervals are half-open: the lower bound is inclusive, the upper bound exclusive. A None upper bound means unbounded (>= lower).

When supported_versions is empty (the default), all files are accepted regardless of whether they carry a version string. When non-empty, files without a version are implicitly rejected — a declared range implies a version is required.

Version strings extracted from file headers are tokenized into comparable VersionTuple objects by normalize_version before the range check.
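The acceptance rules above can be sketched as follows. tokenize_version and is_supported are hypothetical helpers written for this example. The real normalize_version may tokenize differently, for instance by handling non-numeric suffixes.

```python
import re
from typing import Optional

VersionTuple = tuple[int, ...]
VersionRange = tuple[VersionTuple, Optional[VersionTuple]]

SUPPORTED = (
    ((1, 1), (4, 0)),    # [1.1, 4.0)
    ((4, 1), (4, 101)),  # [4.1, 4.101)
)


def tokenize_version(text: str) -> VersionTuple:
    """Hypothetical stand-in for normalize_version: '4.100.2' -> (4, 100, 2)."""
    return tuple(int(token) for token in re.findall(r"\d+", text))


def is_supported(version: Optional[str], ranges: tuple[VersionRange, ...]) -> bool:
    if not ranges:
        return True   # no declared ranges: accept everything
    if version is None:
        return False  # declared ranges imply a version is required
    parsed = tokenize_version(version)
    return any(
        lower <= parsed and (upper is None or parsed < upper)
        for lower, upper in ranges
    )
```

Tuples compare component by component, not as decimal numbers, so "4.100.2" falls inside [4.1, 4.101) while "4.0" and "4.101" are rejected because upper bounds are exclusive. A range such as ((5, 0), None) accepts every version from 5.0 upward.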


Layer 2 — Metadata normalization

Raw key-value pairs differ in naming, unit encoding, and value representation across vendors. _MetadataContext is a stateless normalization engine that converts a (key, value) pair into a canonical (key, value, unit) triple through a fixed pipeline.

Normalization pipeline

Each step in _MetadataContext.format(key, value) runs in a fixed order:

| Step | What it does |
|---|---|
| 1. normalize_key | Look up key_map; fall back to PascalCase → snake_case conversion |
| 2. parse_value_and_unit | Split inline value+unit strings, e.g. "5.0 eV" → ("5.0", "eV") |
| 3. resolve_unit_from_key | Extract unit embedded in key, e.g. "energy [eV]" → key "energy", unit "eV" |
| 4. get_default_unit | Assign unit from default_units if none found yet |
| 5. map_unit | Normalize abbreviations via unit_map, e.g. "s-1" → "1/s", "norm" → None |
| 6. map_value | Apply converter functions from value_map, e.g. _convert_energy_scan_mode |
| 7. _format_value | Coerce numeric strings to int or float |
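The seven steps can be approximated with a small self-contained sketch. Everything here is an assumption made for illustration: the map contents, the regular expressions, the lambda converter, and the function format_pair are not taken from pynxtools-xps and only mirror the order of the steps.

```python
import re

# Hypothetical vendor maps, for illustration only.
KEY_MAP = {"Pass Energy": "pass_energy"}
UNIT_MAP = {"s-1": "1/s", "norm": None}
DEFAULT_UNITS = {"dwell_time": "s"}
VALUE_MAP = {"scan_mode": lambda value: value.lower().replace(" ", "_")}

NUMBER_WITH_UNIT = re.compile(r"\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s+(\S+)\s*")


def normalize_key(key):
    if key in KEY_MAP:
        return KEY_MAP[key]
    # Convert only the name part so a trailing " [unit]" survives for step 3.
    name, bracket, rest = key.partition(" [")
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower() + bracket + rest


def parse_value_and_unit(value):
    if isinstance(value, str):
        match = NUMBER_WITH_UNIT.fullmatch(value)
        if match:
            return match.group(1), match.group(2)
    return value, None


def resolve_unit_from_key(key):
    match = re.fullmatch(r"(.+?)\s*\[(.+)\]", key)
    return (match.group(1), match.group(2)) if match else (key, None)


def format_value(value):
    if isinstance(value, str):
        for cast in (int, float):
            try:
                return cast(value)
            except ValueError:
                continue
    return value


def format_pair(key, value):
    key = normalize_key(key)                      # step 1
    value, unit = parse_value_and_unit(value)     # step 2
    key, key_unit = resolve_unit_from_key(key)    # step 3
    unit = unit or key_unit
    unit = unit or DEFAULT_UNITS.get(key)         # step 4
    unit = UNIT_MAP.get(unit, unit)               # step 5
    if key in VALUE_MAP:
        value = VALUE_MAP[key](value)             # step 6
    return key, format_value(value), unit         # step 7
```

With these toy maps, format_pair("Pass Energy", "5.0 eV") yields ("pass_energy", 5.0, "eV") and format_pair("CountRate", "12 s-1") yields ("count_rate", 12, "1/s"). Keeping each step a pure function is what allows the pipeline to be tested without touching any vendor file.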

Per-vendor contexts

Each vendor subpackage defines a module-level _context instance in parsers/<vendor>/metadata.py:

_context = _MetadataContext(
    key_map=_KEY_MAP,           # vendor key → canonical name
    value_map=_VALUE_MAP,       # canonical key → converter function
    unit_map=_UNIT_MAP,         # vendor unit string → standard unit
    default_units=_DEFAULT_UNITS,  # canonical key → unit when not stated explicitly
)

The shared converter functions (such as _convert_measurement_method and _convert_energy_scan_mode) live in mapping.py and are reused across all vendor contexts.


Layer 3 — Template mapping

After normalization, the canonical key-value pairs are written into a NeXus template using the parser's JSON config file in src/pynxtools_xps/config/. The config maps flat dict keys to paths in the target application definition:

"/ENTRY[entry]/instrument/analyser/pass_energy": "pass_energy"

Learn more about these config files in the pynxtools documentation: pynxtools > Learn > ... > The MultiFormatReader.

XPSReader, the pynxtools reader plugin, is a subclass of the pynxtools MultiFormatReader. It uses this config to fill the template, merges in ELN-provided metadata for any required fields absent from the raw data, and writes the final .nxs HDF5 file via the pynxtools dataconverter.
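A simplified picture of the fill-and-merge step is sketched below. fill_template is a hypothetical function, the sample-name path is invented for the example, and keying the ELN data by NeXus path is an assumption. The actual logic lives in the pynxtools MultiFormatReader and is considerably richer.

```python
ATTRS_PREFIX = "@attrs:"

config = {
    "/ENTRY[entry]/instrument/analyser/pass_energy": "pass_energy",
    # Hypothetical second entry, showing the "@attrs:" lookup form.
    "/ENTRY[entry]/sample/name": "@attrs:sample_name",
}


def fill_template(config, metadata, eln=None):
    """Resolve each NeXus path from parsed metadata, falling back to ELN data."""
    eln = eln or {}
    template = {}
    for nexus_path, source in config.items():
        # Strip the optional lookup prefix to obtain the canonical key.
        key = source[len(ATTRS_PREFIX):] if source.startswith(ATTRS_PREFIX) else source
        if key in metadata:
            template[nexus_path] = metadata[key]
        elif nexus_path in eln:
            template[nexus_path] = eln[nexus_path]
    return template


metadata = {"pass_energy": 20.0}                 # canonical output of layer 2
eln = {"/ENTRY[entry]/sample/name": "Au foil"}   # user-supplied fallback
template = fill_template(config, metadata, eln)
```

Here pass_energy comes from the parsed file, while the sample name is missing from the raw data and is therefore taken from the ELN entry.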

The supported application definitions are:

  • NXxps — the primary target, specialized for X-ray photoelectron spectroscopy
  • NXmpes — the multi-technique photoemission superset

Typed intermediate representation

Two typed structures enforce correctness during and after parsing:

_XPSDataclass — vendor-internal data models. Dataclasses for each logical record in a format (header, spectrum region, …) inherit from _XPSDataclass, which enforces type annotations at assignment time. It coerces compatible values (for example, str → int) and raises TypeError for values that cannot be converted, keeping parsing logic free of ad-hoc type guards. These models are internal to each vendor subpackage (in data_model.py).
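One way such assignment-time enforcement could work is sketched below. CoercingBase and RegionHeader are hypothetical names, and the mechanism (overriding __setattr__ and casting to the annotated type) is an assumption about the approach, not the actual _XPSDataclass code. The sketch only handles plain class annotations such as int, float, and str.

```python
from dataclasses import dataclass
from typing import get_type_hints


class CoercingBase:
    """Hypothetical stand-in for _XPSDataclass: coerce or reject on assignment."""

    def __setattr__(self, name, value):
        expected = get_type_hints(type(self)).get(name)
        if isinstance(expected, type) and not isinstance(value, expected):
            try:
                value = expected(value)  # e.g. "3" becomes 3 for an int field
            except (TypeError, ValueError) as exc:
                raise TypeError(
                    f"{name!r} expects {expected.__name__}, got {value!r}"
                ) from exc
        super().__setattr__(name, value)


@dataclass
class RegionHeader(CoercingBase):
    name: str
    pass_energy: float
    n_scans: int


# The generated __init__ assigns through __setattr__, so strings read
# from a file are coerced on construction as well as on later assignment.
header = RegionHeader(name="Survey", pass_energy="20.0", n_scans="3")
```

After construction, header.pass_energy is the float 20.0 and header.n_scans is the int 3, while a later assignment such as header.n_scans = "many" raises TypeError.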

ParsedSpectrum — the public output of every _XPSParser. After a parser's _parse() runs, results are exposed as dict[str, ParsedSpectrum] accessible via the data property. See Layer 1 — Output format above for the full field specification.