Build a parser¶

Is your data format not supported yet? Don't worry, this how-to guides you through writing a reader for your own data.

This guide covers two scenarios:

Your format is already partially supported — you only need to adjust how fields are mapped to NeXus.
Your format is not supported at all — you need to write a new parser from scratch.

Your format is already supported, but some fields differ¶

Good! The basic functionality to read your data is already in place.

The simplest fix is to supply a custom ELN YAML file with the missing or corrected values. For structural mismatches, consider one of the following before writing a new parser:

Modify a config file. Each vendor has a JSON config in src/pynxtools_xps/config/ that maps flat dict keys to NeXus paths. Adjusting the mapping there is often enough.
Open a pull request. If your change benefits all users of that format, propose it on the GitHub repository.

Adding a completely new format¶

Adding a new format requires creating a vendor subpackage, registering the parser, and providing test data.

1. Set up a development environment¶

Follow the instructions in the development guide.

2. Create the vendor subpackage¶

Add a new directory under src/pynxtools_xps/parsers/ with the following layout:

parsers/<vendor>/
├── __init__.py
├── data_model.py   # typed intermediate representation
├── metadata.py     # _MetadataContext instance
└── parser.py       # _XPSParser subclass

Use an existing parser (for example, parsers/vms/ or parsers/phi/) as a reference for the expected structure.

3. Define typed data models (`data_model.py`)¶

Create dataclasses for each logical record in your format (header, spectrum, region, …) by subclassing _XPSDataclass. _XPSDataclass enforces type annotations at assignment time and coerces compatible types automatically. See parsers/phi/data_model.py for a worked example.

Note

Defining a data model is recommended but optional. It is only applicable when a vendor format always has the same structure, so that the model can be written down explicitly.
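
To illustrate what "enforces type annotations at assignment time and coerces compatible types" can look like, here is a minimal, self-contained sketch. It does not use the real _XPSDataclass: TypedRecord and SpectrumHeader, along with all field names, are hypothetical stand-ins. In an actual data model you would subclass _XPSDataclass instead.

```python
from dataclasses import dataclass, fields
from typing import Any


class TypedRecord:
    """Hypothetical stand-in for _XPSDataclass: coerce values on assignment."""

    def __setattr__(self, name: str, value: Any) -> None:
        # Look up the annotated type of the field being assigned.
        expected = {f.name: f.type for f in fields(self)}.get(name)
        if isinstance(expected, type) and not isinstance(value, expected):
            try:
                value = expected(value)  # e.g. "1486.6" -> 1486.6
            except (TypeError, ValueError) as exc:
                raise TypeError(
                    f"{name!r} expects {expected.__name__}, got {value!r}"
                ) from exc
        super().__setattr__(name, value)


@dataclass
class SpectrumHeader(TypedRecord):
    # Field names are illustrative, not taken from a real vendor format.
    region: str = ""
    excitation_energy: float = 0.0
    n_scans: int = 1


# Strings read from a file are coerced to the annotated types.
header = SpectrumHeader(region="Au4f", excitation_energy="1486.6", n_scans="3")
```

Assigning a value that cannot be coerced, such as header.n_scans = "abc", raises a TypeError in this sketch.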

4. Define the normalization context (`metadata.py`)¶

Create a module-level _context instance of _MetadataContext with four mappings:

key_map — vendor-specific key names → canonical names
value_map — canonical key → converter function (use the shared converters from mapping.py where possible: _convert_measurement_method, _convert_energy_scan_mode, etc.)
unit_map — vendor unit strings → standard units (None = dimensionless/drop)
default_units — canonical key → unit when the value carries none

See parsers/vms/metadata.py for a typical example.
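
The interplay of the four mappings can be sketched with plain dictionaries. This is only an illustration of the idea: the constructor of _MetadataContext is not shown here, the normalize helper is hypothetical, and the vendor keys, converters, and units are invented for the example.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical vendor mappings; none of these keys come from a real format.
key_map: Dict[str, str] = {
    "Pass Energy": "pass_energy",
    "Scan Mode": "energy_scan_mode",
}
value_map: Dict[str, Callable[[Any], Any]] = {
    "pass_energy": float,
    "energy_scan_mode": lambda v: str(v).strip().lower().replace(" ", "_"),
}
unit_map: Dict[str, Optional[str]] = {"EV": "eV", "counts": None}
default_units: Dict[str, str] = {"pass_energy": "eV"}


def normalize(
    raw_key: str, raw_value: Any, raw_unit: Optional[str] = None
) -> Tuple[str, Any, Optional[str]]:
    """Return (canonical_key, converted_value, unit) for one metadata item."""
    key = key_map.get(raw_key, raw_key)          # vendor name -> canonical name
    convert = value_map.get(key)
    value = convert(raw_value) if convert else raw_value
    if raw_unit is None:
        unit = default_units.get(key)            # value carried no unit
    else:
        unit = unit_map.get(raw_unit, raw_unit)  # None means dimensionless
    return key, value, unit
```

For instance, normalize("Pass Energy", "20") yields the canonical key pass_energy, the float 20.0, and the default unit eV.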

5. Implement the parser (`parser.py`)¶

Start by subclassing _XPSParser. Implement the following required class variables and methods:

supported_vendor — a string naming the vendor or data provider that your parser supports
supported_file_extensions — a tuple of file extensions that your parser supports
matches_file(file_path) — return True only if the file unambiguously conforms to your format (check headers, magic bytes, or required keywords). The implementation must be fast (read at most a few KB) and must never propagate exceptions — catch all errors and return False.
_parse(file_path) — extract all spectra and populate self._data as a dict[str, ParsedSpectrum], where keys are NeXus entry names (e.g. "Au4f__Survey") and each value is a ParsedSpectrum holding:
data — channel-averaged scan data as xr.DataArray with dims (cycle, scan, energy)
raw — optional per-channel data with dims (cycle, scan, channel, energy)
metadata — flat dict[str, Any] of canonical key-value pairs
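
The requirements on matches_file (cheap probe, no propagated exceptions) can be sketched as follows. The example is written as a standalone function for brevity; in a real parser the same logic would live on your _XPSParser subclass. The MAGIC marker and the 4 KB limit are assumptions chosen for illustration.

```python
from pathlib import Path

MAGIC = b"HYPOTHETICAL-XPS-FORMAT"  # assumed header marker, not a real format
MAX_PROBE_BYTES = 4096              # read only the first few KB


def matches_file(file_path: str) -> bool:
    """Return True only if the file starts with the expected marker."""
    try:
        with Path(file_path).open("rb") as handle:
            head = handle.read(MAX_PROBE_BYTES)
        # Ignore leading whitespace, then require the marker at the start.
        return head.lstrip().startswith(MAGIC)
    except Exception:
        # Missing file, permission error, directory, ...: never propagate.
        return False
```

Reading a bounded prefix in binary mode keeps the check fast even for large files and avoids decoding errors on binary formats.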

If your format embeds a version string, override detect_version(file_path) to return a VersionTuple (produced by normalize_version) and declare supported_versions as a class variable. Declaring non-empty ranges also implicitly requires the file to carry a version.
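
The version check described above can be pictured with the following sketch. It deliberately avoids the real normalize_version and VersionTuple, whose exact signatures are not shown here: parse_version and is_supported are hypothetical helpers, and the range layout (inclusive minimum, exclusive maximum, None for open-ended) is an assumption.

```python
import re
from typing import Optional, Sequence, Tuple

Version = Tuple[int, ...]
# Assumed layout: (minimum inclusive, maximum exclusive or None for open-ended).
VersionRange = Tuple[Version, Optional[Version]]


def parse_version(text: str) -> Optional[Version]:
    """Extract the first dotted version number, e.g. 'v1.12' -> (1, 12)."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group().split("."))


def is_supported(version: Optional[Version], ranges: Sequence[VersionRange]) -> bool:
    """Check a detected version against the declared ranges."""
    if not ranges:
        return True   # no declared ranges: accept any file
    if version is None:
        return False  # ranges declared, but the file carries no version
    return any(
        low <= version and (high is None or version < high)
        for low, high in ranges
    )
```

Comparing versions as integer tuples avoids the pitfalls of string comparison, where "1.10" would sort before "1.9".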

Set config_file to the name of the JSON config you will create in step 7.

6. Register the parser¶

Import and re-export the new parser class from parsers/__init__.py.
Import the parser in reader.py and add it to the parsers list inside XPSReader. There is no separate mapper layer — parsers are registered directly.
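
Conceptually, a list of registered parsers allows the reader to pick a parser by asking each one whether it recognizes the file. The sketch below illustrates that idea only; it is not the actual XPSReader code. BaseParser, VendorAParser, VendorBParser, select_parser, and the first-match strategy are all assumptions made for the example.

```python
from typing import List, Optional, Tuple, Type


class BaseParser:
    """Hypothetical stand-in for the parser base class."""

    supported_file_extensions: Tuple[str, ...] = ()

    @classmethod
    def matches_file(cls, file_path: str) -> bool:
        # Simplified check by extension; a real parser inspects file content.
        return file_path.endswith(cls.supported_file_extensions)


class VendorAParser(BaseParser):
    supported_file_extensions = (".vma",)


class VendorBParser(BaseParser):
    supported_file_extensions = (".vmb", ".vmb2")


# Registration amounts to adding the class to this list.
parsers: List[Type[BaseParser]] = [VendorAParser, VendorBParser]


def select_parser(file_path: str) -> Optional[Type[BaseParser]]:
    """Return the first registered parser that recognizes the file."""
    for parser in parsers:
        if parser.matches_file(file_path):
            return parser
    return None
```

Under this assumption, a parser that is not in the list is never consulted, which is why the registration step cannot be skipped.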

7. Add a config file¶

Create config/config_<vendor>.json, starting from config/template.json. Each entry maps a canonical dict key to the NeXus path in NXxps or NXmpes.

8. Add tests¶

Place at least one representative example file in tests/data/<vendor>/.
Add your test case to the test_cases list in tests/test_reader.py.
Generate the reference .nxs file by running the conversion once and inspecting the output. You may also modify scripts/generate_reference_files.sh to automate this.
Run the full test suite to confirm nothing is broken:

pytest -sv tests

9. Add documentation¶

Add a page to the References section of the documentation that describes what your parser does, which files and versions it supports, and how it fits into the existing set of parsers.

A dedicated mkdocs macro (in src/pynxtools_xps/mkdocs.py) automatically generates a table of the file formats and versions that a parser supports.

It can be called within your reference markdown file like this:

{{ parser_version_table("path.to.your.parser", "YourParser") }}

For example, for the SPECSSLEParser located in src/pynxtools_xps/parsers/specs/sle/parser.py, the call is:

{{ parser_version_table("specs.sle.parser", "SPECSSLEParser") }}