Bring your own data¶

This tutorial will guide you through writing a pynxtools reader plugin for the data from your specific experiment.

What should you should know before this tutorial?¶

This is a direct follow-up to the tutorial on building your own pynxtools reader, see Building your first pynxtools reader. If you are unfamiliar with pynxtools, you should go through that tutorial first.

What you will know at the end of this tutorial?¶

You will know

how to set up a reader fpr the data from your specific experiment
how to validate your very own NeXus file
how to upload your NeXus file to NOMAD

About this tutorial¶

Duration: ~3 hours (self-paced)

Goal: Apply the MultiFormatReader pattern to your own instrument data and produce a validated NeXus file.

What to bring: one or more data files from your instrument.

The core idea — always the same three steps¶

No matter what format your data comes in, you always do the same three things:

Step A  Read your file(s) into flat Python dicts on self
Step B  Write a config file that maps dict keys → NeXus paths
Step C  Run the converter and fix validation errors

The only part that changes between techniques and formats is Step A — the reading logic. Steps B and C are identical to Day 1.

Step 1 — Know your format (~30 min)¶

Before writing any reader code, understand what you are working with.

Identify your format¶

Format	Typical extensions	How to recognize
HDF5 / NeXus	`.h5`, `.hdf5`, `.nxs`	Binary; starts with `\x89HDF`
HDF5 (instrument brand)	`.h5m`, `.hsp`, `.he5`, …	Same magic bytes; vendor-specific internal layout
VAMAS	`.vms`, `.vamas`	First line: `VAMAS Surface Chemical Analysis`
Igor Pro wave	`.ibw`	Binary with `IGOR` header
CSV / TSV	`.csv`, `.txt`, `.dat`, `.asc`	Human-readable columns
JSON	`.json`	`{` or `[` as first non-whitespace character
YAML	`.yaml`, `.yml`	Key-value pairs with indentation
NetCDF	`.nc`, `.cdf`, `.netcdf`	Binary; readable with `netCDF4` or `xarray`
TIFF (detector images)	`.tiff`, `.tif`	Binary image; use `tifffile` or `PIL`

Step 2 — Read your file into a flat dict (~45 min)¶

Pick the section below that matches your format. The goal in every case: populate self.data (or self.hdf5_data) with a flat dict of "path/to/quantity" → value.

Format: HDF5 (any vendor)¶

This is the same recursive reader from Day 1. It works for any HDF5 file — vendor-specific layouts, NeXus files, everything.

import h5py

def handle_hdf5_file(self, file_path: str) -> dict[str, Any]:
    def _recurse(group, path=""):
        result = {}
        for key, item in group.items():
            full = f"{path}/{key}" if path else key
            if isinstance(item, h5py.Group):
                result.update(_recurse(item, full))
            elif isinstance(item, h5py.Dataset):
                result[full] = item[()]
        # also capture group-level attributes
        for attr_key, attr_val in group.attrs.items():
            result[f"{path}/@{attr_key}" if path else f"@{attr_key}"] = attr_val
        return result

    with h5py.File(file_path, "r") as hdf:
        self.hdf5_data = _recurse(hdf)
    return {}

After running, print the keys:

r.handle_hdf5_file("your_file.h5")
for k in sorted(r.hdf5_data):
    print(k)

Map what you see to what NXsimple (or your application definition) needs.

Format: CSV / TSV / columnar text¶

import numpy as np

def handle_csv_file(self, file_path: str) -> dict[str, Any]:
    # Adjust delimiter, skiprows, and encoding for your file
    data = np.genfromtxt(
        file_path,
        delimiter=",",    # "\t" for TSV, None for whitespace
        names=True,       # use first row as column names
        encoding="utf-8",
        skip_header=0,
    )
    self.data = {name: data[name] for name in data.dtype.names}
    return {}

Or with pandas (more flexible for messy headers):

import pandas as pd

def handle_csv_file(self, file_path: str) -> dict[str, Any]:
    # Read metadata lines before the data block (if any)
    meta = {}
    data_start = 0
    with open(file_path) as f:
        for i, line in enumerate(f):
            if line.startswith("#"):
                key, _, value = line[1:].partition("=")
                meta[key.strip()] = value.strip()
            else:
                data_start = i
                break

    df = pd.read_csv(file_path, skiprows=data_start, comment="#")
    self.data = {col: df[col].to_numpy() for col in df.columns}
    self.data.update(meta)   # metadata from header lines
    return {}

Format: VAMAS (`.vms`)¶

VAMAS is a common format for XPS and other surface science data.

def handle_vamas_file(self, file_path: str) -> dict[str, Any]:
    try:
        from vamas import Vamas
    except ImportError:
        raise ImportError("pip install vamas")

    vms = Vamas(file_path)
    block = vms.blocks[0]   # first spectrum; iterate for multiple

    self.data = {
        "kinetic_energy":     block.x,
        "intensity":          block.y,
        "source_energy":      block.source_energy,
        "pass_energy":        block.analyser_pass_energy,
        "dwell_time":         block.signal_collection_time,
        "sample_id":          block.sample_id,
        "technique":          block.technique,
        "comment":            block.comment,
    }
    return {}

If you have multiple blocks (spectra), store them as a list and loop in get_entry_names / get_data.

Format: Igor Pro IBW (`.ibw`)¶

import igor2.igorpy as igor

def handle_ibw_file(self, file_path: str) -> dict[str, Any]:
    wave = igor.load(file_path)
    self.data = {
        "data": wave.data,
        "note": wave.notes.decode() if wave.notes else "",
    }
    # axis scaling
    for dim, (offset, delta) in enumerate(zip(wave.sfA, wave.sfB)):
        n = wave.data.shape[dim]
        axis = offset + delta * np.arange(n)
        self.data[f"axis_{dim}"] = axis
    return {}

Or for JSON note format:

import igor2.igorpy as igor
import json

def handle_ibw_file(self, file_path: str) -> dict[str, Any]:
    wave = igor.load(file_path)
    self.data = {"signal": wave.data}
    try:
        meta = json.loads(wave.notes.decode())
        for k, v in meta.items():
            self.data[f"meta/{k}"] = v
    except (json.JSONDecodeError, AttributeError):
        pass
    return {}

¶

Format: NetCDF (`.nc`)¶

import xarray as xr

def handle_netcdf_file(self, file_path: str) -> dict[str, Any]:
    ds = xr.open_dataset(file_path)
    self.data = {}
    for var in ds.data_vars:
        self.data[var] = ds[var].values
    for coord in ds.coords:
        self.data[f"axis/{coord}"] = ds.coords[coord].values
    for attr_key, attr_val in ds.attrs.items():
        self.data[f"attrs/{attr_key}"] = attr_val
    return {}

Format: TIFF / detector images¶

 import tifffile

def handle_tiff_file(self, file_path: str) -> dict[str, Any]:
    with tifffile.TiffFile(file_path) as tif:
        data = tif.asarray()       # shape: (frames, height, width) or (H, W)
        meta = {}
        if tif.is_imagej:
            meta = tif.imagej_metadata or {}
        elif tif.pages[0].tags:
            for tag in tif.pages[0].tags.values():
                meta[tag.name] = tag.value

    self.data = {"detector/image": data}
    self.data.update({f"meta/{k}": v for k, v in meta.items()})
    return {}

Format: Plain JSON / YAML¶

import json, yaml   # yaml: pip install pyyaml

def handle_json_file(self, file_path: str) -> dict[str, Any]:
    with open(file_path) as f:
        raw = json.load(f)
    self.data = self._flatten(raw)
    return {}

def handle_yaml_data_file(self, file_path: str) -> dict[str, Any]:
    with open(file_path) as f:
        raw = yaml.safe_load(f)
    self.data = self._flatten(raw)
    return {}

def _flatten(self, d: dict, parent: str = "") -> dict:
    """Recursively flatten a nested dict into slash-separated keys."""
    result = {}
    for k, v in d.items():
        full = f"{parent}/{k}" if parent else k
        if isinstance(v, dict):
            result.update(self._flatten(v, full))
        else:
            result[full] = v
    return result

Format: anything else — the fallback pattern¶

When nothing above fits, write a minimal parser that extracts the values you need and stores them in self.data:

def handle_my_format(self, file_path: str) -> dict[str, Any]:
    self.data = {}

    with open(file_path, "rb") as f:   # or "r" for text
        raw = f.read()

    # --- parse raw bytes / text here ---
    # e.g. use struct, regex, or your vendor's SDK
    # ---

    # Store whatever you extract:
    self.data["signal"] = ...
    self.data["energy_axis"] = ...
    self.data["sample_name"] = ...

    return {}

Then add the extension to self.extensions in __init__:

self.extensions[".myext"] = self.handle_my_format

Step 3 — Adapt the callbacks (~20 min)¶

Once self.data is populated the callbacks are trivial.

If you used a single self.data dict (not self.hdf5_data), update the callback methods:

def get_attr(self, key: str, path: str) -> Any:
    if self.data is None:
        return None
    value = self.data.get(path)
    # decode byte strings if needed
    if isinstance(value, bytes):
        return value.decode()
    return value

def get_eln_data(self, key: str, path: str) -> Any:
    if self.eln_data is None:
        return None
    return self.eln_data.get(key)

def get_data(self, key: str, path: str) -> Any:
    if self.data is None:
        return None
    value = self.data.get(path)
    if value is None:
        logger.warning(f"No data at path '{path}'.")
    return value

Step 4 — Find your application definition (~20 min)¶

Does one already exist?¶

Check the NeXus application definitions and installed pynxtools plugins:

Technique	application definition	`pynxtools` plugin
XPS	`NXxps`	`pynxtools-xps`
ARPES / multi-photon	`NXmpes`, `NXmpes_arpes`, `NXarpes`	`pynxtools-mpes`
Raman	`NXraman`	`pynxtools-raman`
Ellipsometry	`NXellipsometry`	`pynxtools-ellips`
Electron microscopy	`NXem`	`pynxtools-em`
Atom probe	`NXapm`	—
IXS / canSAS	`NXcanSAS`, `NXiqproc`	—
Generic simple	`NXsimple`	this workshop

Test whether the application definition is known:

dataconverter generate-template --nxdl NXmpes

No application definition? Write a minimal one.¶

See the tutorial on Writing your first application definition.

Minimal skeleton:

NXmytechnique.nxdl.xml

<?xml version="1.0" encoding="UTF-8"?>
<definition xmlns="http://definition.nexusformat.org/nxdl/3.1"
            category="application" name="NXmytechnique"
            extends="NXobject" type="group">
    <doc>Application definition for my technique.</doc>
    <group type="NXentry">
        <field name="title"/>
        <field name="definition">
            <enumeration><item value="NXmytechnique"/></enumeration>
        </field>
        <group type="NXinstrument">
            <field name="name"/>
        </group>
        <group name="sample" type="NXsample">
            <field name="name"/>
        </group>
        <group name="data" type="NXdata"/>
    </group>
</definition>

Add it to point pynxtools in the contributed_definitions folder.

Step 5 — Write the config file (~40 min)¶

Mapping checklist¶

dataconverter generate-template --nxdl <YOUR_NXDL> > template.txt

Work through the output line by line. For each path, fill in the config JSON with the right @-prefix:

Where is the value?	Config value
`self.data["some/key"]`	`"@attrs:some/key"`
`self.eln_data["/ENTRY[entry]/..."]`	`"@eln"`
`self.data["signal_array"]`	`"@data:signal_array"`
Always the same literal	`"fixed string"` or `42`
Derived in `post_process`	`"@attrs:derived/my_value"`

Handling missing data gracefully¶

Prefix a config key with ! if the whole parent group should be dropped when the value is absent:

{
  "/ENTRY/INSTRUMENT[instrument]/DETECTOR[detector]": {
    "!count_time": "@attrs:detector/count_time",
    "count_time/@units": "@attrs:detector/count_time_units"
  }
}

If count_time returns None, the entire DETECTOR[detector] group is silently removed from the output instead of causing a validation error.

Unit fields¶

NeXus requires units for every numeric field. Options:

{ "/ENTRY/data/energy/@units": "eV" }                          // hard-coded
{ "/ENTRY/data/energy/@units": "@attrs:data/energy_units" }    // from file

Step 6 — Convert, validate, iterate (~20 min)¶

dataconverter \
    your_file.ext \
    eln_data.yaml \
    config_file.json \
    --reader <your-reader> \
    --nxdl <YOUR_NXDL> \
    --output output.nxs

Read the output messages:

ERROR — required field missing → add to config
WARNING — recommended field missing → add if possible
INFO — optional field missing → safe to skip

Inspect the result:

import h5py
with h5py.File("output.nxs", "r") as f:
    f.visititems(lambda n, o: print(n) if isinstance(o, h5py.Dataset) else None)

Repeat until no errors remain.

Advanced patterns¶

Multiple spectra / entries per file¶

def get_entry_names(self) -> list[str]:
    """Return one entry name per spectrum in the file."""
    if self.data is None:
        return ["entry"]
    return [f"spectrum_{i}" for i in range(len(self.data["spectra"]))]

Then use wildcard keys in the config with *:

{ "/ENTRY/data/*": "@data:signal" }

Unit conversion in a callback¶

def get_attr(self, key: str, path: str) -> Any:
    value = self.data.get(path) if self.data else None
    if value is None:
        return None
    # Example: convert Celsius to Kelvin for temperature fields
    if "temperature" in key and "units" not in key:
        return float(value) + 273.15
    if isinstance(value, bytes):
        return value.decode()
    return value

Derived quantities in `post_process`¶

def post_process(self) -> None:
    """Compute quantities that depend on multiple raw values."""
    if not self.data:
        return
    import numpy as np
    x = self.data.get("energy")
    y = self.data.get("counts")
    if x is not None and y is not None:
        peak_idx = np.argmax(y)
        self.data["derived/peak_energy"] = x[peak_idx]
        self.data["derived/peak_energy_units"] = b"eV"

Common errors and fixes¶

Error / symptom	Cause	Fix
`ModuleNotFoundError: <vendor lib>`	Library not installed	`pip install <library>` — see table above
`KeyError: 'some/path'` in callback	Path missing from `self.data`	Print `sorted(self.data.keys())` to find the right key
Required field missing in output	Config doesn't map it	Add the path to config file
`bytes` in output string field	h5py byte string	Add `.decode()` in the callback
Numeric value has wrong magnitude	Unit mismatch	Apply conversion in the callback
All `get_eln_data` return `None`	Wrong CONVERT_DICT	Print `self.eln_data.keys()` vs `key` argument
File opens but data looks wrong	Wrong dataset path	Print the full `self.data` dict and re-map
`struct.error` / garbage in binary file	Wrong offset or dtype	Check vendor documentation for byte layout
Validation passes but file looks incomplete	application definition has no required fields	Add required fields to the NXDL

Checklist before you leave¶

[ ] dataconverter runs without errors on your own data
[ ] All required fields in the application definition are present in output.nxs
[ ] Units are set for every numeric field
[ ] reader.py and config_file.json are committed to your repository
[ ] You know which application definition matches your technique (or have written a minimal one)