Bring your own data¶
This tutorial will guide you through writing a pynxtools reader plugin for the data from your specific experiment.
What should you know before this tutorial?¶
- This is a direct follow-up to the tutorial on building your own pynxtools reader, see Building your first pynxtools reader. If you are unfamiliar with pynxtools, you should go through that tutorial first.
What will you know at the end of this tutorial?¶
You will know
- how to set up a reader for the data from your specific experiment
- how to validate your very own NeXus file
- how to upload your NeXus file to NOMAD
About this tutorial¶
Duration: ~3 hours (self-paced)
Goal: Apply the MultiFormatReader pattern to your own instrument data and produce a validated NeXus file.
What to bring: one or more data files from your instrument.
The core idea — always the same three steps¶
No matter what format your data comes in, you always do the same three things:
Step A Read your file(s) into flat Python dicts on self
Step B Write a config file that maps dict keys → NeXus paths
Step C Run the converter and fix validation errors
The only part that changes between techniques and formats is Step A — the reading logic. Steps B and C are identical to Day 1.
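As a rough illustration of how the three steps fit together, the toy sketch below resolves a config mapping against a flat dict. The `resolve_config` helper, the example keys, and the simplified handling of the `@attrs:` and `@data:` prefixes are assumptions made for illustration; in a real reader pynxtools performs this resolution for you.

```python
from typing import Any


def resolve_config(config: dict[str, Any], data: dict[str, Any]) -> dict[str, Any]:
    """Toy resolver: replace '@attrs:key' / '@data:key' values with entries from data."""
    resolved = {}
    for nexus_path, value in config.items():
        if isinstance(value, str) and value.startswith(("@attrs:", "@data:")):
            _, _, key = value.partition(":")
            if key in data:
                resolved[nexus_path] = data[key]
        else:
            # Anything without a known prefix is treated as a literal value
            resolved[nexus_path] = value
    return resolved


# Step A: the reader produced a flat dict (hypothetical keys)
flat_data = {"sample/name": "Au(111)", "signal": [1, 4, 2]}

# Step B: the config maps NeXus paths to dict keys
config = {
    "/ENTRY/SAMPLE[sample]/name": "@attrs:sample/name",
    "/ENTRY/data/intensity": "@data:signal",
    "/ENTRY/data/intensity/@units": "counts",
}

# Step C (in spirit): inspect the result and look for what is still missing
template = resolve_config(config, flat_data)
for path, value in template.items():
    print(path, "->", value)
```

Only `flat_data` depends on your file format; the config and the inspection loop stay the same for every technique.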
Step 1 — Know your format (~30 min)¶
Before writing any reader code, understand what you are working with.
Identify your format¶
| Format | Typical extensions | How to recognize |
|---|---|---|
| HDF5 / NeXus | .h5, .hdf5, .nxs | Binary; starts with \x89HDF |
| HDF5 (instrument brand) | .h5m, .hsp, .he5, … | Same magic bytes; vendor-specific internal layout |
| VAMAS | .vms, .vamas | First line: VAMAS Surface Chemical Analysis |
| Igor Pro wave | .ibw | Binary with IGOR header |
| CSV / TSV | .csv, .txt, .dat, .asc | Human-readable columns |
| JSON | .json | { or [ as first non-whitespace character |
| YAML | .yaml, .yml | Key-value pairs with indentation |
| NetCDF | .nc, .cdf, .netcdf | Binary; readable with netCDF4 or xarray |
| TIFF (detector images) | .tiff, .tif | Binary image; use tifffile or PIL |
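If the extension is ambiguous, a look at the first bytes often settles the question. The sketch below is a heuristic helper, not part of pynxtools; the name `guess_format` and the returned labels are made up for illustration, and it only covers a few of the formats in the table.

```python
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")  # little-endian and big-endian TIFF


def guess_format(file_path: str) -> str:
    """Guess a file format from its first bytes (heuristic, not exhaustive)."""
    with open(file_path, "rb") as f:
        head = f.read(64)
    if head.startswith(HDF5_MAGIC):
        return "hdf5"  # also matches NeXus and NetCDF-4, which are HDF5 inside
    if head.startswith(TIFF_MAGICS):
        return "tiff"
    if head.startswith(b"CDF"):
        return "netcdf-classic"
    text = head.decode("utf-8", errors="replace").lstrip()
    if text.startswith("VAMAS Surface Chemical Analysis"):
        return "vamas"
    if text.startswith(("{", "[")):
        return "json"
    return "unknown"
```

A result of "unknown" usually means columnar text or a vendor binary format, in which case opening the file in a text or hex editor is the next step.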
Step 2 — Read your file into a flat dict (~45 min)¶
Pick the section below that matches your format.
The goal in every case: populate self.data (or self.hdf5_data) with a flat
dict of "path/to/quantity" → value.
Format: HDF5 (any vendor)¶
This is the same recursive reader from Day 1. It works for any HDF5 file — vendor-specific layouts, NeXus files, everything.
    import h5py

    def handle_hdf5_file(self, file_path: str) -> dict[str, Any]:
        def _recurse(group, path=""):
            result = {}
            for key, item in group.items():
                full = f"{path}/{key}" if path else key
                if isinstance(item, h5py.Group):
                    result.update(_recurse(item, full))
                elif isinstance(item, h5py.Dataset):
                    result[full] = item[()]
            # also capture group-level attributes
            for attr_key, attr_val in group.attrs.items():
                result[f"{path}/@{attr_key}" if path else f"@{attr_key}"] = attr_val
            return result

        with h5py.File(file_path, "r") as hdf:
            self.hdf5_data = _recurse(hdf)
        return {}
After running, print the keys:
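A minimal sketch of such a printout, assuming the flat dict has already been filled. The helper `describe_flat_dict` and the example keys are hypothetical; in your reader you would pass `self.hdf5_data` instead of `example`.

```python
from typing import Any


def describe_flat_dict(flat: dict[str, Any]) -> list[str]:
    """Return one summary line per key: type plus shape (arrays) or value (scalars)."""
    lines = []
    for key in sorted(flat):
        value = flat[key]
        shape = getattr(value, "shape", None)
        detail = f"shape={shape}" if shape else repr(value)
        lines.append(f"{key}: {type(value).__name__}, {detail}")
    return lines


example = {"entry/data/counts": [3, 5, 8], "entry/@default": "data"}
for line in describe_flat_dict(example):
    print(line)
```

Printing the shape instead of the full array keeps the listing readable when the file contains large datasets.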
Map what you see to what NXsimple (or your application definition) needs.
Format: CSV / TSV / columnar text¶
    import numpy as np

    def handle_csv_file(self, file_path: str) -> dict[str, Any]:
        # Adjust delimiter, skip_header, and encoding for your file
        data = np.genfromtxt(
            file_path,
            delimiter=",",  # "\t" for TSV, None for whitespace
            names=True,  # use first row as column names
            encoding="utf-8",
            skip_header=0,
        )
        self.data = {name: data[name] for name in data.dtype.names}
        return {}
Or with pandas (more flexible for messy headers):
    import pandas as pd

    def handle_csv_file(self, file_path: str) -> dict[str, Any]:
        # Read metadata lines before the data block (if any)
        meta = {}
        data_start = 0
        with open(file_path) as f:
            for i, line in enumerate(f):
                if line.startswith("#"):
                    key, _, value = line[1:].partition("=")
                    meta[key.strip()] = value.strip()
                else:
                    data_start = i
                    break
        df = pd.read_csv(file_path, skiprows=data_start, comment="#")
        self.data = {col: df[col].to_numpy() for col in df.columns}
        self.data.update(meta)  # metadata from header lines
        return {}
Format: VAMAS (.vms)¶
VAMAS is a common format for XPS and other surface science data.
    def handle_vamas_file(self, file_path: str) -> dict[str, Any]:
        try:
            from vamas import Vamas
        except ImportError as err:
            # Keep the original traceback and tell the user how to fix it
            raise ImportError("The vamas package is required: pip install vamas") from err
        vms = Vamas(file_path)
        block = vms.blocks[0]  # first spectrum; iterate for multiple
        self.data = {
            "kinetic_energy": block.x,
            "intensity": block.y,
            "source_energy": block.source_energy,
            "pass_energy": block.analyser_pass_energy,
            "dwell_time": block.signal_collection_time,
            "sample_id": block.sample_id,
            "technique": block.technique,
            "comment": block.comment,
        }
        return {}
If you have multiple blocks (spectra), store them as a list and loop in
get_entry_names / get_data.
Format: Igor Pro IBW (.ibw)¶
    import igor2.igorpy as igor
    import numpy as np

    def handle_ibw_file(self, file_path: str) -> dict[str, Any]:
        wave = igor.load(file_path)
        self.data = {
            "data": wave.data,
            "note": wave.notes.decode() if wave.notes else "",
        }
        # axis scaling: Igor defines x = sfA * index + sfB (sfA = step, sfB = offset)
        for dim in range(wave.data.ndim):
            n = wave.data.shape[dim]
            axis = wave.sfB[dim] + wave.sfA[dim] * np.arange(n)
            self.data[f"axis_{dim}"] = axis
        return {}
Or, if the wave note contains JSON-formatted metadata:

    import igor2.igorpy as igor
    import json

    def handle_ibw_file(self, file_path: str) -> dict[str, Any]:
        wave = igor.load(file_path)
        self.data = {"signal": wave.data}
        try:
            meta = json.loads(wave.notes.decode())
            for k, v in meta.items():
                self.data[f"meta/{k}"] = v
        except (json.JSONDecodeError, AttributeError):
            # note is empty or not valid JSON: keep only the signal
            pass
        return {}
Format: NetCDF (.nc)¶
    import xarray as xr

    def handle_netcdf_file(self, file_path: str) -> dict[str, Any]:
        self.data = {}
        # The context manager closes the file once the values are loaded
        with xr.open_dataset(file_path) as ds:
            for var in ds.data_vars:
                self.data[var] = ds[var].values
            for coord in ds.coords:
                self.data[f"axis/{coord}"] = ds.coords[coord].values
            for attr_key, attr_val in ds.attrs.items():
                self.data[f"attrs/{attr_key}"] = attr_val
        return {}
Format: TIFF / detector images¶
    import tifffile

    def handle_tiff_file(self, file_path: str) -> dict[str, Any]:
        with tifffile.TiffFile(file_path) as tif:
            data = tif.asarray()  # shape: (frames, height, width) or (H, W)
            meta = {}
            if tif.is_imagej:
                meta = tif.imagej_metadata or {}
            elif tif.pages[0].tags:
                for tag in tif.pages[0].tags.values():
                    meta[tag.name] = tag.value
        self.data = {"detector/image": data}
        self.data.update({f"meta/{k}": v for k, v in meta.items()})
        return {}
Format: Plain JSON / YAML¶
    import json

    import yaml  # pip install pyyaml

    def handle_json_file(self, file_path: str) -> dict[str, Any]:
        with open(file_path) as f:
            raw = json.load(f)
        self.data = self._flatten(raw)
        return {}

    def handle_yaml_data_file(self, file_path: str) -> dict[str, Any]:
        with open(file_path) as f:
            raw = yaml.safe_load(f)
        self.data = self._flatten(raw)
        return {}

    def _flatten(self, d: dict, parent: str = "") -> dict:
        """Recursively flatten a nested dict into slash-separated keys."""
        result = {}
        for k, v in d.items():
            full = f"{parent}/{k}" if parent else k
            if isinstance(v, dict):
                result.update(self._flatten(v, full))
            else:
                result[full] = v
        return result
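To see what the flattening produces, here is a standalone version of the same logic applied to a small nested dict. The function name `flatten` and the sample metadata are made up for illustration.

```python
def flatten(d: dict, parent: str = "") -> dict:
    """Recursively flatten a nested dict into slash-separated keys."""
    result = {}
    for k, v in d.items():
        full = f"{parent}/{k}" if parent else str(k)
        if isinstance(v, dict):
            result.update(flatten(v, full))
        else:
            result[full] = v
    return result


nested = {
    "sample": {"name": "Si wafer", "temperature": {"value": 295.0, "units": "K"}},
    "operator": "A. Nonymous",
}
flat = flatten(nested)
for key, value in flat.items():
    print(key, "=", value)
```

Lists are kept as values and not expanded, and a file whose top level is a JSON array rather than an object would need separate handling before it reaches this function.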
Format: anything else — the fallback pattern¶
When nothing above fits, write a minimal parser that extracts the values you
need and stores them in self.data:
    def handle_my_format(self, file_path: str) -> dict[str, Any]:
        self.data = {}
        with open(file_path, "rb") as f:  # or "r" for text
            raw = f.read()
        # --- parse raw bytes / text here ---
        # e.g. use struct, regex, or your vendor's SDK
        # ---
        # Store whatever you extract:
        self.data["signal"] = ...
        self.data["energy_axis"] = ...
        self.data["sample_name"] = ...
        return {}
Then add the extension to self.extensions in __init__:
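A minimal sketch of that registration, assuming the reader keeps a dict from file extension to handler method as in the first tutorial. `MyReader` is a stand-in written without the real base class so that the example runs on its own; `.myext` and the stored key are placeholders.

```python
from typing import Any, Callable


class MyReader:
    """Stand-in for a MultiFormatReader subclass (no pynxtools import needed here)."""

    def __init__(self) -> None:
        self.data: dict[str, Any] = {}
        # Map each supported file extension to the method that parses it
        self.extensions: dict[str, Callable[[str], dict[str, Any]]] = {
            ".myext": self.handle_my_format,
        }

    def handle_my_format(self, file_path: str) -> dict[str, Any]:
        # Placeholder parsing: a real handler would read and parse the file
        self.data = {"source_file": file_path}
        return {}


reader = MyReader()
handler = reader.extensions[".myext"]
handler("measurement.myext")
print(reader.data)
```

In your actual reader, keep the entries that already exist in `self.extensions` and add the new one next to them.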
Step 3 — Adapt the callbacks (~20 min)¶
Once self.data is populated the callbacks are trivial.
If you used a single self.data dict (not self.hdf5_data), update the
callback methods:
    import logging

    logger = logging.getLogger(__name__)

    def get_attr(self, key: str, path: str) -> Any:
        if self.data is None:
            return None
        value = self.data.get(path)
        # decode byte strings if needed
        if isinstance(value, bytes):
            return value.decode()
        return value

    def get_eln_data(self, key: str, path: str) -> Any:
        if self.eln_data is None:
            return None
        return self.eln_data.get(key)

    def get_data(self, key: str, path: str) -> Any:
        if self.data is None:
            return None
        value = self.data.get(path)
        if value is None:
            logger.warning(f"No data at path '{path}'.")
        return value
Step 4 — Find your application definition (~20 min)¶
Does one already exist?¶
Check the NeXus application definitions and installed pynxtools plugins:
| Technique | application definition | pynxtools plugin |
|---|---|---|
| XPS | NXxps | pynxtools-xps |
| ARPES / multi-photon | NXmpes, NXmpes_arpes, NXarpes | pynxtools-mpes |
| Raman | NXraman | pynxtools-raman |
| Ellipsometry | NXellipsometry | pynxtools-ellips |
| Electron microscopy | NXem | pynxtools-em |
| Atom probe | NXapm | - |
| IXS / canSAS | NXcanSAS, NXiqproc | - |
| Generic simple | NXsimple | this workshop |
Test whether the application definition is known:
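One simple way to check, sketched here under the assumption that you have a local copy of the NeXus definitions with its usual `applications`, `contributed_definitions`, and `base_classes` folders, is to look for the matching `.nxdl.xml` file. The `find_nxdl` helper and the path passed to it are hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_nxdl(definitions_root: str, name: str) -> Optional[Path]:
    """Return the path of NAME.nxdl.xml if it exists in a known subfolder."""
    for subfolder in ("applications", "contributed_definitions", "base_classes"):
        candidate = Path(definitions_root) / subfolder / f"{name}.nxdl.xml"
        if candidate.is_file():
            return candidate
    return None


# Replace the first argument with the location of your definitions folder
print(find_nxdl("path/to/definitions", "NXmytechnique"))
```

A result of `None` means no file with that name was found in the folders searched, which is the situation the next section addresses.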
No application definition? Write a minimal one.¶
See the tutorial on Writing your first application definition.
Minimal skeleton:
    <?xml version="1.0" encoding="UTF-8"?>
    <definition xmlns="http://definition.nexusformat.org/nxdl/3.1"
        category="application" name="NXmytechnique"
        extends="NXobject" type="group">
      <doc>Application definition for my technique.</doc>
      <group type="NXentry">
        <field name="title"/>
        <field name="definition">
          <enumeration><item value="NXmytechnique"/></enumeration>
        </field>
        <group type="NXinstrument">
          <field name="name"/>
        </group>
        <group name="sample" type="NXsample">
          <field name="name"/>
        </group>
        <group name="data" type="NXdata"/>
      </group>
    </definition>
Save the file in the contributed_definitions folder of the NeXus definitions so that pynxtools can find it.
Step 5 — Write the config file (~40 min)¶
Mapping checklist¶
Work through the output line by line. For each path, fill in the
config JSON with the right @-prefix:
| Where is the value? | Config value |
|---|---|
| self.data["some/key"] | "@attrs:some/key" |
| self.eln_data["/ENTRY[entry]/..."] | "@eln" |
| self.data["signal_array"] | "@data:signal_array" |
| Always the same literal | "fixed string" or 42 |
| Derived in post_process | "@attrs:derived/my_value" |
Handling missing data gracefully¶
Prefix a config key with ! if the whole parent group should be dropped
when the value is absent:
    {
      "/ENTRY/INSTRUMENT[instrument]/DETECTOR[detector]": {
        "!count_time": "@attrs:detector/count_time",
        "count_time/@units": "@attrs:detector/count_time_units"
      }
    }
If count_time returns None, the entire DETECTOR[detector] group is
silently removed from the output instead of causing a validation error.
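The effect can be pictured with the following toy function, which mimics the described behaviour on a plain dict of already resolved values. It illustrates the semantics only and is not how pynxtools implements them; `resolve_group` is a hypothetical name.

```python
from typing import Any, Optional


def resolve_group(group: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the group without '!' markers, or None if a '!'-marked value is None."""
    resolved = {}
    for key, value in group.items():
        if key.startswith("!"):
            if value is None:
                return None  # marked value is missing: drop the whole group
            key = key[1:]
        if value is not None:
            resolved[key] = value
    return resolved


with_value = resolve_group({"!count_time": 0.5, "count_time/@units": "s"})
without_value = resolve_group({"!count_time": None, "count_time/@units": "s"})
print(with_value)
print(without_value)
```

In the second call the units entry alone would not make a meaningful detector group, which is why the group is dropped as a whole.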
Unit fields¶
NeXus requires units for every numeric field. Options:
{ "/ENTRY/data/energy/@units": "eV" } // hard-coded
{ "/ENTRY/data/energy/@units": "@attrs:data/energy_units" } // from file
Step 6 — Convert, validate, iterate (~20 min)¶
    dataconverter \
      your_file.ext \
      eln_data.yaml \
      config_file.json \
      --reader <your-reader> \
      --nxdl <YOUR_NXDL> \
      --output output.nxs
Read the output messages:
- ERROR — required field missing → add to config
- WARNING — recommended field missing → add if possible
- INFO — optional field missing → safe to skip
Inspect the result:
    import h5py

    with h5py.File("output.nxs", "r") as f:
        # print the path of every dataset in the file
        f.visititems(lambda n, o: print(n) if isinstance(o, h5py.Dataset) else None)
Repeat until no errors remain.
Advanced patterns¶
Multiple spectra / entries per file¶
    def get_entry_names(self) -> list[str]:
        """Return one entry name per spectrum in the file."""
        if self.data is None:
            return ["entry"]
        return [f"spectrum_{i}" for i in range(len(self.data["spectra"]))]
Then use wildcard keys in the config with *:
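The exact wildcard syntax is defined in the pynxtools documentation. As a rough mental model only, the sketch below assumes that `*` in a config value is replaced by the current entry name before the lookup; `expand_wildcard` and the key layout are hypothetical.

```python
def expand_wildcard(config_value: str, entry_names: list[str]) -> dict[str, str]:
    """Return the concrete lookup key for each entry name."""
    return {name: config_value.replace("*", name) for name in entry_names}


entry_names = ["spectrum_0", "spectrum_1"]
keys = expand_wildcard("@data:*/intensity", entry_names)
for name, key in keys.items():
    print(name, "->", key)
```

With this picture in mind, storing each spectrum under a key that starts with its entry name lets a single config line serve all entries.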
Unit conversion in a callback¶
    def get_attr(self, key: str, path: str) -> Any:
        value = self.data.get(path) if self.data else None
        if value is None:
            return None
        # Example: convert Celsius to Kelvin for temperature fields
        if "temperature" in key and "units" not in key:
            return float(value) + 273.15
        if isinstance(value, bytes):
            return value.decode()
        return value
Derived quantities in post_process¶
    def post_process(self) -> None:
        """Compute quantities that depend on multiple raw values."""
        if not self.data:
            return
        import numpy as np

        x = self.data.get("energy")
        y = self.data.get("counts")
        if x is not None and y is not None:
            peak_idx = np.argmax(y)
            self.data["derived/peak_energy"] = x[peak_idx]
            self.data["derived/peak_energy_units"] = b"eV"
Common errors and fixes¶
| Error / symptom | Cause | Fix |
|---|---|---|
| ModuleNotFoundError: <vendor lib> | Library not installed | pip install <library>; see table above |
| KeyError: 'some/path' in callback | Path missing from self.data | Print sorted(self.data.keys()) to find the right key |
| Required field missing in output | Config doesn't map it | Add the path to config file |
| bytes in output string field | h5py byte string | Add .decode() in the callback |
| Numeric value has wrong magnitude | Unit mismatch | Apply conversion in the callback |
| All get_eln_data return None | Wrong CONVERT_DICT | Print self.eln_data.keys() vs key argument |
| File opens but data looks wrong | Wrong dataset path | Print the full self.data dict and re-map |
| struct.error / garbage in binary file | Wrong offset or dtype | Check vendor documentation for byte layout |
| Validation passes but file looks incomplete | application definition has no required fields | Add required fields to the NXDL |
Checklist before you leave¶
- [ ] dataconverter runs without errors on your own data
- [ ] All required fields in the application definition are present in output.nxs
- [ ] Units are set for every numeric field
- [ ] reader.py and config_file.json are committed to your repository
- [ ] You know which application definition matches your technique (or have written a minimal one)
Further reading¶
- How-to > Use the MultiFormatReader
- How-to > Build a plugin
- How-to > Test your reader
- Tutorial > Write your first application definition
- How-to > Write a NeXus application definition (../how-tos/nexus/write-an-application definition.md)
- Learn > The MultiFormatReader in depth
- Reference > FAIRmat-supported pynxtools plugins