Implementation design
Pynxtools-em addresses current challenges with the representation of information in electron microscopy by leading by example. Specifically, the tool implements functionality for ontology matching of instance data from proprietary representations to an open format and standard: NeXus and the NXem application definitions with their respective concepts. The examples that pynxtools-em currently supports are necessarily a compromise. This becomes clear when we consider how large the space of possible naming conventions, combinations of pieces of information, and formatting choices is through which the electron microscopy community could share knowledge. Existing differences in how information can technically be serialized (in terms of precision, language, encoding style, and constraints) make this combinatorial space of possible representations even larger.
The following design patterns guide our implementation:
- We do not consider our work complete from the often short-term, project-driven perspective; rather, we see standardization as a continuing community effort.
- We consider ontology matching a team effort that can only be achieved with technology partners and scientists working together.
- We acknowledge the efforts and key contributions that went into the development of file format reading libraries and data analysis libraries of the electron microscopy community. We had to start somewhere, and we did so with tools from the hyperspy ecosystem: rosettasciio, pyxem, and kikuchipy. We would be happy to work together with representatives of the many other great software packages within the community. We have realized that being able to read and write instance data in specific community file formats is necessary but not sufficient if electron microscopy data are eventually to fulfill the FAIR principles. What is also needed is that, for every such file format, the specification of the format is exchanged as well. It is this semantic documentation which is still often missing in electron microscopy. As a consequence, many formats, and hence many activities, still rely on reverse engineering of reading capabilities instead of benefiting from semantically interoperable, machine-actionable storage of electron microscopy information.
- Our work is open to suggestions by all members of the electron microscopy community.
- We have chosen specific tangible examples of (meta)data semantic mapping for specific methods that are used in electron microscopy. These examples have been implemented to explore two routes:
  - There are examples of parsing capabilities (for TIFF and PNG) that address rather technical aspects (e.g. how to read from such files and pick out the technology-partner-specific formatting of instance data). These examples must not be understood as providing parsing capabilities for arbitrary files of that MIME type. There are two practical reasons for this: first, limited manpower to implement all of this; second, limited availability of reliable documentation that defines the semantic content which specific file formats from technology partners encode. Both challenges can be solved: the first with support from the community, the second with support from technology partners.
  - There are examples of parsing for specific methods - for now Electron Backscatter Diffraction (EBSD) and Transmission Kikuchi Diffraction (TKD) - for which we have explored how a large number of different formats can be parsed and how that parsing can be made more general and robust than supporting just one prototypical example, as is currently common in research data management. We are aware that we do not parse everything but rather an exemplar subset that suffices to offer a comprehensive example of how key information, such as orientation maps, the region-of-interest analyzed, or descriptors derived from these analyses, can be harmonized. This selection was motivated by the aim to show that normalizing comprehensive and technically deep representations of electron microscopy data is beneficial at all.
Purpose and aim of pynxtools-em
We would like to provide context on the purpose and aim of pynxtools-em. The software implements a suggestion for how diverse (meta)data from the research field of electron microscopy can be parsed and normalized to enable users to compare information and knowledge precisely. This means that instance data need to be backed by semantics. Pynxtools-em maps instance data from different formats and concepts onto a proposal for a common semantic representation for information exchange via NeXus. The software achieves this through a two-step parsing process: first, by reading data from a technology-partner-specific or otherwise specific serialization and formatting; second, by applying transformations (if required) to map onto NeXus. One of the key motivations for the development of pynxtools and its plugins was to explore and show how pieces of information can be harmonized and matched to enable the development of data-centric software tools and services in research data management systems (RDMS). The key reason to place such code into plugins rather than into the RDMS source code itself is to promote reusability and to offer users stronger modularity and more possibilities for tailoring the RDMS. Maintaining a few codes instead of many, and a few semantically defined file formats instead of many, avoids duplicated development effort; this is an effective strategy when developer resources are finite, as they are in electron microscopy.
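The two-step process described above can be illustrated with a minimal sketch. It is not the pynxtools-em implementation: the function names, the vendor keys (`HV`, `Mag`, `Operator`), the assumption that `HV` is given in kilovolts, and the target paths are all hypothetical placeholders chosen for illustration, not verified NXem concept names.

```python
from typing import Any, Callable, Dict, Optional, Tuple


def read_vendor_metadata() -> Dict[str, Any]:
    """Step 1 (stand-in): pretend these key-value pairs were read from a vendor file.

    A real reader would open the file and decode the vendor-specific formatting.
    """
    return {"HV": "20.0", "Mag": "5000", "Operator": "n/a"}


# Step 2: mapping table from vendor key to (target path, optional transformation).
# Both the vendor keys and the target paths are hypothetical examples.
Transform = Optional[Callable[[Any], Any]]
MAPPING: Dict[str, Tuple[str, Transform]] = {
    # Assumption: the vendor stores the high voltage as a string in kilovolts.
    "HV": ("/entry/instrument/voltage_in_volt", lambda v: float(v) * 1.0e3),
    "Mag": ("/entry/instrument/magnification", float),
}


def map_to_target(src: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the mapping table; vendor keys without a mapping are skipped."""
    out: Dict[str, Any] = {}
    for key, (path, transform) in MAPPING.items():
        if key not in src:
            continue  # the vendor file did not provide this piece of information
        value = src[key]
        out[path] = transform(value) if transform is not None else value
    return out


normalized = map_to_target(read_vendor_metadata())
```

Keeping the mapping in a table that is separate from the reading code means that supporting another vendor format mostly requires a new reader and a new table, while the mapping logic itself can be reused.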
Software tools in electron microscopy - a mixture of proprietary and open-source solutions
Typically, users work with proprietary software from technology partners and frequently also use custom-written software (much of which nowadays has an open-source license). Proprietary software frequently offers lower usage barriers for end users and, in addition, offers specifically tailored access to, and capabilities for storing, instrument-specific (meta)data via a user interface that is optimized for efficient usage of the microscope. As a downside, proprietary software tools often (though not exclusively) write to proprietary serialization formats (files or database entries). These formats encode semantic concepts, but that knowledge is proprietary. The key challenge is that the content and meaning of these concepts are very often not documented publicly. Therefore, it is frequently necessary to convert between formats. When such conversions are performed ad hoc, substantial contextual information can get lost or become disconnected, which makes tracking of workflows in electron microscopy difficult.
Several proprietary software tools support the execution of script-based analyses. Scripting is also a key signature of the many software tools with an open-source development mindset and license. These offer an increasingly competitive alternative to proprietary software tools in electron microscopy. The combination of open source code, customizability, and their rooting in (and often sole reason for existing being) the exploration of cutting-edge prototypes of algorithms and ideas by the scientific community has made script-based software a reality in many electron microscopy labs - especially in the Python world. This warrants reflection on how using such software aligns with the aims of the FAIR principles of data stewardship.
Here, script-based analyses can be considered both a benefit and a burden when it comes to the FAIR principles. The flexibility of being able to script one's analysis is a clear benefit. However, it is a burden at the same time because of how such workflows are currently documented: typically, users share such scripts alongside data in some processed state plus the final (or close to final) figures that were generated for the publication. Hence, it is often just assumed that these scripts not only work for different versions of the execution environment (e.g. a different Matlab version) but also that users can obtain the same results, provided they run the scripts again on the data, if these are provided.
In practice, though, this often leaves room for substantial interpretation and ambiguity when there is neither a community-agreed standard for how to exchange information nor thorough documentation of the execution environment, and possibly also no documentation of how the serialized artifacts (files, database entries) were processed through the workflow. The practical challenge is not that no output files of such script executions are shared but that these are shared in a large variety of formats, many of which use ad hoc data schemas. This is a substantial burden from the perspective of ontology matching because pieces of information are encoded and named differently although they (to human experts) represent instances of similar, equivalent, or even exactly the same concepts. So far, it takes members of the electron microscopy community, often domain experts, to assure that data can safely be compared from a scientific point of view. It is this not yet fully implemented semantic mapping, rather than the ability to read or write a specific file format, which technically limits interoperable knowledge exchange in electron microscopy.