Write Your Own Suitcase¶
Scope of a Suitcase¶
Suitcases are Highly Specific¶
Suitcase translates documents generated by bluesky (or anything that follows its “event model” schema) into file formats.
Suitcase’s design philosophy is to make many well-tailored suitcases rather than try to fit a large range of functionality into one suitcase.
Each file format is implemented in a separate Python package, named suitcase-<format>. As support for new formats is added over time, there may
someday be hundreds of suitcase packages. This modular approach will keep the
number of dependencies manageable (no need to install heavy I/O libraries that
you don’t plan to use). It will also allow each suitcase to be updated and
released on its own schedule and maintained by the specific communities,
facilities, or users who care about a particular format.
Even “one suitcase per file format” is too broad. Some formats, such as HDF5, enable a huge variety of layouts—too many to configure via a reasonable number of parameters. Therefore, there will never be a “suitcase-hdf5” package, but rather multiple suitcases, each tuned to a specific HDF5 layout such as NeXus or Data Exchange.
Categories of Suitcases¶
The list of existing and planned suitcases groups them into three categories.
“One-offs” — These are tailored to one specific application, writing files to the requirements of a particular software program or user.
“Generics” — These write commonly-requested formats such as TIFF or CSV. There is often room for interpretation in how exactly to lay out the data into a given file format. (One TIFF file per detector? Per Event? Per exposure?) The design process can devolve into tricky judgment calls or a confusing array of options for the user. When in doubt, we encourage you to steer toward writing one or more “one-offs”.
“Backends” — These are less user-facing than the other two categories. They write into a format meant to be read back by a programmatic interface. For example, suitcase-mongo inserts documents into MongoDB.
Creating a New Suitcase Package¶
Write the Serializer¶
Before reading this section, read to the end of Usage.
All suitcase packages must contain a Serializer class with the interface outlined below. Each package should also contain an export() function. These should be in suitcase/my_special_format/__init__.py.
Here is a sketch of a Serializer:

import event_model
from pathlib import Path
import suitcase.utils


class Serializer(event_model.DocumentRouter):
    def __init__(self, directory, file_prefix='{uid}', **kwargs):
        self._file_prefix = file_prefix
        self._kwargs = kwargs
        self._templated_file_prefix = ''  # set when we get a 'start' document
        if isinstance(directory, (str, Path)):
            # The user has given us a filepath; they want files.
            # Set up a MultiFileManager for them.
            self._manager = suitcase.utils.MultiFileManager(directory)
        else:
            # The user has given us their own Manager instance. Use that.
            self._manager = directory
        # Finally, we usually need some state related to stashing file
        # handles/buffers. For a Serializer that only needs *one* file,
        # this may be:
        #
        # self._output_file = None
        #
        # For a Serializer that writes a separate file per stream:
        #
        # self._files = {}

    @property
    def artifacts(self):
        # The 'artifacts' are the manager's way of exposing to the user a
        # way to get at the resources that were created. For
        # `MultiFileManager`, the artifacts are filenames. For
        # `MemoryBuffersManager`, the artifacts are the buffer objects
        # themselves. The Serializer, in turn, exposes that to the user here.
        #
        # This must be a property, not a plain attribute, because the
        # manager's `artifacts` attribute is also a property, and we must
        # access it anew each time to be sure to get the latest contents.
        return self._manager.artifacts

    def close(self):
        self._manager.close()

    # These methods enable the Serializer to be used as a context manager:
    #
    # with Serializer(...) as serializer:
    #     ...
    #
    # which always calls close() on exit from the with block.

    def __enter__(self):
        return self

    def __exit__(self, *exception_details):
        self.close()

    # Each of the methods below corresponds to a document type. As
    # documents flow in through Serializer.__call__, the DocumentRouter
    # base class will forward them to the method with the name
    # corresponding to the document's type: RunStart documents go to the
    # 'start' method, etc.
    #
    # In each of these methods:
    #
    # - If needed, obtain a new file/buffer from the manager and stash it
    #   on instance state (self._files, etc.) if you will need it again
    #   later. Example:
    #
    #       filename = f'{self._templated_file_prefix}-primary.csv'
    #       file = self._manager.open('stream_data', filename, 'xt')
    #       self._files['primary'] = file
    #
    #   See the manager documentation below for more about the arguments
    #   to open().
    #
    # - Write data into the file, usually something like:
    #
    #       content = my_function(doc)
    #       file.write(content)
    #
    #   or
    #
    #       my_function(doc, file)

    def start(self, doc):
        # Fill in the file_prefix with the contents of the RunStart document.
        # As in, '{uid}' -> 'c1790369-e4b2-46c7-a294-7abfa239691a'
        # or 'my-data-from-{plan_name}' -> 'my-data-from-scan'
        self._templated_file_prefix = self._file_prefix.format(**doc)
        ...

    def descriptor(self, doc):
        ...

    def event_page(self, doc):
        # There are other representations of Event data -- 'event' and
        # 'bulk_events' (deprecated). But that does not concern us because
        # DocumentRouter will convert these representations to 'event_page'
        # and then route them through here.
        ...

    def stop(self, doc):
        ...
See the API Documentation below for more information about DocumentRouter and MultiFileManager.
Any of the existing suitcases may be useful as a reference. We recommend these in particular:
suitcase-csv is a good introductory example.
suitcase-jsonl generates a straightforward, single-file format.
suitcase-tiff generates many separate binary files.
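To make the overall pattern concrete, here is a standalone toy serializer that mimics the skeleton above. It does not use the real event_model or suitcase.utils packages, and its document shapes are simplified assumptions, but it shows the same flow: route documents by type, template the file prefix from the RunStart document, and expose artifacts at the end.

```python
import io


class ToySerializer:
    """Simplified stand-in for a suitcase Serializer (no real dependencies)."""

    def __init__(self, file_prefix='{uid}'):
        self._file_prefix = file_prefix
        self._templated_file_prefix = ''
        self._buffer = io.StringIO()  # in place of a manager-provided file

    def __call__(self, name, doc):
        # Route by document type, as DocumentRouter would.
        getattr(self, name, lambda doc: None)(doc)

    def start(self, doc):
        # Fill in the file_prefix from the RunStart document.
        self._templated_file_prefix = self._file_prefix.format(**doc)
        self._buffer.write(f"# run {self._templated_file_prefix}\n")

    def stop(self, doc):
        self._buffer.write("# end of run\n")

    @property
    def artifacts(self):
        # A real Serializer would return self._manager.artifacts here.
        return {'stream_data': [self._buffer.getvalue()]}


serializer = ToySerializer(file_prefix='run-{uid}')
serializer('start', {'uid': 'abc123'})
serializer('stop', {})
```

The toy collapses the manager into a single StringIO; the real skeleton delegates file creation to a manager so the same Serializer can write to disk or to memory.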
Note
Why not put the boilerplate code above into a base class, like BaseSerializer, and use inheritance? The amount of boilerplate is not large, and it may be easier to simply copy it than to cross-reference between a subclass and a base class. Additionally, the details vary enough from one Serializer to the next that inheritance tends to get messy.
Add an export function¶
This is just a simple wrapper around the Serializer. It takes a generator of (name, doc) pairs and pushes them through the Serializer.
def export(gen, directory, file_prefix='{uid}-', **kwargs):
    with Serializer(directory, file_prefix, **kwargs) as serializer:
        for item in gen:
            serializer(*item)
    return serializer.artifacts
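The contract is simply "iterate and call". A sketch with a hypothetical recording serializer (a stand-in for illustration, not a real suitcase class) shows the (name, doc) pairs flowing through and the context manager guaranteeing cleanup:

```python
class RecordingSerializer:
    """Hypothetical stand-in that records each (name, doc) call."""

    def __init__(self):
        self.received = []
        self.closed = False

    def __call__(self, name, doc):
        self.received.append((name, doc))

    def __enter__(self):
        return self

    def __exit__(self, *exception_details):
        # A real Serializer closes its files/buffers here.
        self.closed = True


def export(gen):
    # Same shape as a suitcase export(): push every pair through the
    # serializer, then hand the results back to the caller.
    with RecordingSerializer() as serializer:
        for item in gen:
            serializer(*item)
    return serializer


documents = [('start', {'uid': 'abc'}), ('stop', {'exit_status': 'success'})]
s = export(iter(documents))
```

Because the Serializer is used as a context manager, close() runs even if serialization raises partway through the document stream.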
Test the Serializer¶
The suitcase-utils package provides a parametrized pytest fixture, example_data, for generating test data. Tests should go in suitcase/my_special_format/tests/tests.py.
import json
from suitcase.my_special_format import export, NumpyEncoder


def test_export(tmp_path, example_data):
    # Exercise the exporter on the myriad cases parametrized in example_data.
    documents = example_data()
    artifacts = export(documents, tmp_path)
    # For extra credit, read back the data
    # and check that it looks right.
Run the tests with pytest:
pytest
API Documentation¶
The DocumentRouter class is typically useful as a base class for a Serializer.
- class event_model.DocumentRouter(*, emit=None)[source]¶
Route each document by type to a corresponding method.
When an instance is called with a document type and a document like:
router(name, doc)
the document is passed to the method of the corresponding name, as in:
getattr(router, name)(doc)
The method is expected to return None or a valid document of the same type. It may be the original instance (passed through), a copy, or a different dict altogether. Finally, the call to router(name, doc) returns: (name, getattr(router, name)(doc))
- Parameters
- emit : callable, optional
Expected signature: f(name, doc)
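The dispatch-and-return contract can be sketched in a few lines. This is a simplified illustration of the behavior described above, not the real event_model implementation:

```python
class MiniRouter:
    """Simplified sketch of DocumentRouter's dispatch contract."""

    def __call__(self, name, doc):
        # Look up the method named after the document type.
        output = getattr(self, name, self._default)(doc)
        # A method may return None, meaning "pass the original through".
        return (name, output if output is not None else doc)

    def _default(self, doc):
        return None

    def start(self, doc):
        # Returning a modified copy is allowed.
        return {**doc, 'seen': True}


router = MiniRouter()
```

Calling router('start', doc) dispatches to start() and returns its modified copy; calling with an unhandled type ('stop', say) falls through to the default and passes the original document back unchanged.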
There are “manager” classes for files and memory buffers. The user may provide their own manager class implementing a different transport mechanism. It need only implement these same methods.
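A custom manager only needs to supply the same open/artifacts/close surface. Here is a minimal in-memory sketch (a hypothetical illustration, not the real MemoryBuffersManager) of what such a class could look like:

```python
import io


class TinyBufferManager:
    """Minimal manager sketch: hands out StringIO buffers and tracks them."""

    def __init__(self):
        self._buffers = {}  # maps (label, postfix) -> buffer

    def open(self, label, postfix, mode, encoding=None, errors=None):
        # This sketch supports only text modes.
        if mode not in ('x', 'xt'):
            raise ValueError("this sketch only supports text modes")
        buffer = io.StringIO()
        self._buffers[(label, postfix)] = buffer
        return buffer

    @property
    def artifacts(self):
        # Group the created buffers by label, mirroring the real managers.
        out = {}
        for (label, _postfix), buffer in self._buffers.items():
            out.setdefault(label, []).append(buffer)
        return out

    def close(self):
        pass  # keep buffers readable after close in this sketch


manager = TinyBufferManager()
f = manager.open('stream_data', 'primary.csv', 'xt')
f.write('a,b\n1,2\n')
```

A Serializer handed this object as its directory argument would use it exactly as it uses MultiFileManager, which is the point of the shared interface.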
- class suitcase.utils.MultiFileManager(directory, allowed_modes=('x', 'xt', 'xb'))[source]¶
A class that manages multiple files.
- Parameters
- directory : str or Path
The directory (as a string or as a Path) in which to create the files.
- allowed_modes : Iterable
Modes accepted by MultiFileManager.open. By default this is restricted to “exclusive creation” modes (‘x’, ‘xt’, ‘xb’), which raise an error if the file already exists. This choice of defaults is meant to protect the user from unintentionally overwriting old files. In situations where overwrite (‘w’, ‘wb’) or append (‘a’, ‘r+b’) modes are needed, they can be added here.
This design is inspired by Python’s zipfile and tarfile libraries.
- property artifacts¶
Provides dictionary mapping artifact labels to (file)names.
- property estimated_sizes¶
Provides dictionary mapping artifact postfix to current size.
- get_artifacts(label=None)[source]¶
Returns list of dicts, each populated with artifact properties.
- Parameters
- label : string
Optional. Filter the returned list to include only artifacts that match the given label value.
- open(label, postfix, mode, encoding=None, errors=None)[source]¶
Request a file handle.
Like the built-in open function, this may be used as a context manager.
- Parameters
- label : string
A label for the sort of content being stored, such as ‘stream_data’ or ‘metadata’.
- postfix : string
Postfix for the file name. Must be unique for this Manager.
- mode : string
One of the allowed_modes set in __init__. The default set of options is {'x', 'xt', 'xb'}: ‘x’ or ‘xt’ for text, ‘xb’ for binary.
- encoding : string or None
Passed through to open. See the Python open documentation for allowed values. Only applicable to text mode.
- errors : string or None
Passed through to open. See the Python open documentation for allowed values.
- Returns
- file : handle
- reserve_name(label, postfix)[source]¶
Ask the wrapper for a filepath.
An external library that needs a filepath (not a handle) may use this instead of the open method.
- Parameters
- label : string
A label for the sort of content being stored, such as ‘stream_data’ or ‘metadata’.
- postfix : string
Postfix for the file name. Must be unique for this Manager.
- Returns
- name : Path
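MultiFileManager's default exclusive-creation modes behave like Python's built-in open in 'x' mode: opening a path that already exists raises FileExistsError rather than silently overwriting. A short demonstration with the built-in open:

```python
import os
import tempfile

# Demonstrate why 'x' (exclusive creation) protects existing files.
directory = tempfile.mkdtemp()
path = os.path.join(directory, 'out.csv')

with open(path, 'x') as f:   # succeeds: the file does not exist yet
    f.write('a,b\n')

try:
    open(path, 'x')          # fails: the file now exists
except FileExistsError:
    already_exists = True
```

This is the failure a user sees if they re-run an export into the same directory with the same file_prefix; switching to 'w' modes (by passing them in allowed_modes) trades that safety for overwriting.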
- class suitcase.utils.MemoryBuffersManager[source]¶
A class that manages multiple StringIO and/or BytesIO instances.
This design is inspired by Python’s zipfile and tarfile libraries.
This class has a special buffers attribute which can be used to retrieve the buffers that have been created.
- property artifacts¶
Provides dictionary mapping artifact labels to buffers.
- property estimated_sizes¶
Provides dictionary mapping artifact postfix to current size.
- get_artifacts(label=None)[source]¶
Returns list of dicts, each populated with artifact properties.
- Parameters
- label : string
Optional. Filter the returned list to include only artifacts that match the given label value.
- open(label, postfix, mode, encoding=None, errors=None)[source]¶
Request a file handle.
Like the built-in open function, this may be used as a context manager.
- Parameters
- label : string
A label for the sort of content being stored, such as ‘stream_data’ or ‘metadata’.
- postfix : string
Relative file path (simply used as an identifier in this case, as there is no actual file). Must be unique for this Manager.
- mode : {'x', 'xt', 'xb'}
‘x’ or ‘xt’ for text, ‘xb’ for binary.
- encoding : string or None
Not used. Accepted for compatibility with the built-in open().
- errors : string or None
Not used. Accepted for compatibility with the built-in open().
- Returns
- file : handle
- reserve_name(label, postfix)[source]¶
This action is not valid on this manager. It will always raise.
- Parameters
- label : string
A label for the sort of content being stored, such as ‘stream_data’ or ‘metadata’.
- postfix : string
Relative file path. Must be unique for this Manager.
- Raises
- SuitcaseUtilsTypeError
These classes are used by the MemoryBuffersManager.
- class suitcase.utils.PersistentStringIO(initial_value='', newline='\n')[source]¶
A StringIO that does not clear the buffer when closed.
Note
This StringIO subclass behaves like StringIO except that its close() method, which would normally clear the buffer, has no effect. The clear() method, however, may still be used.
- class suitcase.utils.PersistentBytesIO(initial_bytes=b'')[source]¶
A BytesIO that does not clear the buffer when closed.
Note
This BytesIO subclass behaves like BytesIO except that its close() method, which would normally clear the buffer, has no effect. The clear() method, however, may still be used.
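The idea behind these persistent buffers can be sketched by overriding close() on io.StringIO. This is a simplified illustration of the described behavior, not the actual suitcase.utils implementation:

```python
import io


class SketchPersistentStringIO(io.StringIO):
    """StringIO whose close() leaves the buffer readable (sketch)."""

    def close(self):
        # Intentionally do nothing: the user may still call getvalue()
        # after the Serializer has closed its files at the end of a run.
        pass


buffer = SketchPersistentStringIO()
buffer.write('hello')
buffer.close()  # the buffer is still usable afterwards
```

Without this override, Serializer.close() would free the buffers via the manager, and artifacts retrieved afterward would raise ValueError on access.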