Write Your Own Suitcase

Scope of a Suitcase

Suitcases are Highly Specific

Suitcase translates documents generated by bluesky (or anything that follows its “event model” schema) into file formats.

Suitcase’s design philosophy is to make many well-tailored suitcases rather than try to fit a large range of functionality into one suitcase.

Each file format is implemented in a separate Python package, named suitcase-<format>. As support for new formats is added over time, there may someday be hundreds of suitcase packages. This modular approach will keep the number of dependencies manageable (no need to install heavy I/O libraries that you don’t plan to use). It will also allow each suitcase to be updated and released on its own schedule and maintained by the specific communities, facilities, or users who care about a particular format.

Even “one suitcase per file format” is too broad. Some formats, such as HDF5, enable a huge variety of layouts—too many to configure via a reasonable number of parameters. Therefore, there will never be a “suitcase-hdf5” package, but rather multiple suitcases, each tuned to a specific HDF5 layout such as NeXus or Data Exchange.

Categories of Suitcases

The list of existing and planned suitcases groups them into three categories.

  • “One-offs” — These are tailored to one specific application, writing files to the requirements of a particular software program or user.

  • “Generics” — These write commonly-requested formats such as TIFF or CSV. There is often room for interpretation in how exactly to lay out the data into a given file format. (One TIFF file per detector? Per Event? Per exposure?) The design process can devolve into tricky judgment calls or a confusing array of options for the user. When in doubt, we encourage you to steer toward writing one or more “one-offs”.

  • “Backends” — These are less user-facing than the other two categories. They write into a store meant to be read back by a programmatic interface. For example, suitcase-mongo inserts documents into MongoDB.

Creating a New Suitcase Package

Create the package with cookiecutter

  1. Install cookiecutter. This is a tool for generating a new Python package from a template.

    pip install --upgrade cookiecutter
    
  2. Use cookiecutter to create a new suitcase package. Just follow the prompts.

    cookiecutter https://github.com/NSLS-II/suitcase-cookiecutter
    
    subproject_name [ex: tiff, spec, pizza-box]: my-special-format
    subpackage_name [my_special_format]:
    

     This creates a new directory named suitcase-my-special-format with all the “scaffolding” of a working Python package for suitcase.

  3. Initialize the directory as a git repository.

    cd suitcase-my-special-format
    git init
    git add .
    git commit -m "Initial commit"
    
  4. Install the package and its development requirements.

    pip install -e .
    pip install -r requirements-dev.txt
    

Write the Serializer

Before reading this section, read to the end of Usage.

All suitcase packages must contain a Serializer class with the interface outlined below. It should also contain an export() function. These should be in suitcase/my_special_format/__init__.py.

Here is a sketch of a Serializer:

import event_model
from pathlib import Path
import suitcase.utils

class Serializer(event_model.DocumentRouter):
    def __init__(self, directory, file_prefix='{uid}', **kwargs):

        self._file_prefix = file_prefix
        self._kwargs = kwargs
        self._templated_file_prefix = ''  # set when we get a 'start' document

        if isinstance(directory, (str, Path)):
            # The user has given us a filepath; they want files.
            # Set up a MultiFileManager for them.
            self._manager = suitcase.utils.MultiFileManager(directory)
        else:
            # The user has given us their own Manager instance. Use that.
            self._manager = directory

        # Finally, we usually need some state related to stashing file
        # handles/buffers. For a Serializer that only needs *one* file
        # this may be:
        #
        # self._output_file = None
        #
        # For a Serializer that writes a separate file per stream:
        #
        # self._files = {}

    @property
    def artifacts(self):
        # The 'artifacts' are the manager's way of exposing to the user
        # the resources that were created. For
        # `MultiFileManager`, the artifacts are filenames.  For
        # `MemoryBuffersManager`, the artifacts are the buffer objects
        # themselves. The Serializer, in turn, exposes that to the user here.
        #
        # This must be a property, not a plain attribute, because the
        # manager's `artifacts` attribute is also a property, and we must
        # access it anew each time to be sure to get the latest contents.
        return self._manager.artifacts

    def close(self):
        self._manager.close()

    # These methods enable the Serializer to be used as a context manager:
    #
    # with Serializer(...) as serializer:
    #     ...
    #
    # which always calls close() on exit from the with block.

    def __enter__(self):
        return self

    def __exit__(self, *exception_details):
        self.close()

    # Each of the methods below corresponds to a document type. As
    # documents flow in through Serializer.__call__, the DocumentRouter base
    # class will forward them to the method with the name corresponding to
    # the document's type: RunStart documents go to the 'start' method,
    # etc.
    #
    # In each of these methods:
    #
    # - If needed, obtain a new file/buffer from the manager and stash it
    #   on instance state (self._files, etc.) if you will need it again
    #   later. Example:
    #
    #   filename = f'{self._templated_file_prefix}-primary.csv'
    #   file = self._manager.open('stream_data', filename, 'xt')
    #   self._files['primary'] = file
    #
    #   See the manager documentation below for more about the arguments to open().
    #
    # - Write data into the file, usually something like:
    #
    #   content = my_function(doc)
    #   file.write(content)
    #
    #   or
    #
    #   my_function(doc, file)

    def start(self, doc):
        # Fill in the file_prefix with the contents of the RunStart document.
        # As in, '{uid}' -> 'c1790369-e4b2-46c7-a294-7abfa239691a'
        # or 'my-data-from-{plan_name}' -> 'my-data-from-scan'
        self._templated_file_prefix = self._file_prefix.format(**doc)
        ...

    def descriptor(self, doc):
        ...

    def event_page(self, doc):
        # There are other representations of Event data -- 'event' and
        # 'bulk_events' (deprecated). But that does not concern us because
        # DocumentRouter will convert these representations to 'event_page'
        # and then route them through here.
        ...

    def stop(self, doc):
        ...
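
Once the methods are filled in, the Serializer can be driven directly by feeding it (name, doc) pairs and closing it when the documents are exhausted. Here is a minimal usage sketch; documents stands in for any iterable of (name, doc) pairs, such as the saved documents of a previous run.

serializer = Serializer('my_output_directory/')
for name, doc in documents:
    serializer(name, doc)
serializer.close()

# Equivalently, using the context-manager support defined above:
with Serializer('my_output_directory/') as serializer:
    for name, doc in documents:
        serializer(name, doc)

print(serializer.artifacts)  # e.g. {'stream_data': [list of filenames]}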

See the API Documentation below for more information about DocumentRouter and MultiFileManager.

Any of the existing suitcases may be useful as a reference. We recommend these in particular:

Note

Why not put the boilerplate code above into a base class, like BaseSerializer and use inheritance?

The amount of boilerplate is not large, and it may be easier to simply copy it than to cross-reference between a subclass and a base class. Additionally, the details can vary enough from one Serializer to the next that inheritance tends to get messy.

Add an export function

This is just a simple wrapper around the Serializer. It takes a generator of (name, doc) pairs and pushes them through the Serializer.

def export(gen, directory, file_prefix='{uid}-', **kwargs):
    with Serializer(directory, file_prefix, **kwargs) as serializer:
        for item in gen:
            serializer(*item)

    return serializer.artifacts
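
For example, a hypothetical call might look like this, where docs stands in for an iterable of (name, doc) pairs and the directory and file_prefix values are only illustrative. The '{plan_name}' template assumes the RunStart document includes a plan_name key, as documents from bluesky plans normally do.

# Hypothetical call; 'exported_files/' and docs are placeholders.
artifacts = export(docs, 'exported_files/', file_prefix='{plan_name}-')
# artifacts maps labels such as 'stream_data' to the files that were written.
print(artifacts)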

Test the Serializer

The suitcase-utils package provides a parametrized pytest fixture, example_data, for generating test data. Tests should go in suitcase/my_special_format/tests/tests.py.

from suitcase.my_special_format import export


def test_export(tmp_path, example_data):
    # Exercise the exporter on the myriad cases parametrized in example_data.
    documents = example_data()
    artifacts = export(documents, tmp_path)
    # For extra credit, read back the data
    # and check that it looks right.
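
The “extra credit” can be made concrete with a second test. Here is a hypothetical sketch; the 'stream_data' label and the assumption of a text-based output come from the Serializer example above and may not match your format.

def test_export_creates_artifacts(tmp_path, example_data):
    # Hypothetical read-back check, assuming the Serializer registers its
    # output under the 'stream_data' label and writes a text-based format.
    documents = example_data()
    artifacts = export(documents, tmp_path)
    for filename in artifacts.get('stream_data', []):
        with open(filename) as file:
            assert file.read()  # each file exists and is not empty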

Run the tests with pytest:

pytest

API Documentation

The DocumentRouter is typically useful as a base class for a Serializer.

class event_model.DocumentRouter(*, emit=None)[source]

Route each document by type to a corresponding method.

When an instance is called with a document type and a document like:

router(name, doc)

the document is passed to the method of the corresponding name, as in:

getattr(router, name)(doc)

The method is expected to return None or a valid document of the same type. It may be the original instance (passed through), a copy, or a different dict altogether.

Finally, the call to router(name, doc) returns:

(name, getattr(router, name)(doc))
Parameters
emit: callable, optional

Expected signature f(name, doc)
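
As a small illustration of this routing (not tied to any particular suitcase), consider a trivial subclass. The minimal dicts passed in at the end are illustrative stand-ins, not complete documents.

import event_model


class Printer(event_model.DocumentRouter):
    # Only the methods we override do anything; other document types
    # simply pass through the base class unchanged.
    def start(self, doc):
        print('run started:', doc['uid'])

    def stop(self, doc):
        print('run stopped:', doc.get('exit_status'))


printer = Printer()
# Feed it (name, doc) pairs, as the RunEngine or a document generator would.
printer('start', {'uid': 'example-run-uid', 'time': 0})
printer('stop', {'uid': 'example-stop-uid', 'run_start': 'example-run-uid',
                 'time': 1, 'exit_status': 'success'})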

There are “manager” classes for files and memory buffers. The user may provide their own manager class implementing a different transport mechanism. It need only implement these same methods.

class suitcase.utils.MultiFileManager(directory, allowed_modes=('x', 'xt', 'xb'))[source]

A class that manages multiple files.

Parameters
directory : str or Path

The directory (as a string or as a Path) to create the files inside.

allowed_modes : Iterable

Modes accepted by MultiFileManager.open. By default this is restricted to “exclusive creation” modes (‘x’, ‘xt’, ‘xb’) which raise an error if the file already exists. This choice of defaults is meant to protect the user from unintentionally overwriting old files. In situations where overwrite (‘w’, ‘wb’) or append (‘a’, ‘r+b’) modes are needed, they can be added here.

This design is inspired by Python’s zipfile and tarfile libraries.
property artifacts

Provides dictionary mapping artifact labels to (file)names.

close()[source]

Close all files opened by the manager.

property estimated_sizes

Provides dictionary mapping artifact postfix to current size.

get_artifacts(label=None)[source]

Returns list of dicts, each populated with artifact properties.

Parameters
label : string

Optional. Filter returned list to include only artifacts that match the given label value.

open(label, postfix, mode, encoding=None, errors=None)[source]

Request a file handle.

Like the built-in open function, this may be used as a context manager.

Parameters
label : string

A label for the sort of content being stored, such as ‘stream_data’ or ‘metadata’.

postfix : string

Postfix for the file name. Must be unique for this Manager.

mode : string

One of the allowed_modes set in __init__. The default set of options is {‘x’, ‘xt’, ‘xb’} — ‘x’ or ‘xt’ for text, ‘xb’ for binary.

encoding : string or None

Passed through to open. See Python open documentation for allowed values. Only applicable to text mode.

errors : string or None

Passed through to open. See Python open documentation for allowed values.

Returns
file : handle
reserve_name(label, postfix)[source]

Ask the wrapper for a filepath.

An external library that needs a filepath (not a handle) may use this instead of the open method.

Parameters
label : string

A label for the sort of content being stored, such as ‘stream_data’ or ‘metadata’.

postfix : string

Postfix for the file name. Must be unique for this Manager.

Returns
name : Path
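
A short sketch of using MultiFileManager directly, outside of any Serializer; the path, label, and file contents here are only illustrative.

from suitcase.utils import MultiFileManager

# 'example_output/' is a placeholder path.
manager = MultiFileManager('example_output/')
file = manager.open('stream_data', 'counts.csv', 'xt')  # exclusive text mode
file.write('x,y\n1,2\n')
manager.close()
print(manager.artifacts)  # e.g. {'stream_data': ['example_output/counts.csv']}

# To permit overwriting or appending, pass additional allowed_modes:
# MultiFileManager('example_output/', allowed_modes=('x', 'xt', 'xb', 'w', 'wb'))
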
class suitcase.utils.MemoryBuffersManager[source]

A class that manages multiple StringIO and/or BytesIO instances.

This design is inspired by Python’s zipfile and tarfile libraries.

This has a special buffers attribute which can be used to retrieve the buffers it has created.

property artifacts

Provides dictionary mapping artifact labels to (file)names.

close()[source]

Close all buffers opened by the manager.

property estimated_sizes

Provides dictionary mapping artifact postfix to current size.

get_artifacts(label=None)[source]

Returns list of dicts, each populated with artifact properties.

Parameters
label : string

Optional. Filter returned list to include only artifacts that match the given label value.

open(label, postfix, mode, encoding=None, errors=None)[source]

Request a file handle.

Like the built-in open function, this may be used as a context manager.

Parameters
label : string

A label for the sort of content being stored, such as ‘stream_data’ or ‘metadata’.

postfix : string

Relative file path (simply used as an identifier in this case, as there is no actual file). Must be unique for this Manager.

mode : {‘x’, ‘xt’, ‘xb’}

‘x’ or ‘xt’ for text, ‘xb’ for binary

encoding : string or None

Not used. Accepted for compatibility with built-in open().

errors : string or None

Not used. Accepted for compatibility with built-in open().

Returns
file : handle
reserve_name(label, postfix)[source]

This action is not valid on this manager. It will always raise.

Parameters
label : string

A label for the sort of content being stored, such as ‘stream_data’ or ‘metadata’.

postfix : string

Relative file path. Must be unique for this Manager.

Raises
SuitcaseUtilsTypeError
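
Because the Serializer sketch above accepts a manager object in place of a directory, a MemoryBuffersManager can be dropped in to capture output in memory, which is handy in tests. A sketch, assuming suitcase.my_special_format is the package created earlier and documents is an iterable of (name, doc) pairs:

from suitcase.utils import MemoryBuffersManager
from suitcase.my_special_format import Serializer

manager = MemoryBuffersManager()
with Serializer(manager) as serializer:  # the manager stands in for a directory
    for name, doc in documents:
        serializer(name, doc)

# The artifacts are the in-memory buffers themselves, not filenames.
# PersistentStringIO/PersistentBytesIO keep their contents after close().
buffers = serializer.artifacts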

These classes are used by the MemoryBuffersManager.

class suitcase.utils.PersistentStringIO(initial_value='', newline='\n')[source]

A StringIO that does not clear the buffer when closed.

Note

This StringIO subclass behaves like StringIO except that its close() method, which would normally clear the buffer, has no effect. The clear() method, however, may still be used.

close()[source]

Close the IO object.

Attempting any further operation after the object is closed will raise a ValueError.

This method has no effect if the file is already closed.

class suitcase.utils.PersistentBytesIO(initial_bytes=b'')[source]

A BytesIO that does not clear the buffer when closed.

Note

This BytesIO subclass behaves like BytesIO except that its close() method, which would normally clear the buffer, has no effect. The clear() method, however, may still be used.

close()[source]

Disable all I/O operations.