External Assets#
The Documents#
The data model allows an Event document to contain a mixture of literal values and references to externally-stored values. Scalar values and very small arrays are typically placed directly in the document; large arrays such as from area detectors are typically stored externally.
This design keeps the documents of reasonable size—suitable for storing in MongoDB or viewing directly as JSON text—and it allows large assets to be loaded only when needed.
Suppose we have this Event
# 'event' document
{'data': {'image': 'aa10035d-1d2b-41d9-97e6-03e3fe62fa6c/5',
'temperature': 5.0},
'descriptor': '219310e0-faa0-4990-84a0-95b508d4ae35',
...}
where the other fields in the Event have been omitted (...
) for brevity.
We can tell that the value of 'image'
is a placeholder, a foreign key
referencing some array yet to be retrieved, by consulting the Event Descriptor
referenced by this Event.
(Of course, a human may be able to guess that the value of 'image'
looks
like a placeholder, but that wouldn’t help a program.)
Here is the Event Descriptor that goes with this Event; we can tell that
because its 'uid'
matches the Event’s 'descriptor'
field.
# 'descriptor' document
{'uid': '219310e0-faa0-4990-84a0-95b508d4ae35',
'data_keys':
{'image':
{'source': '...',
'shape': [512, 512],
'dtype': 'array',
'external': '...'},
'temperature':
{'source': '...',
'shape': [],
'dtype': 'number'}}
...}
The presence of the key 'external'
indicates that the Events’ 'image'
contains a reference to an asset outside the documents. (The value of
that key is not currently used by any part of the system; only its existence is
checked for. The value may be used in the futue as a hook for integration with
outside systems.)
Returning to our Event
# 'event' document
{'data': {'image': 'aa10035d-1d2b-41d9-97e6-03e3fe62fa6c/5',
'temperature': 5.0}
'descriptor': '219310e0-faa0-4990-84a0-95b508d4ae35',
...}
now that we know that 'image'
is external, the value
'aa10035d-1d2b-41d9-97e6-03e3fe62fa6c/5'
must be a datum_id
,
referencing a Datum document. Here is the matching Datum document. This
document can be used to retrieve some data that belong in our Event. In our
example it might be an image or stack of images that were taken at a
given temperature during a temperature scan.
# 'datum' document
{'datum_id': 'aa10035d-1d2b-41d9-97e6-03e3fe62fa6c/5',
'datum_kwargs': {'index': 5},
'resource': 'aa10035d-1d2b-41d9-97e6-03e3fe62fa6c'}
You will notice that we still do not have a filepath anywhere here. It is common for many Datum documents to point into the same file (e.g. a large HDF5 file) or series of files (e.g. TIFF series). Rather than store that information separately and redundantly in each Datum, the Datum documents point to a Resource document—the last document type we’ll need here—which contains path-related details.
# 'resource' document {'uid': 'aa10035d-1d2b-41d9-97e6-03e3fe62fa6c', 'spec': 'AD_HDF5', 'root': '/GPFS/DATA/Andor/', 'resource_path': '2020/01/03/8ff08ff9-a2bf-48c3-8ff3-dcac0f309d7d.h5', 'resource_kwargs': {'frame_per_point': 10}, 'path_semantics': 'posix', 'uid': '3b300e6f-b431-4750-a635-5630d15c81a8', 'run_start': '10bf6945-4afd-43ca-af36-6ad8f3540bcd'}
The resource_path
is a relative path, all of which is semantic and should
usually not change during the lifecycle of this asset. The root
is more
context-dependent (depending on what system you are accessing the data from)
and subject to change (if the data is moved over time).
The spec
gives us a hint about the format of this asset, whether it be a
file, multiple files, or something more specialized. The resource_kwargs
provide any additional parameters for reading it.
# 'Stream Resource' document {'uid': 'aa10035d-1d2b-41d9-97e6-03e3fe62fa6c', 'mimetype': 'application/x-hdf5', 'uri': 'file://localhost/{path}/GPFS/DATA/Andor/2020/01/03/8ff08ff9.h5', 'parameters': {'frame_per_point': 10}, 'uid': '3b300e6f-b431-4750-a635-5630d15c81a8', 'run_start': '10bf6945-4afd-43ca-af36-6ad8f3540bcd'}
The uri
specifies the location of the data. It may be a path on the local
filesystem, file://localhost/{path}
, a path on a shared filesystem
file://{host}/{path}
, to be remapped at read time via local mount config,
or a non-file-based resource like s3://...
. The {path}
part of the uri
is typically a relative path, all of which is semantic and should usually not
change during the lifecycle of this asset.
The mimetype
is a recognized standard way to specify the I/O procedures to
read the asset. It gives us a hint about the format of this asset, whether it
be a file, multiple files, or something more specialized. We support standard
mimetypes, such as image/tiff
, as well as custom ones, e.g.
application/x-hdf5-smwr-slice
. The parameters
provide any additional
parameters for reading the asset.
Handlers#
In bluesky/databroker, a “handler” is a reader with a special interface. It accepts a Resource document and a Datum document and in exchange returns the pertinent data.
Handler Interface#
A ‘handler class’ may be any callable with the signature:
handler_class(full_path, **resource_kwargs)
It is expected to return an object, a ‘handler instance’, which is also callable and has the following signature:
handler_instance(**datum_kwargs)
As the names ‘handler class’ and ‘handler instance’ suggest, this is
typically implemented using a class that implements __init__
and
__call__
, with the respective signatures.
class MyHandler:
def __init__(self, path, **resource_kwargs):
# Consume the path information and the 'resource_kwargs' from the
# Resource. Typically stashes some state and/or opens file(s).
...
def __call__(self, **datum_kwargs):
# Consumes the 'datum_kwargs' from the datum and uses them to
# locate a specific unit (slice, chunk, or what you will...) of
# data and return it.
...
return some_array_like
But in general it may be any callable-that-returns-a-callable.
def handler(path, **resource_kwargs):
def f(**datum_kwargs):
return some_array_like
return f
A handler may also implement the instance method get_file_list()
. This
presumes that the data in question comes from a filesystem, which may not
always be the case, which is why this method is optional.
A handler should implement close()
if it caches any file handles, network
connections or other system resources. The lifecycle of a handler is an
implementation detail left up to the application. Below, we comment on how
Filler
and RunRouter
make it easier
to reuse handler instances and clean them up at the proper time.
Handler Discovery#
To discover all the handlers installed in an environment, use
import databroker.core
handler_registry = databroker.core.discover_handlers()
The result, handler_registry
, is a dict mapping specs to handler classes.
It uses an efficient mechanism, described later, for searching the installed
packages for handlers. Thus, its contents will depend on which packages you
have installed. In this case, we have installed the Python package
area-detector-handlers
which includes several handlers for reading the
files output by area detectors.
{'AD_CBF': <class 'area_detector_handlers.handlers.PilatusCBFHandler'>,
'AD_HDF5': <class 'area_detector_handlers.handlers.AreaDetectorHDF5Handler'>,
'AD_HDF5_SWMR': <class 'area_detector_handlers.handlers.AreaDetectorHDF5SWMRHandler'>,
'AD_HDF5_SWMR_TS': <class 'area_detector_handlers.handlers.AreaDetectorHDF5SWMRTimestampHandler'>,
'AD_HDF5_TS': <class 'area_detector_handlers.handlers.AreaDetectorHDF5TimestampHandler'>,
'AD_SPE': <class 'area_detector_handlers.handlers.AreaDetectorSPEHandler'>,
'AD_TIFF': <class 'area_detector_handlers.handlers.AreaDetectorTiffHandler'>,
'XSP3': <class 'area_detector_handlers._xspress3.Xspress3HDF5Handler'>,
'XSP3_FLY': <class 'area_detector_handlers._xspress3.BulkXSPRESS'>}
To hook into this discovery mechanism, see the section Handler Packaging below.
Filling#
It is rarely necessary to create handlers directly. The
Filler
object is designed to consume documents from a
Run, determine which data is external, and create handlers as needed to access
the external data, and “fill” that external in, moving the datum_id
to a
separate field.
Before filling:
# 'event' document before filling
{'data': {'image': 'aa10035d-1d2b-41d9-97e6-03e3fe62fa6c/5',
'temperature': 5.0},
'descriptor': '219310e0-faa0-4990-84a0-95b508d4ae35',
'filled': {'image': False}
...}
After filling:
# 'event' document after filling
{'data': {'image':, [[...]] # array-like object
'temperature': 5.0},
'descriptor': '219310e0-faa0-4990-84a0-95b508d4ae35',
'filled': {'image': 'aa10035d-1d2b-41d9-97e6-03e3fe62fa6c/5'}
...}
Notice that the datum_id
is still in the document; it has been moved out of
the way into the 'filled'
mapping. The 'filled'
mapping is a way to
track which if any keys on a document “in flight” have already been filled.
It is allowable for an Event or EventPage to be _partially_ filled, where the
'data'
mapping contains a mixture of filled and not-yet-filled items.
Fields that are not externally-stored (such as 'temperature'
in our
example) do not appear in the 'filled'
mapping. Thus, the keys in the
'filled'
mapping are subset of the keys in 'data'
.
A Filler takes in a handler_registry
, such as the one shown in the previous
section.
import event_model
filler = event_model.Filler(handler_registry)
It uses the 'spec'
in each Resource document to find a matching
handler class in its registry. If it cannot find a match for a given spec, an
UndefinedAssetSpecification
error is raised.
Resource Management#
A primary concern here is resource management. Fillers create and cache
instances of handlers, which in turn may cache instances of file handles,
network connections, or other system resources.
When a Filler is closed with close()
or used as a
context manager, it releases all its handlers which in turn should close any
resources they have allocated. The caches used by a Filler are injectable: by
default all relevant documents and handler instances are cached until the
Filler is closed, but the Filler can be configured to use any custom cache
object, such as a cachetools.LRUCache
or
cachetools.LFUCache
, to receive a prepopulated cache, or to share
caches between Filler instances. This is an implementation detail left entirely
up to the application. See Filler
for details on cache
injection. Here is an example where two Fillers share a global LRU cache:
import event_model
import cachetools
handler_registry = {...} # or use databroker.core.discover_handlers()
handler_cache = cachetools.LRUCache(32)
f1 = Filler(handler_registry, handler_cache=handler_cache)
f2 = Filler(handler_registry, handler_cache=handler_cache)
If both fillers are asked for the same Resource, they can share the same handler instance and any system resources cached therein. When the handler is evicted from the LRUCache, the Filler will recover gracefully: an instance will be recreated on demand and put back into the cache.
When streaming data from multiple runs, it is convenient to use the
RunRouter
to manage Filler creation and disposal.
It accepts a handler_registry
and other optional Filler-related arguments.
It uses them to make a separate Filler instance for each Run, which it closes
when it sees the last document from the Run.
import event_model
rr = event_model.RunRouter([...], handler_registry=handler_registry)
Handler Packaging#
Packages can use the 'databroker.handlers'
entrypoint
to declare that they include some handlers. See for example this excerpt from
the setup.py
in bluesky/area-detector-handlers
setup(
...
entry_points={
"databroker.handlers": [
"AD_SPE = area_detector_handlers.handlers:AreaDetectorSPEHandler",
"AD_TIFF = area_detector_handlers.handlers:AreaDetectorTiffHandler",
"AD_HDF5 = area_detector_handlers.handlers:AreaDetectorHDF5Handler",
"AD_HDF5_SWMR = area_detector_handlers.handlers:AreaDetectorHDF5SWMRHandler",
"AD_HDF5_TS = area_detector_handlers.handlers:AreaDetectorHDF5TimestampHandler",
"AD_HDF5_SWMR_TS = area_detector_handlers.handlers:AreaDetectorHDF5SWMRTimestampHandler",
"XSP3 = area_detector_handlers.handlers:Xspress3HDF5Handler",
"AD_CBF = area_detector_handlers.handlers:PilatusCBFHandler",
"XSP3_FLY = area_detector_handlers.handlers:BulkXSPRESS",
"IMM = area_detector_handlers.handlers:IMMHandler",
]
},
...)
On the left-hand side of the =
is given the spec, matching the 'spec'
in the Resource document, and on the right-hand side is given the
path.to.module:object_name
of the handler class that can handle that type
of asset.