Recording Metadata#

Capturing useful metadata is the main objective of bluesky. The more information you can provide about what you are doing and why you are doing it, the more useful bluesky and downstream data search and analysis tools can be.

The term “metadata” can be controversial: one scientist’s “data” is another’s “metadata”, and classification is context-dependent. The same exact information can be “data” in one experiment but “metadata” in a different experiment done on the exact same hardware. The Document Model provides a framework for deciding _where_ to record a particular piece of information.

There are some things that we know a priori before doing an experiment: Where are we? Who is the user? What sample are we looking at? What did the user just ask us to do? These are all things that we can, in principle, know independent of the control system. These are the prime candidates for inclusion in the Start Document. Downstream, DataBroker provides tools to do rich searches on this data. The more information you can include, the better.

There is some information that is nominally independent of any particular device but that we must obtain from the control system, for example the location of important but un-scanned motors or the configuration of beam attenuators. If the values should stay fixed over the course of the experiment, then this information is a good candidate for a “baseline device”, either via the Supplemental pre-processor or explicitly in custom plans. This will put the readings in a separate stream (which is a peer to the “primary” data). In principle, these values could be read from the control system once and put into the Start document along with the a priori information; however, that has several drawbacks:

  1. There is only ever one reading of the values, so if they drift during data acquisition, you will never know.

  2. We cannot automatically capture information about the device like we do for data in Events. This includes things like the datatype, units, and shape of the value and any configuration information about the hardware it is being read from.

A third class of information that can be called “metadata” is configuration information of pieces of hardware. These are things like the velocity of a motor or the integration time of a detector. These readings are embedded in the Descriptor and are extracted from the hardware via the read_configuration method of the hardware. We expect that these values will not change over the course of the experiment, so we read them only once.

Information that does not fall into one of these categories, because you expect it to change during the experiment, should be treated as “data”, either as an explicit part of the experimental plan or via Monitoring.

Adding to the Start Document#

When the RunEngine mints a Start document it includes structured data. That information can be injected via several mechanisms:

  1. entered interactively by the user at execution time

  2. provided in the code of the plan

  3. automatically inferred

  4. entered by user once and stashed for reuse on all future plans

If there is a conflict between these sources, the higher entry in this list wins. The “closer” to the user the information originated, the higher its priority.
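The precedence rule can be modeled with a ChainMap from the Python standard library. This is only a plain-Python sketch of the ordering described here, not the RunEngine's actual implementation, and every dictionary name and value below is made up for illustration.

```python
from collections import ChainMap

# Hypothetical metadata from each of the four sources listed above.
interactive = {"purpose": "test"}  # 1. typed by the user at execution time
from_plan = {"purpose": "calibration", "plan_name": "my_plan"}  # 2. plan code
inferred = {"plan_name": "plan", "plan_type": "generator"}  # 3. inferred
stashed = {"operator": "Dan", "purpose": "commissioning"}  # 4. stashed for reuse

# ChainMap searches its maps from left to right, so earlier sources win.
merged = dict(ChainMap(interactive, from_plan, inferred, stashed))
print(merged)
```

Keys that appear in only one source, such as operator, pass through unchanged; keys that appear in several sources take the value from the source highest in the list.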

1. Interactively, for One Use#

Suppose we are executing some custom plan called plan.

RE(plan())

If we give arbitrary extra keyword arguments to RE, they will be interpreted as metadata.

RE(plan(), sample_id='A', purpose='calibration', operator='Dan')

The run(s) — i.e., datasets — generated by plan() will include the custom metadata:

...
'sample_id': 'A',
'purpose': 'calibration',
'operator': 'Dan',
...

If plan generates more than one run, all the runs will get this metadata. For example, this plan generates three different runs.

from bluesky.plans import count, scan
from ophyd.sim import det, motor  # simulated detector, motor

def plan():
    yield from count([det])
    yield from scan([det], motor, 1, 5, 5)
    yield from count([det])

If executed as above:

RE(plan(), sample_id='A', purpose='calibration', operator='Dan')

each run will get a copy of the sample_id, purpose and operator metadata.

2. Through a plan#

Revisiting the previous example:

def plan():
    yield from count([det])
    yield from scan([det], motor, 1, 5, 5)
    yield from count([det])

we can pass different metadata for each run. Every built-in pre-assembled plan accepts a parameter md, which you can use to inject metadata that applies only to that plan.

def plan():
    yield from count([det], md={'purpose': 'calibration'})  # one
    yield from scan([det], motor, 1, 5, 5, md={'purpose': 'good data'})  # two
    yield from count([det], md={'purpose': 'sanity check'})  # three

The metadata passed into RE is combined with the metadata passed in to each plan. Thus, calling

RE(plan(), sample_id='A', operator='Dan')

generates these three sets of metadata:

# one
...
'sample_id': 'A',
'purpose': 'calibration',
'operator': 'Dan',
...

# two
...
'sample_id': 'A',
'purpose': 'good data',
'operator': 'Dan',
...

# three
...
'sample_id': 'A',
'purpose': 'sanity check',
'operator': 'Dan',
...

If there is a conflict, RE keywords take precedence. So

RE(plan(), purpose='test')

would override the individual ‘purpose’ metadata from the plan, marking all three as purpose=test.
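The combination rule described in this section can be written out as an ordinary dictionary merge. The sketch below is a plain-Python model of that rule, not bluesky code; the helper name combine is hypothetical.

```python
# Per-run metadata hard-coded in the plan (the three md={...} arguments above).
plan_md = [
    {"purpose": "calibration"},
    {"purpose": "good data"},
    {"purpose": "sanity check"},
]


def combine(run_md, re_kwargs):
    """Merge one run's plan metadata with the RE keywords; later keys win."""
    return {**run_md, **re_kwargs}


# RE(plan(), sample_id='A', operator='Dan'): no conflict, keys are combined.
no_conflict = [combine(md, {"sample_id": "A", "operator": "Dan"}) for md in plan_md]

# RE(plan(), purpose='test'): the RE keyword replaces every per-run 'purpose'.
conflict = [combine(md, {"purpose": "test"}) for md in plan_md]

print([md["purpose"] for md in no_conflict])
print([md["purpose"] for md in conflict])
```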

If you define your own plans, it is best practice to have them take a keyword-only argument md=None. This allows the hard-coded metadata to be overridden later:

def plan(*, md=None):
    md = md or {}  # handle the default case
    # unpacking **md at the end means it "wins",
    # so if the user calls
    #    yield from plan(md={'purpose': 'bob'})
    # it will override these values
    yield from count([det], md={'purpose': 'calibration', **md})
    yield from scan([det], motor, 1, 5, 5, md={'purpose': 'good data', **md})
    yield from count([det], md={'purpose': 'sanity check', **md})

This is consistent with all of the Pre-assembled Plans.
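The override relies only on how Python builds a dictionary literal: when a key appears more than once, the last occurrence wins. The check below uses a hypothetical helper, build_md, that mimics the md handling of the custom plan above without running any plan.

```python
def build_md(md=None):
    """Mimic the md handling in the custom plan above (illustrative helper)."""
    md = md or {}  # treat md=None as "no extra metadata"
    # The hard-coded default comes first; **md is unpacked last and overrides it.
    return {"purpose": "calibration", **md}


default_md = build_md()
overridden_md = build_md({"purpose": "bob", "sample_id": "A"})
print(default_md)
print(overridden_md)
```

Placing **md first instead would reverse the outcome, so the hard-coded value would always win and callers could not override it.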

For more on injecting metadata via plans, refer to this section of the tutorial.

Note

All of the built-in plans provide certain metadata automatically. Custom plans are not required to provide any of this, but it is a nice pattern to follow.

  • plan_name — e.g., 'scan'

  • detectors — a list of the names of the detectors

  • motors — a list of the names of the motors

  • plan_args — dict of keyword arguments passed to the plan

  • plan_pattern – function used to create the trajectory

  • plan_pattern_module — Python module where plan_pattern is defined

  • plan_pattern_args — dict of keyword arguments passed to plan_pattern to create the trajectory

The plan_name and plan_args together should provide sufficient information to recreate the plan. The detectors and motors are convenient keys to search on later.

The plan_pattern* entries provide lower-level, more explicit information about the trajectory (“pattern”) generated by the plan, separate from the specific detectors and motors involved. For complex trajectories like spirals, this is especially useful. As a simple example, here is the pattern-related metadata for scan().

...
'plan_pattern': 'linspace',
'plan_pattern_module': 'numpy',
'plan_pattern_args': dict(start=start, stop=stop, num=num)
...

Thus, one can re-create the “pattern” (trajectory) like so:

numpy.linspace(**dict(start=start, stop=stop, num=num))

3. Automatically#

For each run, the RunEngine automatically records:

  • ‘time’ — In this context, the start time. (Other times are also recorded.)

  • ‘uid’ — a globally unique ID for this run

  • ‘plan_name’ — the function or class name of plan (e.g., ‘count’)

  • ‘plan_type’ — the Python type of plan (e.g., ‘generator’)

The last two can be overridden by any of the methods above. The first two cannot be overridden by the user.

Note

If some custom plan does not specify a ‘plan_name’ and ‘plan_type’, the RunEngine infers them as follows:

plan_type = type(plan).__name__
plan_name = getattr(plan, '__name__', '')

These may be more or less informative depending on what plan is. They are just heuristics to provide some information by default if the plan itself and the user do not provide it.
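To see what each of the two expressions evaluates to, they can be tried on ordinary Python objects. The generator function my_plan below is a stand-in that yields a placeholder string rather than real bluesky messages.

```python
def my_plan():
    """A stand-in generator function; a real plan would yield bluesky messages."""
    yield "placeholder"


plan = my_plan()  # calling a generator function returns a generator object

print(type(plan).__name__)            # name of the object's Python type
print(getattr(plan, "__name__", ""))  # name of the function that created it

# An object without a __name__ attribute, such as a list, falls back to ''.
steps = ["placeholder"]
print(type(steps).__name__)
print(repr(getattr(steps, "__name__", "")))
```

For the generator, the first expression gives 'generator' and the second gives 'my_plan'. For the list, the second expression gives an empty string, which is why the heuristic is more informative for some plans than for others.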

4. Interactively, for Repeated Use#

Each time a plan is executed, the current contents of RE.md are copied into the metadata for all runs generated by the plan. To enter metadata once to reuse on all plans, add it to RE.md.

RE.md['proposal_id'] = 123456
RE.md['project'] = 'flying cars'
RE.md['dimensions'] = (5, 3, 10)

View its current contents,

RE.md

delete a key you want to stop using,

del RE.md['project']   # delete a key

or use any of the standard methods that apply to dictionaries in Python.
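A few of those dictionary methods are sketched below. A plain dict named md stands in for RE.md so that the example runs without a RunEngine; the keys and values are illustrative only.

```python
# A plain dict stands in for RE.md here.
md = {}
md["proposal_id"] = 123456
md["project"] = "flying cars"
md["dimensions"] = (5, 3, 10)

md.update(operator="Dan")          # add or change several keys at once
removed = md.pop("project", None)  # remove a key, tolerating its absence
has_project = "project" in md      # membership test

# A copy of the stash is decoupled from later edits to the original.
snapshot = dict(md)
md["operator"] = "Ada"
print(snapshot["operator"], md["operator"])
```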

Warning

In general we recommend against putting device readings in the Start document. (The Start document is for who/what/why/when, things you know before you start communicating with hardware.) It is especially critical that you do not put device readings in the RE.md dictionary. The value will remain until you change it and will not track the state of the hardware. This will result in recording out-of-date, incorrect data!

This can be particularly dangerous if RE.md is backed by a persistent data store (see next section) because out-of-date readings will last across sessions.

The scan_id, an integer that the RunEngine automatically increments at the beginning of each scan, is stored in RE.md['scan_id'].

Warning

Clearing all keys, like so:

RE.md.clear()  # clear *all* keys

will reset the scan_id. The next time a plan is executed, the RunEngine will start with a scan_id of 1 and set

RE.md['scan_id'] = 1

Some readers may prefer to reset the scan ID to 1 at the beginning of a new experiment; others may wish to maintain a single unbroken sequence of scan IDs forever.

From a technical standpoint, it is fine to have duplicate scan IDs. All runs also have a randomly generated ‘uid’ (“unique ID”) that is globally unique forever.
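The effect of clearing the stash on the scan ID sequence can be modeled in a few lines. This is an illustrative model of the behavior described above, not the RunEngine's code; next_scan_id is a hypothetical helper and md is a plain dict standing in for RE.md.

```python
def next_scan_id(md):
    """Illustrative model: increment md['scan_id'], starting from 1 if absent."""
    md["scan_id"] = md.get("scan_id", 0) + 1
    return md["scan_id"]


md = {"operator": "Dan", "scan_id": 41}
first = next_scan_id(md)   # continues the existing sequence
md.clear()                 # removes every key, including 'scan_id'
second = next_scan_id(md)  # the sequence restarts
print(first, second)
```

Note that clear() also discards every other key in the stash, such as operator in this example.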

Persistence Between Sessions#

We provide a way to save the contents of the metadata stash RE.md between sessions (e.g., exiting and re-opening IPython).

In general, the RE.md attribute may be anything that supports the dictionary interface. The simplest is just a plain Python dictionary.

RE.md = {}

To persist metadata between sessions, bluesky provides bluesky.utils.PersistentDict — a Python dictionary synced with a directory of files on disk backed by zict. Any changes made to RE.md are synced to the file, so the contents of RE.md can persist between sessions.

from bluesky.utils import PersistentDict
RE.md = PersistentDict('some/path/here')

zict v3 changed how the contents are serialized to disk such that only one Python process can reliably interact with the files at a time, which is suitable for testing and small-scale applications. Installing zict v2 avoids this problem, but causes conflicts with other packages (such as dask).
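Because RE.md only needs to support the dictionary interface, the general idea of a file-backed stash can be illustrated with the standard library alone. The class below, JSONFileDict, is hypothetical and is not how PersistentDict works: it rewrites a single JSON file on every change instead of using zict, and it is meant only as a sketch of the concept.

```python
import json
import os
import tempfile
from collections.abc import MutableMapping


class JSONFileDict(MutableMapping):
    """Hypothetical dict-like object that rewrites one JSON file on every change."""

    def __init__(self, path):
        self._path = path
        self._data = {}
        if os.path.exists(path):
            with open(path) as f:
                self._data = json.load(f)  # restore a previous session

    def _flush(self):
        with open(self._path, "w") as f:
            json.dump(self._data, f)

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value
        self._flush()

    def __delitem__(self, key):
        del self._data[key]
        self._flush()

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


# Simulate two sessions with two objects that point at the same file.
path = os.path.join(tempfile.mkdtemp(), "md.json")
session_one = JSONFileDict(path)
session_one["proposal_id"] = 123456
session_two = JSONFileDict(path)
print(dict(session_two))
```

Like the single-process caveat above, this sketch makes no attempt to coordinate several processes writing to the same file.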

Bluesky formerly recommended using HistoryDict, a Python dictionary backed by a SQLite database file. This approach proved problematic with the threading introduced in bluesky v1.6.0, so it is no longer recommended. If you have been following that recommendation, you should migrate your metadata from HistoryDict to PersistentDict. First, update your configuration to make RE.md a PersistentDict as shown above. Then, migrate like so:

from bluesky.utils import get_history
old_md = get_history()
RE.md.update(old_md)

The PersistentDict object has been back-ported to bluesky v1.5.6 as well. It is not available in 1.4.x or older, so once you move to the new system, you must run bluesky v1.5.6 or higher.

Warning

The RE.md object can also be set when the RunEngine is instantiated:

# This:
RE = RunEngine(...)

# is equivalent to this:
RE = RunEngine({})
RE.md = ...

As we stated at the start of the tutorial, if you are using bluesky at a user facility or with shared configuration, your RE may already be configured, and defining a new RE as above can result in data loss! If you aren’t sure, it’s safer to use RE.md = ....

Allowed Data Types#

Custom metadata keywords can be mapped to:

  • strings — e.g., task='calibration'

  • numbers — e.g., attempt=5

  • lists or tuples — e.g., dimensions=[1, 3]

  • (nested) dictionaries — e.g., dimensions={'width': 1, 'height': 3}
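A recursive check for the types in this list might look like the sketch below. The helper is_allowed is hypothetical and mirrors only the list above; bluesky itself may accept or reject values differently, and the requirement that nested dictionary keys be strings is an assumption.

```python
def is_allowed(value):
    """Return True if value uses only the types listed above (illustrative check)."""
    if isinstance(value, (str, int, float)):
        return True
    if isinstance(value, (list, tuple)):
        return all(is_allowed(item) for item in value)
    if isinstance(value, dict):
        # Assumption: keys of nested dictionaries are strings.
        return all(isinstance(k, str) and is_allowed(v) for k, v in value.items())
    return False


print(is_allowed({"width": 1, "height": 3}))
print(is_allowed({1, 3}))  # a set is not in the list above
```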

Required Fields#

The fields:

  • uid

  • time

are reserved by the document model and cannot be set by the user.

In current versions of bluesky, no fields are universally required by bluesky itself. It is possible to specify your own required fields in local configuration. See Validation. (At NSLS-II, there are facility-wide requirements coming soon.)

Special Fields#

Arbitrary custom fields are allowed — you can invent any names that are useful to you.

But certain fields are given special significance by bluesky’s document model, and are either disallowed or required to be a certain type.

The fields:

  • owner

  • group

  • project

are optional but, to facilitate searchability, if they are not blank they must be strings. A non-string, like owner=5, will produce an error that will interrupt scan execution immediately after it starts.

Similarly, the keyword sample has special significance. It must be either a string or a dictionary.

The scan_id field is expected to be an integer, and it is automatically incremented between runs. If a scan_id is not provided by the user or stashed in the persistent metadata from the previous run, it defaults to 1.
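The type rules in this section can be checked before a plan is run, so that a mistake surfaces earlier than the error raised during scan execution. The function below is a hypothetical pre-check that mirrors only the rules stated here; the exception type and messages are assumptions and do not reproduce bluesky's own errors.

```python
def check_special_fields(md):
    """Raise TypeError if md breaks the type rules described above (illustrative)."""
    for key in ("owner", "group", "project"):
        if key in md and not isinstance(md[key], str):
            raise TypeError(f"{key!r} must be a string, got {type(md[key]).__name__}")
    if "sample" in md and not isinstance(md["sample"], (str, dict)):
        raise TypeError("'sample' must be a string or a dictionary")
    if "scan_id" in md and not isinstance(md["scan_id"], int):
        raise TypeError("'scan_id' must be an integer")


check_special_fields({"owner": "Dan", "sample": {"name": "A"}, "scan_id": 3})  # passes

try:
    check_special_fields({"owner": 5})
except TypeError as err:
    owner_error = str(err)
print(owner_error)
```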

Validation#

Additional, customized metadata validation can be added to the RunEngine. For example, to ensure that a run will not be executed unless the parameter ‘sample_number’ is specified, define a function that accepts a dictionary argument and raises if ‘sample_number’ is not found.

def ensure_sample_number(md):
    if 'sample_number' not in md:
        raise ValueError("You forgot the sample number.")

Apply this function by setting

RE.md_validator = ensure_sample_number

The function will be executed immediately before each new run is opened.
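When several keys are required, the same pattern can be generalized with a small factory. The helper require_keys below is hypothetical, not part of bluesky; it returns a function with the same shape as ensure_sample_number, taking the metadata dictionary and raising if something is missing.

```python
def require_keys(*required):
    """Build a validator that raises if any required key is missing (hypothetical)."""

    def validator(md):
        missing = [key for key in required if key not in md]
        if missing:
            raise ValueError(f"Missing required metadata: {', '.join(missing)}")

    return validator


validate = require_keys("sample_number", "operator")
validate({"sample_number": 7, "operator": "Dan"})  # passes silently

try:
    validate({"operator": "Dan"})
except ValueError as err:
    message = str(err)
print(message)
```

Under the same assumptions, the resulting function could be applied with RE.md_validator = require_keys('sample_number', 'operator').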