Recording Metadata

Capturing useful metadata is the main objective of bluesky. The more information you can provide about what you are doing and why you are doing it, the more useful bluesky and downstream data search and analysis tools can be.

When the RunEngine executes a plan, it attaches metadata to the data it collects. It captures metadata that has been:

  1. entered interactively by the user at execution time
  2. provided in the code of the plan
  3. automatically inferred
  4. entered by the user once and stashed for reuse on all future plans

If there is a conflict between these sources, the first entry in this list wins.
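
The precedence rule can be modeled with collections.ChainMap from the standard library, which returns the value from the first mapping that contains a key. This is only an illustrative sketch: the four dictionaries and their contents are hypothetical stand-ins for the four sources listed above, not the RunEngine's internals.

```python
from collections import ChainMap

# Hypothetical stand-ins for the four metadata sources, highest priority first.
interactive = {'purpose': 'test'}                              # 1. keywords given to RE(...)
from_plan = {'purpose': 'calibration', 'plan_name': 'count'}   # 2. provided by the plan
inferred = {'plan_name': 'plan', 'plan_type': 'generator'}     # 3. automatically inferred
stashed = {'operator': 'Dan', 'purpose': 'commissioning'}      # 4. stashed for reuse

# ChainMap searches its mappings left to right, so earlier sources win.
merged = dict(ChainMap(interactive, from_plan, inferred, stashed))
print(merged)
```

Here 'purpose' appears in three sources and the interactive value is the one that survives, while 'operator' comes through from the stash because no higher-priority source defines it.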

1. Interactively, for One Use

Suppose we are executing some custom plan called plan.

RE(plan())

If we give arbitrary extra keyword arguments to RE, they will be interpreted as metadata.

RE(plan(), sample_id='A', purpose='calibration', operator='Dan')

The run(s) — i.e., datasets — generated by plan() will include the custom metadata:

...
'sample_id': 'A',
'purpose': 'calibration',
'operator': 'Dan',
...

If plan generates more than one run, all the runs will get this metadata. For example, this plan generates three different runs.

from bluesky.plans import count, scan
from ophyd.sim import det, motor  # simulated detector and motor

def plan():
    yield from count([det])
    yield from scan([det], motor, 1, 5, 5)
    yield from count([det])

If executed as above:

RE(plan(), sample_id='A', purpose='calibration', operator='Dan')

each run will get a copy of the sample_id, purpose and operator metadata.

2. Through a Plan

Revisiting the previous example:

def plan():
    yield from count([det])
    yield from scan([det], motor, 1, 5, 5)
    yield from count([det])

we can pass different metadata for each run. Every built-in pre-assembled plan accepts a parameter md, which you can use to inject metadata that applies only to that plan.

def plan():
    yield from count([det], md={'purpose': 'calibration'})  # one
    yield from scan([det], motor, 1, 5, 5, md={'purpose': 'good data'})  # two
    yield from count([det], md={'purpose': 'sanity check'})  # three

The metadata passed into RE is combined with the metadata passed in to each plan. Thus, calling

RE(plan(), sample_id='A', operator='Dan')

generates these three sets of metadata:

# one
...
'sample_id': 'A',
'purpose': 'calibration',
'operator': 'Dan',
...

# two
...
'sample_id': 'A',
'purpose': 'good data',
'operator': 'Dan',
...

# three
...
'sample_id': 'A',
'purpose': 'sanity check',
'operator': 'Dan',
...

If there is a conflict, RE keywords take precedence. So

RE(plan(), purpose='test')

would override the individual ‘purpose’ metadata from the plan, marking all three as purpose=test.
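
The merging behavior described in this section can be mimicked with plain dictionaries. The following is a simplified model rather than the RunEngine's own code, and merge_md is a hypothetical helper introduced only for illustration.

```python
def merge_md(plan_md, **re_kwargs):
    """Combine one run's plan metadata with keywords given to RE.

    In a dict display, later entries override earlier ones,
    so the RE keywords win whenever a key appears in both.
    """
    return {**plan_md, **re_kwargs}

# Per-run metadata, as passed through md=... in the three sub-plans.
plan_mds = [
    {'purpose': 'calibration'},   # one
    {'purpose': 'good data'},     # two
    {'purpose': 'sanity check'},  # three
]

# No conflict: each run keeps its own purpose and gains the shared keys.
combined = [merge_md(md, sample_id='A', operator='Dan') for md in plan_mds]

# Conflict: the keyword given to RE replaces every per-run purpose.
overridden = [merge_md(md, purpose='test') for md in plan_mds]
```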

For more on injecting metadata via plans, refer to this section of the tutorial.

Note

All of the built-in plans provide certain metadata automatically. Custom plans are not required to provide any of this, but it is a nice pattern to follow.

  • plan_name — e.g., 'scan'
  • detectors — a list of the names of the detectors
  • motors — a list of the names of the motors
  • plan_args — dict of keyword arguments passed to the plan
  • plan_pattern – function used to create the trajectory
  • plan_pattern_module — Python module where plan_pattern is defined
  • plan_pattern_args — dict of keyword arguments passed to plan_pattern to create the trajectory

The plan_name and plan_args together should provide sufficient information to recreate the plan. The detectors and motors are convenient keys to search on later.

The plan_pattern* entries provide lower-level, more explicit information about the trajectory (“pattern”) generated by the plan, separate from the specific detectors and motors involved. For complex trajectories like spirals, this is especially useful. As a simple example, here is the pattern-related metadata for scan().

...
'plan_pattern': 'linspace',
'plan_pattern_module': 'numpy',
'plan_pattern_args': dict(start=start, stop=stop, num=num)
...

Thus, one can re-create the “pattern” (trajectory) like so:

numpy.linspace(**dict(start=start, stop=stop, num=num))
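
When the metadata is read back later, the function can be looked up by name instead of being hard-coded. The sketch below assumes the three plan_pattern* keys have the shape shown above; rebuild_pattern is a hypothetical helper, and the demonstration metadata uses itertools.repeat as a stand-in pattern function so that it runs without NumPy.

```python
import importlib

def rebuild_pattern(md):
    """Look up the recorded pattern function and call it with the recorded arguments."""
    module = importlib.import_module(md['plan_pattern_module'])
    func = getattr(module, md['plan_pattern'])
    return func(**md['plan_pattern_args'])

# Stand-in metadata (not produced by any built-in plan): a "trajectory"
# that stays at the same position for three points.
demo_md = {
    'plan_pattern': 'repeat',
    'plan_pattern_module': 'itertools',
    'plan_pattern_args': {'object': 2.5, 'times': 3},
}

positions = list(rebuild_pattern(demo_md))
```

With the scan() metadata shown above, the same lookup would resolve to numpy.linspace, provided NumPy is installed. Because the function name comes from stored metadata, this approach is only appropriate for metadata you trust.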

3. Automatically

For each run, the RunEngine automatically records:

  • ‘time’ — In this context, the start time. (Other times are also recorded.)
  • ‘uid’ — a globally unique ID for this run
  • ‘plan_name’ — the function or class name of plan (e.g., ‘count’)
  • ‘plan_type’ — the Python type of plan (e.g., ‘generator’)

The last two can be overridden by any of the methods above. The first two cannot be overridden by the user.

Note

If some custom plan does not specify a ‘plan_name’ and ‘plan_type’, the RunEngine infers them as follows:

plan_type = type(plan).__name__
plan_name = getattr(plan, '__name__', '')

These may be more or less informative depending on what plan is. They are just heuristics to provide some information by default if the plan itself and the user do not provide it.
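
To see what these heuristics produce, they can be tried on ordinary Python objects. In this sketch the plan is a toy generator that yields a placeholder string instead of real bluesky messages, so it illustrates the inference only.

```python
def plan():
    yield 'first message'  # placeholder; a real plan yields bluesky messages

gen = plan()
plan_type = type(gen).__name__             # the Python type of the plan object
plan_name = getattr(gen, '__name__', '')   # generators carry their function's name

# An object without a __name__ attribute falls back to an empty string.
steps = ['first message']
list_type = type(steps).__name__
list_name = getattr(steps, '__name__', '')

print(plan_type, plan_name, list_type, repr(list_name))
```

For the generator this gives a plan_type of 'generator' and a plan_name of 'plan'. For the list, the type is 'list' and the name is empty, which is why providing these keys explicitly is more informative.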

4. Interactively, for Repeated Use

Each time a plan is executed, the current contents of RE.md are copied into the metadata for all runs generated by the plan. To enter metadata once to reuse on all plans, add it to RE.md.

RE.md['proposal_id'] = 123456
RE.md['project'] = 'flying cars'
RE.md['dimensions'] = (5, 3, 10)

View its current contents,

RE.md

delete a key you want to stop using,

del RE.md['project']   # delete a key

or use any of the standard methods that apply to dictionaries in Python.
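
As a brief illustration of those dictionary methods, the sketch below uses a plain dict named md as a stand-in for RE.md. The snapshot line is a simplified model of the copy made when a plan is executed, not the RunEngine's actual code.

```python
md = {}  # stand-in for RE.md

md['proposal_id'] = 123456
md.update(project='flying cars', dimensions=(5, 3, 10))

# Model of the copy taken at execution time: later edits to the
# stash do not change metadata that was already copied.
snapshot = dict(md)

removed = md.pop('project', None)   # like del, but no KeyError if the key is absent
md.setdefault('operator', 'Dan')    # sets the key only if it is missing
```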

The scan_id, an integer that the RunEngine automatically increments at the beginning of each scan, is stored in RE.md['scan_id'].

Warning

Clearing all keys, like so:

RE.md.clear()  # clear *all* keys

will reset the scan_id. The next time a plan is executed, the RunEngine will start with a scan_id of 1 and set

RE.md['scan_id'] = 1

Some readers may prefer to reset the scan ID to 1 at the beginning of a new experiment; others may wish to maintain a single unbroken sequence of scan IDs forever.

From a technical standpoint, it is fine to have duplicate scan IDs. All runs also have a randomly generated ‘uid’ (“unique ID”), which is globally unique forever.
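
The relationship between the two identifiers can be modeled in a few lines. This sketch assumes that the counter lives in the metadata stash under 'scan_id' and that the uid is a random UUID string. start_run is a hypothetical helper, not part of bluesky.

```python
import uuid

def start_run(md):
    """Bump the scan_id kept in the stash and draw a fresh random uid."""
    md['scan_id'] = md.get('scan_id', 0) + 1
    return {'scan_id': md['scan_id'], 'uid': str(uuid.uuid4())}

md = {}                  # stand-in for RE.md
first = start_run(md)
second = start_run(md)

md.clear()               # clears the counter along with everything else
third = start_run(md)    # scan_id starts over at 1; the uid is still new
```

After the stash is cleared, the first and third runs share a scan_id, but their uid values differ, so each run can still be identified unambiguously.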

Persistence Between Sessions

We provide a way to save the contents of the metadata stash RE.md between sessions (e.g., exiting and re-opening IPython).

In general, the RE.md attribute may be anything that supports the dictionary interface. The simplest is just a plain Python dictionary.

RE.md = {}

To persist metadata between sessions, bluesky recommends bluesky.utils.PersistentDict — a Python dictionary synced with a directory of files on disk. Any changes made to RE.md are synced to the file, so the contents of RE.md can persist between sessions.

from bluesky.utils import PersistentDict
RE.md = PersistentDict('some/path/here')

Bluesky does not provide a strong recommendation on that path; that is a detail left to the local deployment.
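
To make the idea of a dictionary-like object that survives between sessions concrete, here is a toy write-through mapping backed by a single JSON file. It is a sketch for illustration only: JsonFileDict is a hypothetical class, it does not reflect how PersistentDict is implemented, and PersistentDict remains the recommended choice.

```python
import json
import os
import tempfile
from collections.abc import MutableMapping

class JsonFileDict(MutableMapping):
    """Hypothetical dictionary that rewrites a JSON file on every change."""

    def __init__(self, path):
        self._path = path
        self._data = {}
        if os.path.exists(path):
            with open(path) as f:
                self._data = json.load(f)

    def _flush(self):
        with open(self._path, 'w') as f:
            json.dump(self._data, f)

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value
        self._flush()

    def __delitem__(self, key):
        del self._data[key]
        self._flush()

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

path = os.path.join(tempfile.mkdtemp(), 'md.json')

first_session = JsonFileDict(path)
first_session['proposal_id'] = 123456
first_session['scan_id'] = 7

# Opening the same file again simulates starting a new session.
second_session = JsonFileDict(path)
restored = dict(second_session)
```

One limitation of this toy version is that JSON has no tuple type, so a tuple stored in one session would come back as a list in the next.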

Bluesky formerly recommended using HistoryDict, a Python dictionary backed by a SQLite database file. This approach proved problematic with the threading introduced in bluesky v1.6.0, so it is no longer recommended. If you have been following that recommendation, you should migrate your metadata from HistoryDict to PersistentDict. First, update your configuration to make RE.md a PersistentDict as shown above. Then, migrate like so:

from bluesky.utils import get_history
old_md = get_history()
RE.md.update(old_md)

Warning

The RE.md object can also be set when the RunEngine is instantiated:

# This:
RE = RunEngine(...)

# is equivalent to this:
RE = RunEngine({})
RE.md = ...

As we stated at the start of the tutorial, if you are using bluesky at a user facility or with shared configuration, your RE may already be configured, and defining a new RE as above can result in data loss! If you aren’t sure, it’s safer to use RE.md = ....

Allowed Data Types

Custom metadata keywords can be mapped to:

  • strings — e.g., task='calibration'
  • numbers — e.g., attempt=5
  • lists or tuples — e.g., dimensions=[1, 3]
  • (nested) dictionaries — e.g., dimensions={'width': 1, 'height': 3}
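
The types listed above can be checked recursively before a value is used as metadata. This sketch uses a hypothetical helper, is_allowed_md_value, and covers only the types in the list. It also assumes that dictionary keys are strings, which the text does not state. It is not the check that bluesky itself performs.

```python
def is_allowed_md_value(value):
    """Return True if value is built only from the types listed above."""
    if isinstance(value, (str, int, float)):
        return True
    if isinstance(value, (list, tuple)):
        return all(is_allowed_md_value(item) for item in value)
    if isinstance(value, dict):
        # Assumption: keys must be strings.
        return all(
            isinstance(key, str) and is_allowed_md_value(item)
            for key, item in value.items()
        )
    return False

examples = {
    'task': 'calibration',
    'attempt': 5,
    'dimensions': [1, 3],
    'box': {'width': 1, 'height': 3, 'tags': ('a', 'b')},
}
all_ok = all(is_allowed_md_value(v) for v in examples.values())
set_ok = is_allowed_md_value({1, 3})  # a set is not in the list above
```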

Special Fields

Arbitrary custom fields are allowed — you can invent any names that are useful to you.

But certain fields are given special significance by bluesky’s document model, and are either disallowed or required to be a certain type.

The fields:

  • owner
  • group
  • project

are optional but, to facilitate searchability, if they are not blank they must be strings. A non-string, like owner=5, will produce an error that will interrupt scan execution immediately after it starts.

Similarly, the keyword sample has special significance. It must be either a string or a dictionary.

The scan_id field is expected to be an integer, and it is automatically incremented between runs. If a scan_id is not provided by the user or stashed in the persistent metadata from the previous run, it defaults to 1.

The fields:

  • uid
  • time

are reserved by the document model and cannot be set by the user.
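
The rules in this section can be collected into a pre-flight check that reports problems before a scan is started. The following is a sketch under the rules as stated here: check_special_fields is a hypothetical helper, and the actual enforcement is done by bluesky's document model.

```python
def check_special_fields(md):
    """Return a list of messages describing violations of the special-field rules."""
    problems = []
    for key in ('owner', 'group', 'project'):
        if key in md and not isinstance(md[key], str):
            problems.append(f"{key!r} must be a string")
    if 'sample' in md and not isinstance(md['sample'], (str, dict)):
        problems.append("'sample' must be a string or a dictionary")
    if 'scan_id' in md and not isinstance(md['scan_id'], int):
        problems.append("'scan_id' must be an integer")
    for key in ('uid', 'time'):
        if key in md:
            problems.append(f"{key!r} is reserved and cannot be set by the user")
    return problems

good = check_special_fields({'owner': 'Dan', 'sample': {'name': 'A'}, 'scan_id': 3})
bad = check_special_fields({'owner': 5, 'sample': ['A'], 'scan_id': '12', 'uid': 'abc'})
```

Returning a list instead of raising on the first violation lets the user fix every problem in one pass.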

Required Fields

In current versions of bluesky, no fields are universally required by bluesky itself. It is possible to specify your own required fields in local configuration. See Metadata Validator. (At NSLS-II, there are facility-wide requirements coming soon.)

Metadata Validator

Additional, customized metadata validation can be added to the RunEngine. For example, to ensure that a run will not be executed unless the parameter ‘sample_number’ is specified, define a function that accepts a dictionary argument and raises if ‘sample_number’ is not found.

def ensure_sample_number(md):
    if 'sample_number' not in md:
        raise ValueError("You forgot the sample number.")

Apply this function by setting

RE.md_validator = ensure_sample_number

The function will be executed immediately before each new run is opened.
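
When several fields are required, a small factory can build the validator function. This is a sketch: require_keys is a hypothetical helper, and the only thing it relies on is the contract described above, namely a function that accepts a dictionary and raises if something is missing. It is exercised here on plain dictionaries, without a RunEngine.

```python
def require_keys(*keys):
    """Build a validator that raises if any of the given keys is missing."""
    def validator(md):
        missing = [key for key in keys if key not in md]
        if missing:
            raise ValueError("Missing required metadata: " + ", ".join(missing))
    return validator

check = require_keys('sample_number', 'operator')
# With a RunEngine available, this could be installed with:
# RE.md_validator = check

check({'sample_number': 3, 'operator': 'Dan'})  # passes silently

try:
    check({'operator': 'Dan'})
except ValueError as err:
    message = str(err)
```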