Recording Metadata#
Capturing useful metadata is the main objective of bluesky. The more information you can provide about what you are doing and why you are doing it, the more useful bluesky and downstream data search and analysis tools can be.
The term “metadata” can be controversial: one scientist’s “data” is another’s “metadata”, and the classification is context-dependent. The same information can be “data” in one experiment but “metadata” in a different experiment done on the exact same hardware. The Document Model provides a framework for deciding _where_ to record a particular piece of information.
There are some things that we know a priori before doing an experiment: Where are we? Who is the user? What sample are we looking at? What did the user just ask us to do? These are all things that we can, in principle, know independent of the control system, which makes them the prime candidates for inclusion in the Start Document. Downstream, DataBroker provides tools to do rich searches on this data. The more information you can include, the better.
There is some information that is nominally independent of any particular device but that we must consult the control system to obtain, for example the locations of important but un-scanned motors or the configuration of beam attenuators. If the values should be fixed over the course of the experiment, then the device is a good candidate for being a “baseline device”, either via the Supplemental pre-processor or explicitly in custom plans. This will put the readings in a separate stream (which is a peer to the “primary” data). In principle, these values could be read from the control system once and put into the Start document along with the a priori information; however, that has several drawbacks:
There is only ever one reading of the values, so if they drift during data acquisition, you will never know.
We cannot automatically capture information about the device like we do for data in Events. This includes things like the datatype, units, and shape of the value and any configuration information about the hardware it is being read from.
A third class of information that can be called “metadata” is the configuration of pieces of hardware. These are things like the velocity of a motor or the integration time of a detector. These readings are embedded in the Descriptor and are extracted from the hardware via the read_configuration method of the hardware. We expect that these values will not change over the course of the experiment, so they are read only once.
Information that does not fall into one of these categories, because you expect it to change during the experiment, should be treated as “data”, either as an explicit part of the experimental plan or via Monitoring.
Adding to the Start Document#
When the RunEngine mints a Start document, it includes structured data. That information can be injected via several mechanisms:
entered interactively by the user at execution time
provided in the code of the plan
automatically inferred
entered by user once and stashed for reuse on all future plans
If there is a conflict between these sources, the entry higher in this list wins. The “closer” to the user the information originated, the higher its priority.
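One way to picture this precedence rule is with the standard library's collections.ChainMap, which returns the value from the first mapping that contains a key. The sketch below only illustrates the rule as stated above; it is not a description of RunEngine internals, and the four dictionaries and their contents are made up for the example.

```python
from collections import ChainMap

# Hypothetical metadata from each source, listed from highest to lowest priority.
interactive = {"purpose": "test"}                             # RE(plan(), purpose='test')
from_plan = {"purpose": "calibration", "plan_name": "count"}  # md=... inside the plan
inferred = {"plan_name": "plan", "plan_type": "generator"}    # filled in automatically
stashed = {"operator": "Dan", "purpose": "survey"}            # entered once for reuse

# ChainMap looks keys up left to right, so earlier mappings win.
start_md = dict(ChainMap(interactive, from_plan, inferred, stashed))

print(start_md["purpose"])    # test
print(start_md["plan_name"])  # count
print(start_md["operator"])   # Dan
```

Keys that appear in only one source, such as operator here, pass through unchanged.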
1. Interactively, for One Use#
Suppose we are executing some custom plan called plan.
RE(plan())
If we give arbitrary extra keyword arguments to RE, they will be interpreted as metadata.
RE(plan(), sample_id='A', purpose='calibration', operator='Dan')
The run(s) (i.e., datasets) generated by plan() will include the custom metadata:
...
'sample_id': 'A',
'purpose': 'calibration',
'operator': 'Dan',
...
If plan generates more than one run, all the runs will get this metadata.
For example, this plan generates three different runs.
from bluesky.plans import count, scan
from ophyd.sim import det, motor  # simulated detector and motor
def plan():
    yield from count([det])
    yield from scan([det], motor, 1, 5, 5)
    yield from count([det])
If executed as above:
RE(plan(), sample_id='A', purpose='calibration', operator='Dan')
each run will get a copy of the sample_id, purpose and operator metadata.
2. Through a plan#
Revisiting the previous example:
def plan():
    yield from count([det])
    yield from scan([det], motor, 1, 5, 5)
    yield from count([det])
we can pass different metadata for each run. Every built-in pre-assembled plan accepts a parameter md, which you can use to inject metadata that applies only to that plan.
def plan():
    yield from count([det], md={'purpose': 'calibration'})  # one
    yield from scan([det], motor, 1, 5, 5, md={'purpose': 'good data'})  # two
    yield from count([det], md={'purpose': 'sanity check'})  # three
The metadata passed into RE is combined with the metadata passed in to each plan. Thus, calling
RE(plan(), sample_id='A', operator='Dan')
generates these three sets of metadata:
# one
...
'sample_id': 'A',
'purpose': 'calibration',
'operator': 'Dan',
...
# two
...
'sample_id': 'A',
'purpose': 'good data',
'operator': 'Dan',
...
# three
...
'sample_id': 'A',
'purpose': 'sanity check',
'operator': 'Dan',
...
If there is a conflict, RE keywords take precedence. So
RE(plan(), purpose='test')
would override the individual ‘purpose’ metadata from the plan, marking all three as purpose=test.
If you define your own plans, it is best practice to have them take a keyword-only argument md=None. This allows the hard-coded metadata to be overridden later:
def plan(*, md=None):
    md = md or {}  # handle the default case
    # unpacking **md at the end means it "wins":
    # if the user calls
    #     yield from plan(md={'purpose': 'bob'})
    # it will override these values
    yield from count([det], md={'purpose': 'calibration', **md})
    yield from scan([det], motor, 1, 5, 5, md={'purpose': 'good data', **md})
    yield from count([det], md={'purpose': 'sanity check', **md})
This is consistent with all of the Pre-assembled Plans.
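The override behavior relies only on how Python merges dictionaries: when the same key appears more than once in a dict display, the last occurrence wins. A minimal sketch, using a hypothetical helper build_md that stands in for one step of the plan above:

```python
def build_md(md=None):
    """Return the metadata one step of the plan would pass along."""
    md = md or {}  # treat None as "no extra metadata"
    # The literal key comes first, so anything in md replaces it.
    return {"purpose": "calibration", **md}

print(build_md())                     # {'purpose': 'calibration'}
print(build_md({"purpose": "bob"}))   # {'purpose': 'bob'}
print(build_md({"sample_id": "A"}))   # {'purpose': 'calibration', 'sample_id': 'A'}
```

Putting **md first instead would make the hard-coded value win, which is usually not what a caller expects.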
For more on injecting metadata via plans, refer to this section of the tutorial.
Note
All of the built-in plans provide certain metadata automatically. Custom plans are not required to provide any of this, but it is a nice pattern to follow.
plan_name: e.g., 'scan'
detectors: a list of the names of the detectors
motors: a list of the names of the motors
plan_args: dict of keyword arguments passed to the plan
plan_pattern: function used to create the trajectory
plan_pattern_module: Python module where plan_pattern is defined
plan_pattern_args: dict of keyword arguments passed to plan_pattern to create the trajectory
The plan_name and plan_args together should provide sufficient information to recreate the plan. The detectors and motors are convenient keys to search on later.
The plan_pattern* entries provide lower-level, more explicit information about the trajectory (“pattern”) generated by the plan, separate from the specific detectors and motors involved. For complex trajectories like spirals, this is especially useful. As a simple example, here is the pattern-related metadata for scan().
...
'plan_pattern': 'linspace',
'plan_pattern_module': 'numpy',
'plan_pattern_args': dict(start=start, stop=stop, num=num)
...
Thus, one can re-create the “pattern” (trajectory) like so:
numpy.linspace(**dict(start=start, stop=stop, num=num))
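The same lookup can be done generically from the three plan_pattern* keys. The helper recreate_pattern below is a hypothetical sketch, not part of bluesky. To keep it runnable without numpy, the demo record points at a standard-library function; a record produced by scan() would name numpy.linspace instead, as described above.

```python
import importlib

def recreate_pattern(md):
    """Look up the pattern function named in the metadata and call it."""
    module = importlib.import_module(md["plan_pattern_module"])
    func = getattr(module, md["plan_pattern"])
    return func(**md["plan_pattern_args"])

# Stand-in record (made up for this sketch) that names a stdlib function.
demo_md = {
    "plan_pattern": "repeat",
    "plan_pattern_module": "itertools",
    "plan_pattern_args": {"object": 1.5, "times": 3},
}
print(list(recreate_pattern(demo_md)))  # [1.5, 1.5, 1.5]
```

Because this calls whatever function the record names, it should only be used on metadata you trust.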
3. Automatically#
For each run, the RunEngine automatically records:
‘time’: in this context, the start time. (Other times are also recorded.)
‘uid’: a globally unique ID for this run
‘plan_name’: the function or class name of plan (e.g., ‘count’)
‘plan_type’: the Python type of plan (e.g., ‘generator’)
The last two can be overridden by any of the methods above. The first two cannot be overridden by the user.
Note
If some custom plan does not specify a ‘plan_name’ and ‘plan_type’, the RunEngine infers them as follows:
plan_name = getattr(plan, '__name__', '')
plan_type = type(plan).__name__
These may be more or less informative depending on what plan is. They are just heuristics to provide some information by default if the plan itself and the user do not provide it.
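To see what these two expressions evaluate to in plain Python, here is a small sketch with a stand-in generator function called my_plan (a made-up name). It does not use the RunEngine at all.

```python
def my_plan():
    """A stand-in plan; any generator function behaves the same way here."""
    yield "message"

plan = my_plan()  # calling the function creates a generator object

func_name = getattr(plan, "__name__", "")  # name of the function that made it
type_name = type(plan).__name__            # name of the Python type

print(func_name)  # my_plan
print(type_name)  # generator

# A plain list of messages has no __name__, so the fallback '' is used.
list_name = getattr(["message"], "__name__", "")
print(repr(list_name))             # ''
print(type(["message"]).__name__)  # list
```

This is why the inferred values can be uninformative for plans that are not simple generator functions.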
4. Interactively, for Repeated Use#
Each time a plan is executed, the current contents of RE.md are copied into the metadata for all runs generated by the plan. To enter metadata once to reuse on all plans, add it to RE.md.
RE.md['proposal_id'] = 123456
RE.md['project'] = 'flying cars'
RE.md['dimensions'] = (5, 3, 10)
View its current contents,
RE.md
delete a key you want to stop using,
del RE.md['project'] # delete a key
or use any of the standard methods that apply to dictionaries in Python.
Warning
In general we recommend against putting device readings in the Start document. (The Start document is for who/what/why/when, things you know before you start communicating with hardware.) It is especially critical that you do not put device readings in the RE.md dictionary. The value will remain until you change it and will not track the state of the hardware. This will result in recording out-of-date, incorrect data! This can be particularly dangerous if RE.md is backed by a persistent data store (see next section) because out-of-date readings will last across sessions.
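The problem can be reproduced with nothing but a dictionary. In this sketch, FakeMotor is a hypothetical stand-in for a device (it is not an ophyd class) and md plays the role of RE.md.

```python
class FakeMotor:
    """Hypothetical stand-in for a device; not an ophyd class."""

    def __init__(self, position):
        self.position = position

motor = FakeMotor(position=1.0)
md = {}  # plays the role of RE.md

# Stashing a reading copies the value as it is right now.
md["motor_position"] = motor.position

motor.position = 7.5  # the hardware moves later

print(md["motor_position"])  # 1.0 (stale)
print(motor.position)        # 7.5
```

Every later run would record the stale 1.0 until someone remembers to update or delete the key.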
The scan_id, an integer that the RunEngine automatically increments at the beginning of each scan, is stored in RE.md['scan_id'].
Warning
Clearing all keys, like so:
RE.md.clear() # clear *all* keys
will reset the scan_id. The next time a plan is executed, the RunEngine will start with a scan_id of 1 and set
RE.md['scan_id'] = 1
Some readers may prefer to reset the scan ID to 1 at the beginning of a new experiment; others may wish to maintain a single unbroken sequence of scan IDs forever.
From a technical standpoint, it is fine to have duplicate scan IDs. All runs also have a randomly generated ‘uid’ (“unique ID”), which is globally unique forever.
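The counting and reset behavior described above can be modeled with a plain dictionary. The helper next_scan_id is hypothetical and only mimics the described behavior; it is not the RunEngine's code.

```python
def next_scan_id(md):
    """Advance and return the scan_id stored in md, starting from 1."""
    md["scan_id"] = md.get("scan_id", 0) + 1
    return md["scan_id"]

md = {"operator": "Dan"}  # plays the role of RE.md
ids = [next_scan_id(md) for _ in range(3)]
print(ids)  # [1, 2, 3]

md.clear()  # removes scan_id along with everything else
after_clear = next_scan_id(md)
print(after_clear)  # 1
```

To drop other keys while keeping the sequence going, delete those keys individually instead of clearing the whole dictionary.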
Persistence Between Sessions#
We provide a way to save the contents of the metadata stash RE.md between sessions (e.g., exiting and re-opening IPython).
In general, the RE.md attribute may be anything that supports the dictionary interface. The simplest is just a plain Python dictionary.
RE.md = {}
To persist metadata between sessions, bluesky recommends bluesky.utils.PersistentDict, a Python dictionary synced with a directory of files on disk. Any changes made to RE.md are synced to the files, so the contents of RE.md can persist between sessions.
from bluesky.utils import PersistentDict
RE.md = PersistentDict('some/path/here')
Bluesky does not provide a strong recommendation on that path; that is a detail left to the local deployment.
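To make “anything that supports the dictionary interface” concrete, here is a toy mapping that rewrites a single JSON file on every change. The class JSONFileDict is hypothetical and is not how PersistentDict works; it is a sketch of the idea only, with no locking or thread safety.

```python
import json
import os
import tempfile
from collections.abc import MutableMapping


class JSONFileDict(MutableMapping):
    """Hypothetical dict-like object that rewrites one JSON file on every change."""

    def __init__(self, path):
        self._path = path
        self._data = {}
        if os.path.exists(path):
            with open(path) as f:
                self._data = json.load(f)

    def _flush(self):
        with open(self._path, "w") as f:
            json.dump(self._data, f)

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value
        self._flush()

    def __delitem__(self, key):
        del self._data[key]
        self._flush()

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


path = os.path.join(tempfile.mkdtemp(), "md.json")
md = JSONFileDict(path)        # first "session"
md["proposal_id"] = 123456
md.update(project="flying cars")

reopened = JSONFileDict(path)  # second "session" reads the same file
print(dict(reopened))  # {'proposal_id': 123456, 'project': 'flying cars'}
```

Note that a JSON round trip turns tuples into lists, so a toy like this would not preserve every value exactly.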
Bluesky formerly recommended using HistoryDict, a Python dictionary backed by a sqlite database file. This approach proved problematic with the threading introduced in bluesky v1.6.0, so it is no longer recommended. If you have been following that recommendation, you should migrate your metadata from HistoryDict to PersistentDict. First, update your configuration to make RE.md a PersistentDict as shown above. Then, migrate like so:
from bluesky.utils import get_history
old_md = get_history()
RE.md.update(old_md)
The PersistentDict object has been back-ported to bluesky v1.5.6 as well. It is not available in 1.4.x or older, so once you move to the new system, you must run bluesky v1.5.6 or higher.
Warning
The RE.md object can also be set when the RunEngine is instantiated:
# This:
RE = RunEngine(...)
# is equivalent to this:
RE = RunEngine({})
RE.md = ...
As we stated at the start of the tutorial, if you are using bluesky at a user facility or with shared configuration, your RE may already be configured, and defining a new RE as above can result in data loss! If you aren’t sure, it’s safer to use RE.md = ... instead.
Allowed Data Types#
Custom metadata keywords can be mapped to:
strings, e.g., task='calibration'
numbers, e.g., attempt=5
lists or tuples, e.g., dimensions=[1, 3]
(nested) dictionaries, e.g., dimensions={'width': 1, 'height': 3}
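If you want to check a value before passing it along, a recursive test against the types listed above is easy to write. The function is_allowed is a hypothetical helper that mirrors only this list; bluesky itself may accept or reject other types, and the requirement that dictionary keys be strings is an assumption of this sketch.

```python
from numbers import Number

def is_allowed(value):
    """Return True if value is built only from the types listed above."""
    if isinstance(value, (str, Number)):
        return True
    if isinstance(value, (list, tuple)):
        return all(is_allowed(item) for item in value)
    if isinstance(value, dict):
        # Assumption of this sketch: dictionary keys must be strings.
        return all(isinstance(k, str) and is_allowed(v) for k, v in value.items())
    return False

print(is_allowed("calibration"))               # True
print(is_allowed([1, 3]))                      # True
print(is_allowed({"width": 1, "height": 3}))   # True
print(is_allowed({1, 3}))                      # False (a set is not in the list)
```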
Required Fields#
The fields:
uid
time
are reserved by the document model and cannot be set by the user.
In current versions of bluesky, no fields are universally required by bluesky itself. It is possible to specify your own required fields in local configuration. See Validation. (At NSLS-II, there are facility-wide requirements coming soon.)
Special Fields#
Arbitrary custom fields are allowed — you can invent any names that are useful to you.
But certain fields are given special significance by bluesky’s document model, and are either disallowed or required to be a certain type.
The fields:
owner
group
project
are optional but, to facilitate searchability, if they are not blank they must be strings. A non-string, like owner=5, will produce an error that will interrupt scan execution immediately after it starts.
Similarly, the keyword sample has special significance. It must be either a string or a dictionary.
The scan_id field is expected to be an integer, and it is automatically incremented between runs. If a scan_id is not provided by the user or stashed in the persistent metadata from the previous run, it defaults to 1.
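These type rules can be checked ahead of time so that a mistake is caught before a scan starts. The function check_special_fields below is a hypothetical sketch that restates the rules in this section; it is not bluesky's own validation code, and treating None or an empty string as “blank” is an assumption.

```python
def check_special_fields(md):
    """Raise TypeError if a special field has a type the rules above exclude."""
    for key in ("owner", "group", "project"):
        value = md.get(key)
        # Assumption: a missing key, None, or '' counts as blank.
        if value is not None and value != "" and not isinstance(value, str):
            raise TypeError(f"{key!r} must be a string, got {type(value).__name__}")
    if "sample" in md and not isinstance(md["sample"], (str, dict)):
        raise TypeError("'sample' must be a string or a dictionary")
    if "scan_id" in md and not isinstance(md["scan_id"], int):
        raise TypeError("'scan_id' must be an integer")

check_special_fields({"owner": "Dan", "sample": {"name": "A"}, "scan_id": 3})  # passes

try:
    check_special_fields({"owner": 5})
except TypeError as err:
    print(err)  # 'owner' must be a string, got int
```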
Validation#
Additional, customized metadata validation can be added to the RunEngine. For example, to ensure that a run will not be executed unless the parameter ‘sample_number’ is specified, define a function that accepts a dictionary argument and raises if ‘sample_number’ is not found.
def ensure_sample_number(md):
    if 'sample_number' not in md:
        raise ValueError("You forgot the sample number.")
Apply this function by setting
RE.md_validator = ensure_sample_number
The function will be executed immediately before each new run is opened.
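Since the validator is an ordinary function that takes a dictionary, it can be exercised directly, and several checks can be combined into one. In this sketch, ensure_operator and combine_validators are hypothetical names introduced for illustration.

```python
def ensure_sample_number(md):
    if "sample_number" not in md:
        raise ValueError("You forgot the sample number.")

def ensure_operator(md):  # hypothetical second rule
    if not md.get("operator"):
        raise ValueError("You forgot the operator.")

def combine_validators(*validators):
    """Return one validator that runs each given validator in order."""
    def validate(md):
        for validator in validators:
            validator(md)
    return validate

validate = combine_validators(ensure_sample_number, ensure_operator)

validate({"sample_number": 7, "operator": "Dan"})  # passes silently

try:
    validate({"operator": "Dan"})
except ValueError as err:
    print(err)  # You forgot the sample number.
```

The combined function could then be applied in the same way as above, with RE.md_validator = validate.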