User Documentation

Important

DataBroker release 1.0 includes support for old-style “v1” usage and new-style “v2” usage. This section addresses databroker’s new “v2” usage. It is still under development and subject to change in response to user feedback.

For the stable "v1" usage, see Version 1 Interface. See Transition Plan for more information.

Walkthrough

Find a Catalog

When databroker is first imported, it searches for Catalogs on your system, typically provided by a Python package or configuration file that you or an administrator installed.

In [1]: from databroker import catalog

In [2]: list(catalog)
Out[2]: ['csx', 'chx', 'isr', 'xpd', 'sst', 'bmm', 'lix']

Each entry is a Catalog that databroker discovered on this system. In this example, we find Catalogs corresponding to different instruments/beamlines. We can access a subcatalog with square brackets, like accessing an item in a dictionary.

In [3]: catalog['csx']
Out[3]: <Intake catalog: csx>

List the entries in the ‘csx’ Catalog.

In [4]: list(catalog['csx'])
Out[4]: ['raw']

We see a Catalog for raw data. Let's access it and assign it to a variable for convenience.

In [5]: raw = catalog['csx']['raw']

This Catalog contains all the raw data taken at CSX. It holds many entries, as we can see by checking len(raw), so listing it would take a while. Instead, we'll look up entries by name or by search.
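For example:

len(raw)  # the total number of runs in the Catalog, typically a large number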

Note

As an alternative to list(...), try using tab-completion to view your options. Typing catalog[' and then hitting the TAB key will list the available entries.

Also, these shortcuts can save a little typing.

# These three lines are equivalent.
catalog['csx']['raw']
catalog['csx', 'raw']
catalog.csx.raw  # only works if the entry names are valid Python identifiers

Look up a Run by ID

Suppose you know the unique ID of a run (a.k.a. "scan") that you want to access. The first several characters of the ID will do; usually 6-8 characters are enough to identify a given run uniquely.

In [6]: run = raw[uid]  # where uid is some string like '17531ace'

Each run also has a scan_id. The scan_id is usually easier to remember (it’s a counting number, not a random string) but it may not be globally unique. If there are collisions, you’ll get the most recent match, so the unique ID is better as a long-term reference.

In [7]: run = raw[1]
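To recap, here are the two lookup styles side by side, with '17531ace' standing in for a real ID:

run = raw['17531ace']  # the first 6-8 characters of a unique ID
run = raw[1]           # a scan_id; returns the most recent match if not unique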

Search for Runs

Suppose you want to sift through multiple runs to examine a range of datasets.

In [8]: query = {'proposal_id': 12345}  # or, equivalently, dict(proposal_id=12345)

In [9]: search_results = raw.search(query)

The result, search_results, is itself a Catalog.

In [10]: search_results
Out[10]: <Intake catalog: search results>

We can quickly check how many results it contains

In [11]: len(search_results)
Out[11]: 5

and, if we want, list them.

In [12]: list(search_results)
Out[12]: 
['68ab26a0-639e-4eac-a2b5-328c09bb0b96',
 'b0c3aa92-fbdd-4415-94a8-328860c97ef3',
 'e2788b16-b922-42a7-a549-04b81c83543f',
 '11fd6081-cc40-4c46-a74f-ccdaa5bf785a',
 '9e86b8dc-2150-4aea-9286-601d8a090799']

Because searching on a Catalog returns another Catalog, we can refine our search by searching search_results in turn. In this example we'll use a helper, TimeRange, to build our query.

In [13]: from databroker.queries import TimeRange

In [14]: query = TimeRange(since='2019-09-01', until='2040')

In [15]: search_results.search(query)
Out[15]: <Intake catalog: search results>

Other, more sophisticated queries are possible, such as filtering for scans that include more than 50 points.

search_results.search({'num_points': {'$gt': 50}})

See MongoQuerySelectors for more.
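Because each search returns a new Catalog, searches can be chained to combine criteria. As a minimal sketch, combining the queries from above:

from databroker.queries import TimeRange

results = raw.search({'proposal_id': 12345})
results = results.search(TimeRange(since='2019-09-01'))
results = results.search({'num_points': {'$gt': 50}})  # more than 50 points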

Once we have a result catalog that we are happy with, we can list the entries via list(search_results), access them individually by name, as in search_results[SOME_UID], or loop through them:

In [16]: for uid, run in search_results.items():
   ....:     ...
   ....: 
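For example, a minimal sketch that reads the 'primary' table (see Access Data below) from each run, assuming every run in the results has a 'primary' stream:

for uid, run in search_results.items():
    ds = run.primary.read()  # assumes a 'primary' stream; see "Access Data"
    print(uid, ds)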

Access Data

Suppose we have a run of interest.

In [17]: run = raw[uid]

A given run contains multiple logical tables. The number of these tables and their names vary by experiment, but two common ones are

  • ‘primary’, the main data of interest, such as a time series of images

  • ‘baseline’, readings taken at the beginning and end of the run for alignment and sanity-check purposes

To explore a run, we can open its entry by calling it like a function with no arguments:

In [18]: run()  # or, equivalently, run.get()
Out[18]: 
Run Catalog
  uid='9e86b8dc-2150-4aea-9286-601d8a090799'
  exit_status='success'
  2019-11-12 09:13:56.478 -- 2019-11-12 09:13:56.488
  Streams:
    * baseline
    * primary

We can also use tab-completion, as in run[' followed by TAB, to see the contents. That is, the Run is yet another Catalog, and its contents are the logical tables of data. Finally, let's read one of these tables.

In [19]: ds = run.primary.read()

In [20]: ds
Out[20]: 
<xarray.Dataset>
Dimensions:                   (dim_0: 10, dim_1: 10, time: 3)
Coordinates:
  * time                      (time) float64 1.574e+09 1.574e+09 1.574e+09
Dimensions without coordinates: dim_0, dim_1
Data variables:
    img                       (time, dim_0, dim_1) float64 1.0 1.0 ... 1.0 1.0
    motor                     (time) float64 -1.0 0.0 1.0
    motor_setpoint            (time) float64 -1.0 0.0 1.0
    img:img                   (time) <U38 'aabc0bb1-8b96-4b07-983d-def5881a4bb1/0' ... 'aabc0bb1-8b96-4b07-983d-def5881a4bb1/0'
    motor:motor_velocity      (time) int64 1 1 1
    motor:motor_acceleration  (time) int64 1 1 1
    seq_num                   (time) int64 1 2 3
    uid                       (time) <U36 '82ffcc1e-2023-4711-b661-126c34a9697f' ... 'ac9f51a9-ffa1-4d44-a15d-672c3a8b01dc'

This is an xarray.Dataset. You can access specific columns

In [21]: ds['img']
Out[21]: 
<xarray.DataArray 'img' (time: 3, dim_0: 10, dim_1: 10)>
array([[[1., 1., ..., 1., 1.],
        [1., 1., ..., 1., 1.],
        ...,
        [1., 1., ..., 1., 1.],
        [1., 1., ..., 1., 1.]],

       [[1., 1., ..., 1., 1.],
        [1., 1., ..., 1., 1.],
        ...,
        [1., 1., ..., 1., 1.],
        [1., 1., ..., 1., 1.]],

       [[1., 1., ..., 1., 1.],
        [1., 1., ..., 1., 1.],
        ...,
        [1., 1., ..., 1., 1.],
        [1., 1., ..., 1., 1.]]])
Coordinates:
  * time     (time) float64 1.574e+09 1.574e+09 1.574e+09
Dimensions without coordinates: dim_0, dim_1

do mathematical operations

In [22]: ds.mean()
Out[22]: 
<xarray.Dataset>
Dimensions:                   ()
Data variables:
    img                       float64 1.0
    motor                     float64 0.0
    motor_setpoint            float64 0.0
    motor:motor_velocity      float64 1.0
    motor:motor_acceleration  float64 1.0
    seq_num                   float64 2.0

make quick plots

In [23]: ds['motor'].plot()
Out[23]: [<matplotlib.lines.Line2D at 0x7fec9e377f60>]
[Figure: line plot of the 'motor' data against time]

and much more. See the documentation on xarray.
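Other streams can be read in the same way. For example, a minimal sketch reading this run's 'baseline' readings:

baseline = run.baseline.read()  # an xarray.Dataset of start/end-of-run readings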

If the data is large, it can be convenient to access it lazily, deferring the actual network or disk I/O. To do this, replace read() with to_dask(). You still get back an xarray.Dataset, but it contains placeholders that fetch the data in chunks and only as needed, rather than greedily pulling all of it into memory from the start.

In [24]: ds = run.primary.to_dask()

In [25]: ds
Out[25]: 
<xarray.Dataset>
Dimensions:                   (dim_0: 10, dim_1: 10, time: 3)
Coordinates:
  * time                      (time) float64 1.574e+09 1.574e+09 1.574e+09
Dimensions without coordinates: dim_0, dim_1
Data variables:
    img                       (time, dim_0, dim_1) float64 1.0 1.0 ... 1.0 1.0
    motor                     (time) float64 -1.0 0.0 1.0
    motor_setpoint            (time) float64 -1.0 0.0 1.0
    img:img                   (time) <U38 'aabc0bb1-8b96-4b07-983d-def5881a4bb1/0' ... 'aabc0bb1-8b96-4b07-983d-def5881a4bb1/0'
    motor:motor_velocity      (time) int64 1 1 1
    motor:motor_acceleration  (time) int64 1 1 1
    seq_num                   (time) int64 1 2 3
    uid                       (time) <U36 '82ffcc1e-2023-4711-b661-126c34a9697f' ... 'ac9f51a9-ffa1-4d44-a15d-672c3a8b01dc'

See the documentation on dask.
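With a dask-backed Dataset, operations like mean() build up a lazy task graph instead of computing immediately. A minimal sketch, assuming ds came from to_dask() as above:

lazy = ds['img'].mean()   # no I/O yet; this builds a task graph
result = lazy.compute()   # loads the data in chunks and computes the mean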


Replay Document Stream

Bluesky is built around a streaming-friendly representation of data and metadata. (See event-model.) To access the run as a chronological stream of the documents that were emitted during data acquisition, use the canonical() method.

In [26]: run.canonical(fill='yes')
Out[26]: <generator object BlueskyRun.canonical at 0x7fec9e321840>

This generator yields (name, doc) pairs and can be fed into streaming visualization, processing, and serialization tools that consume this representation, such as those provided by bluesky.

The keyword argument fill is required. Its allowed values are 'yes' (numpy arrays), 'no' (Datum IDs), and 'delayed' (dask arrays, still under development).
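For example, a minimal sketch that prints the name of each document as it is emitted:

for name, doc in run.canonical(fill='yes'):
    print(name)  # e.g. 'start', 'descriptor', 'event', 'stop'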