Find Runs in a Catalog

In this tutorial we will:

Look up a specific Run by some identifier.
Look up a specific Run based on recency (e.g. “Show me the data I just took”).
Search for Runs using both simple and complex search queries.

Set up for Tutorial

Before you begin, install databroker and databroker-pack, following the Installation Tutorial.

Start your favorite interactive Python environment, such as ipython or jupyter lab.

For this tutorial, we’ll use a catalog of publicly available, openly licensed sample data. Specifically, it is high-quality transmission XAS data from all over the periodic table.

The following utility downloads the sample data and makes it discoverable to Databroker.

In [1]: import databroker.tutorial_utils

In [2]: databroker.tutorial_utils.fetch_BMM_example()
Out[2]: bluesky-tutorial-BMM:
  args:
    name: bluesky-tutorial-BMM
    paths:
    - /home/runner/.local/share/bluesky_tutorial_data/bluesky-tutorial-BMM/documents/*.msgpack
    root_map: {}
  description: ''
  driver: databroker._drivers.msgpack.BlueskyMsgpackCatalog
  metadata:
    catalog_dir: /home/runner/.local/share/intake/
    generated_by:
      library: databroker_pack
      version: 0.3.0
    relative_paths:
    - ./documents/*.msgpack

Access the catalog and assign it to a variable for convenience.

In [3]: import databroker

In [4]: catalog = databroker.catalog['bluesky-tutorial-BMM']

Look-up

In this section we will look up a Run by its

Globally unique identifier — unmemorable, but great for scripts
Counting-number “scan ID” — easier to remember, but not necessarily unique
Recency — e.g. “the data I just took”

If you know exactly which Run you are looking for, the surest way to get it is to look it up by its globally unique identifier, its “uid”. This is the recommended way to look up runs in scripts but it is not especially fluid for interactive use.

In [5]: catalog['c07e765b-ce5c-4c75-a16e-06f66546c1d4']
Out[5]: 
BlueskyRun
  uid='c07e765b-ce5c-4c75-a16e-06f66546c1d4'
  exit_status='success'
  2020-03-07 10:13:25.108 -- 2020-03-07 10:24:58.551
  Streams:
    * baseline
    * primary

The uid may be abbreviated. The first 7 or 8 characters are usually sufficient to uniquely identify an entry.

In [6]: catalog['c07e765']
Out[6]: 
BlueskyRun
  uid='c07e765b-ce5c-4c75-a16e-06f66546c1d4'
  exit_status='success'
  2020-03-07 10:13:25.108 -- 2020-03-07 10:24:58.551
  Streams:
    * baseline
    * primary

If the abbreviated uid is ambiguous—if it matches more than one Run—a ValueError is raised listing the matches. Try catalog['a'], which will match two Runs in this Catalog and raise that error.
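The abbreviated-uid behavior described above can be sketched in plain Python. This is a toy model, not Databroker's implementation: the function name `lookup_by_partial_uid`, the two uids beginning with "a", and the choice of `KeyError` for "no match" are all assumptions made for illustration.

```python
def lookup_by_partial_uid(uids, partial):
    """Return the one uid that starts with `partial`.

    Raises KeyError if nothing matches (an assumption of this sketch) and
    ValueError listing the candidates if more than one uid matches.
    """
    matches = [uid for uid in uids if uid.startswith(partial)]
    if not matches:
        raise KeyError(partial)
    if len(matches) > 1:
        raise ValueError(f"Partial uid {partial!r} is ambiguous; it matches {matches}")
    return matches[0]


# The first uid is from this tutorial; the two starting with "a" are made up.
uids = [
    "c07e765b-ce5c-4c75-a16e-06f66546c1d4",
    "a1b2c3d4-0000-0000-0000-000000000001",
    "a9f8e7d6-0000-0000-0000-000000000002",
]

print(lookup_by_partial_uid(uids, "c07e765"))

try:
    lookup_by_partial_uid(uids, "a")
except ValueError as err:
    print("ValueError:", err)
```

The longer the abbreviation, the less likely it is to collide, which is why 7 or 8 characters are usually enough.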

Runs typically also have a counting number identifier, dubbed scan_id. This is easier to remember. Keep in mind that scan_id is not necessarily unique, and Databroker will always give you the most recent match. Some users are in the habit of resetting scan_id to 1 at the beginning of a new experiment or operating cycle. This is why lookup based on the globally unique identifier is safest for scripts and Jupyter notebooks, especially long-lived ones.

In [7]: catalog[23463]
Out[7]: 
BlueskyRun
  uid='4393404b-8986-4c75-9a64-d7f6949a9344'
  exit_status='success'
  2020-03-07 10:29:49.483 -- 2020-03-07 10:41:20.546
  Streams:
    * baseline
    * primary

Finally, it is often convenient to access data by recency, as in “the data that I just took”.

In [8]: catalog[-1]
Out[8]: 
BlueskyRun
  uid='12a63104-f8e1-4491-9f3e-e03a30575e33'
  exit_status='success'
  2020-03-09 00:44:03.191 -- 2020-03-09 00:54:38.510
  Streams:
    * baseline
    * primary

This syntax is meant to feel similar to accessing elements in a list or array in Python, where a[-N] means “N elements from the end of a”.

In summary:

`catalog["..."]`	Globally unique identifier (“uid”)
`catalog[N]`	Counting number “scan ID” N (most recent match)
`catalog[-N]`	Nth most recent Run in the Catalog

All of these always return one BlueskyRun or raise an exception.
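To make the three lookup rules concrete, here is a minimal sketch of how a key could be dispatched by its type and sign. It is a toy model under stated assumptions, not Databroker's code: `lookup` and the `runs` list are hypothetical, runs are assumed to be ordered oldest to newest, and uid abbreviation is left out for brevity.

```python
def lookup(runs, key):
    """Look up one run in `runs`, a list of dicts ordered oldest to newest."""
    if isinstance(key, str):
        # String: treat as a full uid (abbreviations are omitted in this sketch).
        for run in runs:
            if run["uid"] == key:
                return run
        raise KeyError(key)
    if isinstance(key, int):
        if key < 0:
            # Negative integer: count back from the most recent run.
            return runs[key]  # raises IndexError if there are too few runs
        # Non-negative integer: scan_id, where the most recent match wins.
        for run in reversed(runs):
            if run["scan_id"] == key:
                return run
        raise KeyError(key)
    raise TypeError(f"Unsupported key type: {type(key).__name__}")


# Hypothetical runs; scan_id was reset to 1 before the last run was taken.
runs = [
    {"uid": "aaaa", "scan_id": 1},
    {"uid": "bbbb", "scan_id": 2},
    {"uid": "cccc", "scan_id": 1},
]

print(lookup(runs, 1)["uid"])       # scan_id 1 is ambiguous; the newest match is returned
print(lookup(runs, -2)["uid"])      # second most recent run
print(lookup(runs, "aaaa")["uid"])  # the uid always identifies exactly one run
```

The first call shows why scan_id is risky in long-lived scripts: after the reset, the older run with scan_id 1 can only be reached by its uid.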

Search

Common search queries can be done with a high-level Python interface.

In [9]: from databroker.queries import TimeRange

In [10]: results = catalog.search(TimeRange(since="2020-03-05"))

The result of a search is just another Catalog. It has a subset of the original Catalog’s entries. We can compare the number of search results to the total number of entries in catalog.

In [11]: print(f"Results: {len(results)}  Total: {len(catalog)}")
Results: 61  Total: 123
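As a rough mental model, a "since" filter keeps the Runs that started on or after a cutoff date. The sketch below assumes each Run's start document carries a numeric `time` field in seconds since the epoch, and it interprets the date in UTC so that the result is deterministic. The helper `since_filter` and the sample documents are hypothetical, and how `TimeRange` itself handles time zones is not covered here.

```python
from datetime import datetime, timezone


def since_filter(start_docs, since):
    """Keep start documents whose 'time' is on or after the date `since` (UTC)."""
    cutoff = (
        datetime.strptime(since, "%Y-%m-%d")
        .replace(tzinfo=timezone.utc)
        .timestamp()
    )
    return [doc for doc in start_docs if doc["time"] >= cutoff]


def utc_timestamp(*args):
    """Build an epoch timestamp from (year, month, day, hour, ...) in UTC."""
    return datetime(*args, tzinfo=timezone.utc).timestamp()


# Made-up start documents, reduced to the two fields this sketch needs.
start_docs = [
    {"uid": "early", "time": utc_timestamp(2020, 3, 4, 12)},
    {"uid": "on-the-day", "time": utc_timestamp(2020, 3, 5, 0)},
    {"uid": "later", "time": utc_timestamp(2020, 3, 7, 10)},
]

kept = since_filter(start_docs, "2020-03-05")
print([doc["uid"] for doc in kept])
```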

We can iterate through the results for batch processing

In [12]: for uid, run in results.items():
   ....:     ...
   ....: 

or access a particular result by using any of the lookup methods in the section above, such as recency. This is a convenient way to quickly look at one search result.

In [13]: results[-1]
Out[13]: 
BlueskyRun
  uid='12a63104-f8e1-4491-9f3e-e03a30575e33'
  exit_status='success'
  2020-03-09 00:44:03.191 -- 2020-03-09 00:54:38.510
  Streams:
    * baseline
    * primary

Because results is just another Catalog, we can search on the search results to progressively narrow our results.

In [14]: narrowed_results = results.search({"num_points": {"$gt": 400}})  # Read on...

In [15]: print(f"Narrowed Results: {len(narrowed_results)}  Results: {len(results)}  Total: {len(catalog)}")
Narrowed Results: 57  Results: 61  Total: 123
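The idea that a search returns another Catalog, which can itself be searched or iterated, can be illustrated with a small stand-in class. `ToyCatalog` is hypothetical and takes a Python predicate where Databroker takes a query; the entries are made up.

```python
class ToyCatalog:
    """A stand-in for a catalog: maps uid -> start document (a dict)."""

    def __init__(self, entries):
        self._entries = dict(entries)

    def __len__(self):
        return len(self._entries)

    def items(self):
        return self._entries.items()

    def search(self, predicate):
        """Return a new ToyCatalog with only the entries where predicate(start) is true."""
        kept = {uid: start for uid, start in self._entries.items() if predicate(start)}
        return ToyCatalog(kept)


catalog = ToyCatalog({
    "run-a": {"time": 10, "num_points": 300},
    "run-b": {"time": 20, "num_points": 450},
    "run-c": {"time": 30, "num_points": 500},
    "run-d": {"time": 5, "num_points": 600},
})

results = catalog.search(lambda start: start["time"] >= 10)
narrowed = results.search(lambda start: start["num_points"] > 400)

print(f"Narrowed: {len(narrowed)}  Results: {len(results)}  Total: {len(catalog)}")

# Batch processing works the same way on any of the three objects.
for uid, start in narrowed.items():
    print(uid, start["num_points"])
```

Because every search returns the same kind of object, chaining two searches gives the entries that satisfy both conditions, and the original catalog is left unchanged.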

Custom queries can be done with the MongoDB query language. The simplest examples check for equality of a key and value, as in

In [16]: results = catalog.search({"XDI.Element.symbol": "Mn"})

In [17]: len(results)
Out[17]: 6

The above matches Runs where the ‘start’ document looks like:

{
    ...
    "XDI": {"Element": {"symbol": "Mn"}},
    ...
}
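The dotted key in the query walks into nested dictionaries one level per dot. A minimal sketch of that matching rule follows; `get_nested` and `matches_equality` are hypothetical helpers written for illustration, and the sample start documents contain only the keys shown above.

```python
def get_nested(doc, dotted_key):
    """Follow 'a.b.c' through nested dicts; raise KeyError if any level is missing."""
    value = doc
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            raise KeyError(dotted_key)
        value = value[part]
    return value


def matches_equality(doc, query):
    """True if every dotted key in `query` exists in `doc` with an equal value."""
    for key, expected in query.items():
        try:
            if get_nested(doc, key) != expected:
                return False
        except KeyError:
            return False
    return True


mn_start = {"XDI": {"Element": {"symbol": "Mn"}}}
fe_start = {"XDI": {"Element": {"symbol": "Fe"}}}
bare_start = {}  # a Run recorded without any XDI metadata

query = {"XDI.Element.symbol": "Mn"}
print(matches_equality(mn_start, query))
print(matches_equality(fe_start, query))
print(matches_equality(bare_start, query))
```

Only the first document matches. A Run that lacks the key entirely is simply excluded from the results rather than causing an error.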

The allowed keys are totally open-ended as far as Databroker is concerned. This example is particular to the metadata recorded by the instrument that it came from. What’s useful in your case will depend on what metadata was provided when the data was captured. Look at a couple of Runs’ start documents to get a sense of the metadata that would be useful in searches.

run = catalog[-1]
run.metadata["start"]

Again, the syntax of a query is that of the MongoDB query language. It’s an expressive language for specifying searches over heterogeneous metadata.

Note

When the data is stored by some means other than MongoDB, databroker uses Python libraries that support most of MongoDB’s query language without actual MongoDB.

Here is an example of a more sophisticated query, doing more than just checking for equality.

In [18]: query = {
   ....:     "XDI.Scan.edge_energy": {"$lte": 6539.0},  # less than or equal to
   ....:     "XDI.Element.symbol": "Mn",
   ....: }
   ....: 

In [19]: results = catalog.search(query)

In [20]: len(results)
Out[20]: 6

See the MongoDB query language documentation to learn about other operators like $lte.
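To show how a query that mixes equality with a comparison operator is evaluated, here is a small sketch covering only `$gt`, `$gte`, `$lt`, and `$lte`. It is not a MongoDB implementation and not the library Databroker uses; `resolve`, `matches`, and the sample start documents are hypothetical.

```python
import operator

# Only the four ordering comparisons are handled in this sketch.
COMPARISONS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}

_MISSING = object()  # sentinel for "key not present in the document"


def resolve(doc, dotted_key):
    """Follow a dotted key through nested dicts, returning _MISSING if absent."""
    value = doc
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return _MISSING
        value = value[part]
    return value


def matches(doc, query):
    """True if `doc` satisfies every condition in `query`."""
    for key, condition in query.items():
        value = resolve(doc, key)
        if value is _MISSING:
            return False
        if isinstance(condition, dict):
            # A dict of operators, e.g. {"$lte": 6539.0}; all must hold.
            if not all(COMPARISONS[op](value, arg) for op, arg in condition.items()):
                return False
        elif value != condition:
            # A plain value means equality.
            return False
    return True


query = {
    "XDI.Scan.edge_energy": {"$lte": 6539.0},
    "XDI.Element.symbol": "Mn",
}

# Made-up start documents, reduced to the keys the query touches.
start_docs = [
    {"XDI": {"Element": {"symbol": "Mn"}, "Scan": {"edge_energy": 6539.0}}},
    {"XDI": {"Element": {"symbol": "Fe"}, "Scan": {"edge_energy": 7112.0}}},
    {"XDI": {"Element": {"symbol": "Mn"}}},  # no edge energy recorded
]

print([matches(doc, query) for doc in start_docs])
```

All conditions in one query dictionary must hold at once, so a document has to pass both the element check and the energy check to match.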