Find Runs in a Catalog
In this tutorial we will:
Look up a specific Run by some identifier.
Look up a specific Run based on recency (e.g. “Show me the data I just took”).
Search for Runs using both simple and complex search queries.
Set up for Tutorial
Before you begin, install databroker and databroker-pack, following the Installation Tutorial.
Start your favorite interactive Python environment, such as ipython or jupyter lab.
For this tutorial, we’ll use a catalog of publicly available, openly licensed sample data. Specifically, it is high-quality transmission XAS data from all over the periodic table.
This utility downloads it and makes it discoverable to Databroker.
In [1]: import databroker.tutorial_utils
In [2]: databroker.tutorial_utils.fetch_BMM_example()
Out[2]:
bluesky-tutorial-BMM:
  args:
    name: bluesky-tutorial-BMM
    paths:
    - /home/runner/.local/share/bluesky_tutorial_data/bluesky-tutorial-BMM/documents/*.msgpack
    root_map: {}
  description: ''
  driver: databroker._drivers.msgpack.BlueskyMsgpackCatalog
  metadata:
    catalog_dir: /home/runner/.local/share/intake/
    generated_by:
      library: databroker_pack
      version: 0.3.0
    relative_paths:
    - ./documents/*.msgpack
Access the catalog and assign it to a variable for convenience.
In [3]: import databroker
In [4]: catalog = databroker.catalog['bluesky-tutorial-BMM']
Look-up
In this section we will look up a Run by its
Globally unique identifier — unmemorable, but great for scripts
Counting-number “scan ID” — easier to remember, but not necessarily unique
Recency — e.g. “the data I just took”
If you know exactly which Run you are looking for, the surest way to get it is to look it up by its globally unique identifier, its “uid”. This is the recommended way to look up runs in scripts but it is not especially fluid for interactive use.
In [5]: catalog['c07e765b-ce5c-4c75-a16e-06f66546c1d4']
Out[5]:
BlueskyRun
uid='c07e765b-ce5c-4c75-a16e-06f66546c1d4'
exit_status='success'
2020-03-07 10:13:25.108 -- 2020-03-07 10:24:58.551
Streams:
* baseline
* primary
The uid may be abbreviated. The first 7 or 8 characters are usually sufficient to uniquely identify an entry.
In [6]: catalog['c07e765']
Out[6]:
BlueskyRun
uid='c07e765b-ce5c-4c75-a16e-06f66546c1d4'
exit_status='success'
2020-03-07 10:13:25.108 -- 2020-03-07 10:24:58.551
Streams:
* baseline
* primary
If the abbreviated uid is ambiguous (if it matches more than one Run), a ValueError is raised listing the matches. Try catalog['a'], which will match two Runs in this Catalog and raise that error.
Runs typically also have a counting-number identifier, dubbed scan_id. This is easier to remember. Keep in mind that scan_id is not necessarily unique, and Databroker will always give you the most recent match. Some users are in the habit of resetting scan_id to 1 at the beginning of a new experiment or operating cycle. This is why lookup based on the globally unique identifier is safest for scripts and Jupyter notebooks, especially long-lived ones.
In [7]: catalog[23463]
Out[7]:
BlueskyRun
uid='4393404b-8986-4c75-9a64-d7f6949a9344'
exit_status='success'
2020-03-07 10:29:49.483 -- 2020-03-07 10:41:20.546
Streams:
* baseline
* primary
Finally, it is often convenient to access data by recency, as in “the data that I just took”.
In [8]: catalog[-1]
Out[8]:
BlueskyRun
uid='12a63104-f8e1-4491-9f3e-e03a30575e33'
exit_status='success'
2020-03-09 00:44:03.191 -- 2020-03-09 00:54:38.510
Streams:
* baseline
* primary
This syntax is meant to feel similar to accessing elements in a list or array in Python, where a[-N] means “N elements from the end of a”.
In summary:
| catalog[uid] | Globally unique identifier (“uid”) |
| catalog[N] | Counting number “scan ID” N (most recent match) |
| catalog[-N] | Nth most recent Run in the Catalog |
All of these always return one BlueskyRun or raise an exception.
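The three lookup modes can be pictured with a small pure-Python model. This is only an illustrative sketch, not Databroker's implementation: the lookup function, the plain list of dicts standing in for a Catalog, and the shortened uids are all made up for the example.

```python
def lookup(runs, key):
    """Toy model of catalog[key]. `runs` is a list of dicts, oldest first."""
    if isinstance(key, str):
        # String: treat it as a full or abbreviated uid.
        matches = [run for run in runs if run["uid"].startswith(key)]
        if not matches:
            raise KeyError(key)
        if len(matches) > 1:
            uids = [run["uid"] for run in matches]
            raise ValueError(f"Partial uid {key!r} matches multiple runs: {uids}")
        return matches[0]
    if key < 0:
        # Negative integer: count back from the most recent run.
        return runs[key]
    # Non-negative integer: treat it as a scan_id, most recent match wins.
    matches = [run for run in runs if run["scan_id"] == key]
    if not matches:
        raise KeyError(key)
    return matches[-1]


# Hypothetical data: scan_id was reset to 1 before the third run.
runs = [
    {"uid": "a1f30000", "scan_id": 1},
    {"uid": "a9c20000", "scan_id": 2},
    {"uid": "c07e0000", "scan_id": 1},
]

by_uid = lookup(runs, "c07")    # abbreviated uid
by_scan_id = lookup(runs, 1)    # most recent run with scan_id 1
most_recent = lookup(runs, -1)  # last run in the list
```

In this toy data all three calls return the same run. Calling lookup(runs, "a") raises ValueError because two uids start with "a", which mirrors the ambiguity described above.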
Search
Common search queries can be done with a high-level Python interface.
In [9]: from databroker.queries import TimeRange
In [10]: results = catalog.search(TimeRange(since="2020-03-05"))
The result of a search is just another Catalog. It has a subset of the original Catalog’s entries. We can compare the number of search results to the total number of entries in catalog.
In [11]: print(f"Results: {len(results)} Total: {len(catalog)}")
Results: 61 Total: 123
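Conceptually, a time-range search compares each Run's start time against the requested bounds. The sketch below models that idea with plain dicts; it is not how TimeRange is implemented. It assumes the start document stores its time as seconds since the epoch under a "time" key, interprets date strings as UTC, and treats the upper bound as exclusive. The helper names to_timestamp and in_time_range are hypothetical.

```python
from datetime import datetime, timezone


def to_timestamp(date_string):
    """Parse an ISO date string as UTC and return seconds since the epoch."""
    parsed = datetime.fromisoformat(date_string)
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def in_time_range(start_doc, since=None, until=None):
    """Return True if the start time satisfies since <= time < until."""
    t = start_doc["time"]
    if since is not None and t < to_timestamp(since):
        return False
    if until is not None and t >= to_timestamp(until):
        return False
    return True


# Hypothetical start documents, reduced to the two keys used here.
start_docs = [
    {"uid": "aaa", "time": to_timestamp("2020-03-01")},
    {"uid": "bbb", "time": to_timestamp("2020-03-07")},
]

recent = [doc["uid"] for doc in start_docs if in_time_range(doc, since="2020-03-05")]
```

Only the second document falls on or after 2020-03-05, so recent contains just its uid.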
We can iterate through the results for batch processing
In [12]: for uid, run in results.items():
   ....:     ...
   ....:
or access a particular result by using any of the lookup methods in the section above, such as recency. This is a convenient way to quickly look at one search result.
In [13]: results[-1]
Out[13]:
BlueskyRun
uid='12a63104-f8e1-4491-9f3e-e03a30575e33'
exit_status='success'
2020-03-09 00:44:03.191 -- 2020-03-09 00:54:38.510
Streams:
* baseline
* primary
Because results is just another Catalog, we can search on the search results to progressively narrow our results.
In [14]: narrowed_results = results.search({"num_points": {"$gt": 400}}) # Read on...
In [15]: print(f"Narrowed Results: {len(narrowed_results)} Results: {len(results)} Total: {len(catalog)}")
Narrowed Results: 57 Results: 61 Total: 123
Custom queries can be done with the MongoDB query language. The simplest examples check for equality of a key and value, as in
In [16]: results = catalog.search({"XDI.Element.symbol": "Mn"})
In [17]: len(results)
Out[17]: 6
The above matches Runs where the ‘start’ document looks like:
{
    ...
    "XDI": {"Element": {"symbol": "Mn"}},
    ...
}
The allowed keys are completely open-ended as far as Databroker is concerned. This example is particular to the metadata recorded by the instrument that the data came from. What is useful in your case will depend on what metadata was provided when the data was captured. Look at the start documents of a couple of Runs to get a sense of the metadata that would be useful in searches.
run = catalog[-1]
run.metadata["start"]
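When a start document is deeply nested, it can help to list its contents as dotted keys, since that is the form used in queries such as "XDI.Element.symbol". The helper below is a hypothetical convenience, not part of Databroker, and the start dict is a hand-written stand-in for a real start document.

```python
def dotted_keys(document, prefix=""):
    """Yield (dotted_key, value) pairs for every leaf of a nested dict."""
    for key, value in document.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested dicts, extending the dotted path.
            yield from dotted_keys(value, path)
        else:
            yield path, value


# Hand-written stand-in for a start document.
start = {
    "uid": "c07e765b-ce5c-4c75-a16e-06f66546c1d4",
    "XDI": {"Element": {"symbol": "Mn"}},
}

flat = dict(dotted_keys(start))
for key, value in flat.items():
    print(f"{key}: {value}")
```

Assuming run.metadata["start"] behaves like a nested dict, the same helper could be applied to it directly, as in dict(dotted_keys(run.metadata["start"])).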
Again, the syntax of a query is that of the MongoDB query language. It’s an expressive language for specifying searches over heterogeneous metadata.
Note
When the data is stored by some means other than MongoDB, databroker uses Python libraries that support most of MongoDB’s query language without actual MongoDB.
Here is an example of a more sophisticated query, doing more than just checking for equality.
In [18]: query = {
   ....:     "XDI.Scan.edge_energy": {"$lte": 6539.0},  # less than or equal to
   ....:     "XDI.Element.symbol": "Mn",
   ....: }
   ....:
In [19]: results = catalog.search(query)
In [20]: len(results)
Out[20]: 6
See the MongoDB documentation linked above to learn other expressions like $lte.
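To make the semantics of such a query concrete, here is a minimal pure-Python sketch of how a query of this shape could be evaluated against a start document. It is an illustration only: it is neither MongoDB nor the library Databroker uses, it supports just equality and the four comparison operators shown, and the names matches and get_dotted as well as the sample documents are made up for the example.

```python
import operator

# The small subset of comparison operators this sketch understands.
OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}

_MISSING = object()  # sentinel for "key not present"


def get_dotted(document, dotted_key):
    """Follow a dotted key such as 'XDI.Element.symbol' into nested dicts."""
    value = document
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return _MISSING
        value = value[part]
    return value


def matches(document, query):
    """Return True if the document satisfies every condition in the query."""
    for dotted_key, condition in query.items():
        value = get_dotted(document, dotted_key)
        if value is _MISSING:
            return False
        if isinstance(condition, dict):
            # Operator form, e.g. {"$lte": 6539.0}; all operators must hold.
            if not all(OPERATORS[op](value, operand) for op, operand in condition.items()):
                return False
        elif value != condition:
            # Plain value: simple equality check.
            return False
    return True


# Hand-written stand-ins for start documents.
start_docs = [
    {"XDI": {"Element": {"symbol": "Mn"}, "Scan": {"edge_energy": 6539.0}}},
    {"XDI": {"Element": {"symbol": "Fe"}, "Scan": {"edge_energy": 7112.0}}},
]

query = {
    "XDI.Scan.edge_energy": {"$lte": 6539.0},
    "XDI.Element.symbol": "Mn",
}
hits = [doc for doc in start_docs if matches(doc, query)]
```

All conditions in a query must hold at once, so only the first document matches here. Searching a set of results again with a second query has the same effect as combining both sets of conditions into one query.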