What is Databroker’s relationship to Intake?

Intake Concepts

Intake has a notion of Catalogs. Catalogs are roughly dict-like. Iterating over a Catalog yields the names of its entries, which are strings. Iterating over catalog.items() yields (name, Entry) pairs. An Entry is roughly like a functools.partial with metadata and intake-specific semantics. When an Entry is opened, by calling entry.get() or, equivalently and more succinctly, entry(), it returns its content. The content could be another Catalog or a DataSource.
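The semantics can be sketched with toy stand-ins. These are not intake's actual classes (a real intake Catalog adds drivers, caching, and richer metadata); they only illustrate the dict-like and partial-like behavior described above.

```python
class Entry:
    """Roughly a functools.partial with metadata: call it to get content."""
    def __init__(self, factory, metadata=None):
        self._factory = factory
        self.metadata = metadata or {}

    def get(self):
        return self._factory()

    # entry() is equivalent to entry.get()
    __call__ = get


class Catalog:
    """Dict-like: iteration yields entry names; items() yields (name, Entry)."""
    def __init__(self, entries):
        self._entries = dict(entries)

    def __iter__(self):
        return iter(self._entries)      # yields names (strings)

    def __getitem__(self, name):
        return self._entries[name]

    def items(self):
        return self._entries.items()    # yields (name, Entry) pairs


# Hypothetical entry name and content, for illustration only.
catalog = Catalog({'air_quality': Entry(lambda: [1.2, 3.4], {'units': 'ppm'})})
names = list(catalog)                   # ['air_quality']
content = catalog['air_quality']()      # opening the Entry returns its content
```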

Calling .read() on a DataSource returns some in-memory representation, such as a numpy array, pandas DataFrame, or xarray.Dataset. Calling .to_dask() returns the “lazy” dask-backed equivalent structure.
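The eager-versus-lazy distinction can be illustrated without dask at all. Here is a toy analogy, with made-up class names, in which read() does the work immediately and the to_dask()-style path defers it until .compute() is called:

```python
class Lazy:
    """Minimal stand-in for a dask collection: defers work until .compute()."""
    def __init__(self, func):
        self._func = func

    def compute(self):
        return self._func()


class ToySource:
    """Stand-in for a DataSource wrapping a potentially expensive loader."""
    def __init__(self, loader):
        self._loader = loader

    def read(self):
        return self._loader()       # eager: load into memory now

    def to_dask(self):
        return Lazy(self._loader)   # lazy: load only when computed


source = ToySource(lambda: list(range(5)))
eager = source.read()               # loaded immediately
lazy = source.to_dask()             # nothing loaded yet
assert lazy.compute() == eager
```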

DataBroker Concepts

DataBroker represents a Bluesky “Event Stream”, a logical table of data, as a DataSource, BlueskyEventStream. Calling BlueskyEventStream.read() returns an xarray Dataset backed by numpy arrays; calling BlueskyEventStream.to_dask() returns an xarray Dataset backed by dask arrays.

DataBroker represents a Bluesky Run, sometimes loosely referred to as a “scan”, as a Catalog of Event Streams, BlueskyRun. For example, the entries in a BlueskyRun might have the names 'primary' and 'baseline'. The entries always contain instances of BlueskyEventStream. BlueskyRun extends the standard Catalog interface with a special method, BlueskyRun.documents(). This returns a generator that yields (name, doc) pairs, recreating the stream of documents that would have been emitted during data acquisition. (This is akin to Header.documents() in DataBroker v0.x.)
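A consumer of such a generator typically dispatches on the document name. The sketch below uses a fake generator with made-up document contents, standing in for what BlueskyRun.documents() would yield; the uids and data values are purely illustrative.

```python
def fake_documents():
    """Stand-in for run.documents(): yields (name, doc) pairs in emission order."""
    yield ('start', {'uid': 'abc123', 'time': 0.0})
    yield ('descriptor', {'uid': 'def456', 'name': 'primary'})
    yield ('event', {'seq_num': 1, 'data': {'det': 5.0}})
    yield ('stop', {'exit_status': 'success'})


# A typical consumer dispatches on the document name.
events = [doc for name, doc in fake_documents() if name == 'event']
```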

BlueskyEventStream and BlueskyRun should never be instantiated by the user. They have complex signatures, and they are agnostic to the storage mechanism; they could be backed by objects in memory, files, or databases.

Continuing to move up the hierarchy, we get to catalogs whose Entries contain BlueskyRun instances. Each entry’s name is the corresponding RunStart uid. The Catalogs at this level of the hierarchy include:

Notice that these are located in an internal package, _drivers. Except for testing purposes, they should never be directly imported. They should be accessed by their name from intake’s driver registry as in:

import intake
cls = intake.registry['bluesky-jsonl-catalog']

At some point in the future, once the internal APIs stabilize, these classes and their specific dependencies (msgpack, pymongo, etc.) will be moved out of databroker into separate packages. Avoid directly importing from _drivers so that this change will not break your code.

Scaling Intake Catalogs

To make Catalogs scale to tens of thousands of entries, override the methods:

  • __iter__

  • __getitem__

  • __contains__

  • __len__

A simple intake Catalog populates an internal dictionary, Catalog._entries, mapping entry names to LocalCatalogEntry objects. This approach does not scale to catalogs with a large number of entries, where merely populating the keys of the Catalog._entries dict is expensive. To customize the type of _entries, override Catalog._make_entries_container() and return a dict-like object. This object must support iteration (looping through part or all of the catalog in order) and random access (requesting a specific entry by name) by implementing __iter__ and __getitem__ respectively.

It should also implement __contains__ because, similarly, if __contains__ is not specifically implemented, Python will fall back to iterating through all the entries and checking each in turn. It is likely more efficient to implement a __contains__ method that uses __getitem__ to determine whether a given key is contained.

Finally, the Catalog itself should implement __len__. If it is not implemented, intake may obtain a Catalog’s length by iterating through it entirely, which may be costly. If a more efficient approach is possible (e.g. a COUNT query) it should be implemented.
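Putting the four methods together, here is a sketch of a scalable entries container. The `db` object and its `iter_names`, `find_one`, and `count` methods are hypothetical stand-ins for a real backend (e.g. a MongoDB collection issuing find and COUNT queries); in a real driver, Catalog._make_entries_container() would return an instance of such a class.

```python
class ScalableEntries:
    """Dict-like entries container that queries a backend on demand,
    instead of materializing every entry up front."""

    def __init__(self, db):
        self._db = db

    def __iter__(self):
        # Stream names from the backend rather than loading them all.
        yield from self._db.iter_names()

    def __getitem__(self, name):
        doc = self._db.find_one(name)   # random access by name
        if doc is None:
            raise KeyError(name)
        return doc

    def __contains__(self, name):
        # One lookup via __getitem__, instead of iterating everything.
        try:
            self[name]
        except KeyError:
            return False
        return True

    def __len__(self):
        return self._db.count()         # e.g. a COUNT query, not iteration


class InMemoryDB:
    """Tiny in-memory stand-in for the hypothetical backend, for demonstration."""
    def __init__(self, docs):
        self._docs = dict(docs)

    def iter_names(self):
        return iter(self._docs)

    def find_one(self, name):
        return self._docs.get(name)

    def count(self):
        return len(self._docs)


entries = ScalableEntries(InMemoryDB({'uid1': {'plan_name': 'count'}}))
```

With a real database behind it, each operation costs one query (or a streamed cursor) rather than a full scan of the catalog.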