.. currentmodule:: databroker.core

What is Databroker's relationship to Intake?
============================================

Intake Concepts
---------------

Intake has a notion of Catalogs. Catalogs are roughly dict-like. Iterating
over a Catalog yields the names of its entries, which are strings. Iterating
over ``catalog.items()`` yields ``(name, Entry)`` pairs.

An Entry is roughly like a ``functools.partial`` with metadata and
intake-specific semantics. When an Entry is opened, by calling ``entry.get()``
or, equivalently and more succinctly, ``entry()``, it returns its content. The
content could be another Catalog or a DataSource. Calling ``.read()`` on a
DataSource returns some in-memory representation, such as a numpy array,
pandas DataFrame, or xarray.Dataset. Calling ``.to_dask()`` returns the "lazy"
dask-backed equivalent structure.

DataBroker Concepts
-------------------

DataBroker represents a Bluesky "Event Stream", a logical table of data, as a
DataSource, :class:`BlueskyEventStream`. Calling
:meth:`BlueskyEventStream.read` returns an xarray Dataset backed by numpy
arrays; calling :meth:`BlueskyEventStream.to_dask` returns an xarray Dataset
backed by dask arrays.

DataBroker represents a Bluesky Run, sometimes loosely referred to as a
"scan", as a Catalog of Event Streams, :class:`BlueskyRun`. For example, the
entries in a :class:`BlueskyRun` might have the names ``'primary'`` and
``'baseline'``. The entries always contain instances of
:class:`BlueskyEventStream`.

:class:`BlueskyRun` extends the standard Catalog interface with a special
method, :meth:`BlueskyRun.documents`. This returns a generator that yields
``(name, doc)`` pairs, recreating the stream of documents that would have been
emitted during data acquisition. (This is akin to ``Header.documents()`` in
DataBroker v0.x.)

:class:`BlueskyEventStream` and :class:`BlueskyRun` should never be
instantiated by the user. They have complex signatures, and they are agnostic
to the storage mechanism; they could be backed by objects in memory, files, or
databases.

Continuing to move up the hierarchy, we get to catalogs whose Entries contain
:class:`BlueskyRun` instances. Each entry's name is the corresponding RunStart
``uid``. The Catalogs at this level of the hierarchy include:

.. currentmodule:: databroker

* :class:`_drivers.jsonl.BlueskyJSONLCatalog`
* :class:`_drivers.msgpack.BlueskyMsgpackCatalog`
* :class:`_drivers.mongo_normalized.BlueskyMongoCatalog`
* :class:`_drivers.mongo_embedded.BlueskyMongoCatalog`

Notice that these are located in an internal package, ``_drivers``. Except for
testing purposes, they should never be directly imported. They should be
accessed by their name from intake's driver registry, as in:

.. code:: python

    import intake
    cls = intake.registry['bluesky-jsonl-catalog']

At some point in the future, once the internal APIs stabilize, these classes
and their specific dependencies (msgpack, pymongo, etc.) will be moved out of
databroker into separate packages. Avoid directly importing from ``_drivers``
so that this change will not break your code.

Scaling Intake Catalogs
-----------------------

To make Catalogs scale to tens of thousands of entries, override these
methods:

* ``__iter__``
* ``__getitem__``
* ``__contains__``
* ``__len__``

A simple intake Catalog populates an internal dictionary,
``Catalog._entries``, mapping entry names to :class:`LocalCatalogEntry`
objects, as in the sketch below.
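The following is a minimal sketch of that baseline approach, not databroker's
actual implementation. The catalog name ``SimpleCatalog``, the use of intake's
``csv`` driver, and the ``paths`` argument are illustrative assumptions, and
the exact ``LocalCatalogEntry`` signature may differ between intake versions.

.. code:: python

    from intake.catalog import Catalog
    from intake.catalog.local import LocalCatalogEntry


    class SimpleCatalog(Catalog):
        """Toy catalog that eagerly builds one entry per CSV file path."""

        def __init__(self, paths, **kwargs):
            self._paths = paths  # hypothetical list of CSV file paths
            super().__init__(**kwargs)

        def _load(self):
            # Build every LocalCatalogEntry up front and store it in the
            # internal dict that the base Catalog uses for iteration and
            # lookup. Fine for a handful of entries, but the cost grows
            # with the size of the catalog.
            self._entries = {
                f"table_{i}": LocalCatalogEntry(
                    name=f"table_{i}",
                    description="",
                    driver="csv",
                    args={"urlpath": path},
                    catalog=self,
                )
                for i, path in enumerate(self._paths)
            }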
This approach does not scale to catalogs with a large number of entries, where
merely populating the keys of the ``Catalog._entries`` dict is expensive.

To customize the type of ``_entries``, override
:meth:`Catalog._make_entries_container` and return a dict-*like* object. This
object must support iteration (looping through part or all of the catalog in
order) and random access (requesting a specific entry by name) by implementing
``__iter__`` and ``__getitem__`` respectively. It should also implement
``__contains__``: if ``__contains__`` is not specifically implemented, Python
will fall back to iterating through all the entries and checking each in turn.
It is likely more efficient to implement a ``__contains__`` method that uses
``__getitem__`` to determine whether a given key is contained.

Finally, the Catalog itself should implement ``__len__``. If it is not
implemented, intake may obtain a Catalog's length by iterating through it
entirely, which may be costly. If a more efficient approach is possible
(e.g. a COUNT query) it should be implemented.
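Putting these pieces together, here is a hedged sketch of what a scalable
catalog might look like. The names ``LazyEntries`` and ``ScalableCatalog`` are
hypothetical, as is the ``db`` backend object with ``iter_names()``,
``get_entry(name)``, and ``count()`` methods; only
:meth:`Catalog._make_entries_container` and the dunder methods come from the
discussion above.

.. code:: python

    from intake.catalog import Catalog


    class LazyEntries:
        """Dict-like view of entries stored in an external database.

        ``db`` is a hypothetical backend exposing ``iter_names()``,
        ``get_entry(name)``, and ``count()``.
        """

        def __init__(self, db):
            self._db = db

        def __iter__(self):
            # Stream entry names from the backend instead of
            # materializing all of them in memory at once.
            yield from self._db.iter_names()

        def __getitem__(self, name):
            # Fetch a single entry directly (one targeted query).
            entry = self._db.get_entry(name)
            if entry is None:
                raise KeyError(name)
            return entry

        def __contains__(self, name):
            # Without this, Python would fall back to __iter__ and
            # compare every name; a targeted lookup is far cheaper.
            try:
                self[name]
            except KeyError:
                return False
            return True


    class ScalableCatalog(Catalog):
        def __init__(self, db, **kwargs):
            self._db = db
            super().__init__(**kwargs)

        def _make_entries_container(self):
            # Called by the base Catalog to create ``self._entries``;
            # return the lazy, dict-like container instead of a plain dict.
            return LazyEntries(self._db)

        def __len__(self):
            # Use an efficient server-side count rather than iterating.
            return self._db.count()

The key design point is that no method here ever builds the full set of
entries in memory: iteration streams names, lookup and membership tests hit
the backend directly, and the length comes from a count query.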