***************************
Administrator Documentation
***************************

When databroker is imported, it discovers catalogs available on the system.
Users can list the discovered catalogs by importing the special global
``databroker.catalog`` object and listing its entries.

.. code:: python

   from databroker import catalog
   list(catalog)  # a list of strings, names of sub-catalogs

which can be accessed like

.. code:: python

   catalog['SOME_SUB_CATALOG']

DataBroker assembles this list of catalogs by looking for:

1. Old-style "databroker v0.x" YAML configuration files, for
   backward-compatibility
2. Intake-style catalog YAML files, which have different fields
3. Python packages that advertise catalogs via the ``intake.catalogs``
   entrypoint

Old-style databroker configuration files
========================================

DataBroker v0.x used a custom YAML-based configuration file. See
:ref:`v0_configuration`. For backward-compatibility, configuration files
specifying MongoDB storage will be discovered and included in
``databroker.catalog``.

Migrating sqlite or HDF5 storage
--------------------------------

The implementation in ``databroker.v0`` interfaces with storage in MongoDB,
sqlite, or HDF5. The implementations in ``databroker.v1`` and
``databroker.v2`` drop support for sqlite and HDF5 and add support for JSONL_
(newline-delimited JSON) and msgpack_. For binary file-based storage, we
recommend msgpack. Data can be migrated from sqlite or HDF5 to msgpack like
so:

.. code-block:: python

   from databroker import Broker
   import suitcase.msgpack

   # If the config file associated with YOUR_BROKER_NAME specifies sqlite or
   # HDF5 storage, then this will return a databroker.v0.Broker instance.
   db = Broker.named(YOUR_BROKER_NAME)

   # Loop through every run in the old Broker.
   for run in db():
       # Load all the documents out of this run from their existing format
       # and write them into one file located at
       # <DESTINATION_DIRECTORY>/<uid>.msgpack.
       suitcase.msgpack.export(run.documents(), DESTINATION_DIRECTORY)

In the next section, we'll create a "catalog YAML file" to make this data
discoverable by databroker.

Intake-style Catalog YAML Files
===============================

Search Path
-----------

Use the convenience function :func:`catalog_search_path`. Place catalog YAML
files in one of these locations to make them discoverable by intake and, in
turn, by databroker.

.. code:: python

   from databroker import catalog_search_path
   catalog_search_path()  # result will vary depending on OS and environment

Structure
---------

The general structure of a catalog YAML file is a nested dictionary of data
"sources". Each source name is mapped to information for accessing that data,
which includes the name of a "driver" and some keyword arguments to pass to
it. A "driver" is generally associated with a particular storage format.

.. code:: yaml

   sources:
     SOME_NAME:
       driver: SOME_DRIVER
       args:
         SOME_PARAMETER: VALUE
         ANOTHER_PARAMETER: VALUE
     ANOTHER_NAME:
       driver: SOME_DRIVER
       args:
         SOME_PARAMETER: VALUE
         ANOTHER_PARAMETER: VALUE

As shown, multiple sources can be specified in one file. All sources found in
all the YAML files in the search path will be included as top-level entries
in ``databroker.catalog``.

Arguments
---------

All databroker "drivers" accept the following arguments:

* ``handler_registry`` --- If omitted or ``None``, the result of
  :func:`~databroker.core.discover_handlers` is used. See
  :doc:`event-model:external` for background on the role of "handlers".

* ``root_map`` --- This is passed to :func:`event_model.Filler` to account
  for temporarily moved/copied/remounted files. Any resources whose ``root``
  matches a key in ``root_map`` will be loaded using the mapped value in
  ``root_map``.

* ``transforms`` --- A dict that maps any subset of the keys {start, stop,
  resource, descriptor} to a function that accepts a document of the
  corresponding type and returns it, potentially modified.
  This feature is for patching up erroneous metadata. It is intended for
  quick, temporary fixes that may later be applied permanently to the data
  at rest (e.g., via a database migration).

Specific drivers require format-specific arguments, shown in the following
subsections.

Msgpack Example
---------------

Msgpack_ is a binary file format.

.. code:: yaml

   sources:
     ENTRY_NAME:
       driver: bluesky-msgpack-catalog
       args:
         paths:
           - "DESTINATION_DIRECTORY/*.msgpack"

where ``ENTRY_NAME`` is the name of the entry that will appear in
``databroker.catalog``, and ``DESTINATION_DIRECTORY`` is a directory of
msgpack files generated by suitcase-msgpack_, as illustrated in the previous
section. Note that the value of ``paths`` is a list: multiple directories can
be grouped into one "source".

JSONL (Newline-delimited JSON) Example
--------------------------------------

JSONL_ is a text-based format in which each line is a valid JSON object.
Unlike ordinary JSON, it is suitable for streaming. This storage is much
slower than msgpack, but the format is human-readable.

.. code:: yaml

   sources:
     ENTRY_NAME:
       driver: bluesky-jsonl-catalog
       args:
         paths:
           - "DESTINATION_DIRECTORY/*.jsonl"

where ``ENTRY_NAME`` is the name of the entry that will appear in
``databroker.catalog`` and ``DESTINATION_DIRECTORY`` is a directory of
newline-delimited JSON files generated by suitcase-jsonl_. Note that the
value of ``paths`` is a list: multiple directories can be grouped into one
"source".

MongoDB Example
---------------

MongoDB_ is the recommended storage format for large-scale deployments
because it supports fast search.

.. code:: yaml

   sources:
     ENTRY_NAME:
       driver: bluesky-mongo-normalized-catalog
       args:
         metadatastore_db: mongodb://HOST:PORT/MDS_DATABASE_NAME
         asset_registry_db: mongodb://HOST:PORT/ASSETS_DATABASE_NAME

where ``ENTRY_NAME`` is the name of the entry that will appear in
``databroker.catalog``, and the ``mongodb://...`` URIs point to MongoDB
databases with documents inserted by suitcase-mongo_.
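All three YAML examples share the same shape: an entry name, a driver, and
driver-specific ``args``. To make that shape concrete, here is a stdlib-only
Python sketch (``render_catalog`` is a hypothetical helper, not part of
databroker or intake) that renders a one-source catalog file of this form:

```python
from textwrap import indent


def render_catalog(entry_name, driver, **args):
    """Render a one-source intake-style catalog YAML string.

    Illustrative only: real deployments would typically write the YAML
    by hand or with a YAML library. Only flat scalar args are handled.
    """
    arg_lines = "\n".join(f"{key}: {value}" for key, value in args.items())
    body = (
        f"{entry_name}:\n"
        f"  driver: {driver}\n"
        f"  args:\n"
        + indent(arg_lines, "    ")
    )
    return "sources:\n" + indent(body, "  ") + "\n"


# Hypothetical entry name and URIs, mirroring the MongoDB stanza above:
print(render_catalog(
    "mongo_example",
    "bluesky-mongo-normalized-catalog",
    metadatastore_db="mongodb://localhost:27017/mds",
    asset_registry_db="mongodb://localhost:27017/assets",
))
```

Running this prints a ``sources:`` mapping matching the MongoDB example;
saving that text to a file on the catalog search path would make the entry
discoverable.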
The driver's name, ``bluesky-mongo-normalized-catalog``, differentiates it
from the ``bluesky-mongo-embedded-catalog``, an experimental alternative way
of mapping the original bluesky documents into MongoDB documents and
collections. It is still under evaluation and not yet recommended for use in
production.

Python packages
===============

To distribute catalogs to users, it may be more convenient to provide an
installable Python package, rather than placing YAML files in specific
locations on the user's machine. To achieve this, a Python package can
advertise catalog objects using the ``'intake.catalogs'`` entrypoint. Here is
a minimal example:

.. code:: python

   # setup.py
   from setuptools import setup

   setup(name='example',
         entry_points={'intake.catalogs':
                       ['ENTRY_NAME = example:catalog_instance']},
         py_modules=['example'])

.. code:: python

   # example.py

   # Create an object named `catalog_instance`, which is referenced in
   # setup.py and will be discovered by databroker. How the instance is
   # created, and what type of catalog it is, is completely up to the
   # implementation. This is just one possible example.

   import intake

   # Look up a driver class by its name in the registry.
   catalog_class = intake.registry['bluesky-mongo-normalized-catalog']
   catalog_instance = catalog_class(
       metadatastore_db='mongodb://...',
       asset_registry_db='mongodb://...')

The ``entry_points`` parameter in the ``setup(...)`` call is a feature
supported by Python packaging. When this package is installed, a special
file inside the distribution, ``entry_points.txt``, will advertise that it
has catalogs. DataBroker will discover these and add them to
``databroker.catalog``. Note that databroker does *not* need to actually
*import* the package to discover its catalogs. The package will only be
imported if and when the catalog is accessed. Thus, the overhead of this
discovery process is low.

.. important::

   Some critical details of Python's entrypoints feature:

   * Note the unusual syntax of the entrypoints. Each item is given as one
     long string, with the ``=`` as part of the string. Modules are
     separated by ``.``, and the final object name is preceded by ``:``.

   * The right-hand side of the equals sign must point to where the object
     is *actually defined*. If ``catalog_instance`` is defined in
     ``foo/bar.py`` and imported into ``foo/__init__.py``, you might expect
     ``foo:catalog_instance`` to work, but it does not. You must spell out
     ``foo.bar:catalog_instance``.

.. _JSONL: http://jsonlines.org/
.. _msgpack: https://msgpack.org/index.html
.. _suitcase-mongo: https://github.com/bluesky/suitcase-mongo
.. _suitcase-jsonl: https://github.com/bluesky/suitcase-jsonl
.. _suitcase-msgpack: https://github.com/bluesky/suitcase-msgpack
.. _MongoDB: https://www.mongodb.com/
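To make the entrypoint syntax described in the note above concrete, here is
a small stdlib-only sketch (``parse_entrypoint`` is illustrative, not part of
databroker or setuptools; real discovery is handled by the packaging
machinery, e.g. ``importlib.metadata``) showing how an entry string
decomposes into its three parts:

```python
def parse_entrypoint(spec):
    """Split 'NAME = module.path:object' into (name, module, object).

    Illustrative only: mirrors the entrypoint string format, in which
    '=' separates the advertised name from the target, modules are
    separated by '.', and the final object name follows ':'.
    """
    name, _, target = (part.strip() for part in spec.partition('='))
    module, _, obj = target.partition(':')
    return name, module, obj


print(parse_entrypoint('ENTRY_NAME = example:catalog_instance'))
# -> ('ENTRY_NAME', 'example', 'catalog_instance')

# The object must be referenced where it is actually defined:
print(parse_entrypoint('ENTRY_NAME = foo.bar:catalog_instance'))
# -> ('ENTRY_NAME', 'foo.bar', 'catalog_instance')
```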