Configuration Reference¶
This covers configuration for both “old” (v0) and “new” (v2 / v1) Databroker implementations and interfaces.
Current Best Practice¶
Databroker uses intake’s configuration system.
Catalog File Search Path¶
We rely on appdirs to detect local conventions for the location of application configuration files. Thus, Databroker's configuration file search path depends on the OS and the software environment. It can be queried like so:
python3 -c "import databroker; print(databroker.catalog_search_path())"
Within these directories, Databroker looks for YAML files. The filenames are not meaningful to Databroker.
The general structure of a catalog YAML file is a nested dictionary of data “sources”. Each source name is mapped to information for accessing that data, which includes a type of “driver” and some keyword arguments to pass to it. A “driver” is generally associated with a particular storage format.
sources:
  SOME_NAME:
    driver: SOME_DRIVER
    args:
      SOME_PARAMETER: VALUE
      ANOTHER_PARAMETER: VALUE
  ANOTHER_NAME:
    driver: SOME_DRIVER
    args:
      SOME_PARAMETER: VALUE
      ANOTHER_PARAMETER: VALUE
As shown, multiple sources can be specified in one file. All sources found in all the YAML files in the search path will be included as top-level entries in databroker.catalog.
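Once one or more YAML files are in place, the discovered sources can be listed and opened by name. A minimal sketch, assuming a source named SOME_NAME as in the example above:

import databroker

# Source names from all discovered YAML files appear as top-level entries.
print(list(databroker.catalog))

# Open one source by name.
catalog = databroker.catalog['SOME_NAME']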
Optional Parameters¶
All Databroker “drivers” accept the following arguments:
- handler_registry — If omitted or None, the result of discover_handlers() is used. See External Assets for background on the role of "handlers".
- root_map — This is passed to event_model.Filler() to account for temporarily moved/copied/remounted files. Any resources which have a root matching a key in root_map will be loaded using the mapped value in root_map.
- transforms — A dict that maps any subset of the keys {start, stop, resource, descriptor} to a function that accepts a document of the corresponding type and returns it, potentially modified. This feature is for patching up erroneous metadata. It is intended for quick, temporary fixes that may later be applied permanently to the data at rest (e.g., via a database migration).
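For example, a catalog entry that remaps a moved file root might look like the following sketch, where the driver, paths, and mapping are illustrative assumptions rather than part of any real deployment:

sources:
  example:
    driver: bluesky-msgpack-catalog
    args:
      paths:
        - "/path/to/documents/*.msgpack"
      root_map:
        # Resources registered under /nfs/data are now mounted
        # at /mnt/data (hypothetical paths).
        /nfs/data: /mnt/data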
Specific drivers require format-specific arguments, shown in the following subsections.
Driver-Specific Parameters¶
Msgpack Example¶
Msgpack is a binary file format.
sources:
  ENTRY_NAME:
    driver: bluesky-msgpack-catalog
    args:
      paths:
        - "DESTINATION_DIRECTORY/*.msgpack"
where ENTRY_NAME is the name of the entry that will appear in databroker.catalog, and DESTINATION_DIRECTORY is a directory of msgpack files generated by suitcase-msgpack, as illustrated in the previous section. Note that the value of paths is a list: multiple directories can be grouped into one "source".
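Once this entry is discovered, it can be opened and read like any other catalog. A minimal sketch, assuming the entry name ENTRY_NAME from the example above:

import databroker

catalog = databroker.catalog['ENTRY_NAME']
run = catalog[-1]  # the most recently saved run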
JSONL (Newline-delimited JSON) Example¶
JSONL is a text-based format in which each line is a valid JSON document. Unlike ordinary JSON, it is suitable for streaming. This storage is much slower than msgpack, but the format is human-readable.
sources:
  ENTRY_NAME:
    driver: bluesky-jsonl-catalog
    args:
      paths:
        - "DESTINATION_DIRECTORY/*.jsonl"
where ENTRY_NAME is the name of the entry that will appear in databroker.catalog, and DESTINATION_DIRECTORY is a directory of newline-delimited JSON files generated by suitcase-jsonl. Note that the value of paths is a list: multiple directories can be grouped into one "source".
MongoDB Example¶
MongoDB is the recommended storage format for large-scale deployments because it supports fast search.
sources:
  ENTRY_NAME:
    driver: bluesky-mongo-normalized-catalog
    args:
      metadatastore_db: mongodb://HOST:PORT/MDS_DATABASE_NAME
      asset_registry_db: mongodb://HOST:PORT/ASSETS_DATABASE_NAME
where ENTRY_NAME is the name of the entry that will appear in databroker.catalog. The two database URIs, metadatastore_db and asset_registry_db, are distinct only for historical reasons. For new deployments, we recommend that you set them to the same value, i.e. that you use one database shared by both.
If you are using Databroker on the same system where you are running MongoDB, then the URI would be mongodb://localhost:27017/DATABASE_NAME, where DATABASE_NAME is fully up to you.
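Because MongoDB supports fast search, the resulting catalog can be queried efficiently. A minimal sketch, assuming an entry named ENTRY_NAME and a hypothetical query:

import databroker

catalog = databroker.catalog['ENTRY_NAME']
# Search on fields from the 'start' document, e.g. by plan name.
results = catalog.search({'plan_name': 'scan'})
print(len(list(results)))  # number of matching runs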
The driver's name, bluesky-mongo-normalized-catalog, differentiates it from bluesky-mongo-embedded-catalog, an experimental alternative way of mapping original bluesky documents into MongoDB documents and collections. It is still under evaluation and not yet recommended for use in production.
Advanced: Configuration via Python Package¶
To distribute catalogs to users, it may be more convenient to provide an
installable Python package, rather than placing YAML files in specific
locations on the user’s machine. To achieve this, a Python package can
advertise catalog objects using the 'intake.catalogs'
entrypoint. Here is a
minimal example:
# setup.py
from setuptools import setup

setup(name='example',
      entry_points={'intake.catalogs':
                    ['ENTRY_NAME = example:catalog_instance']},
      py_modules=['example'])
# example.py
# Create an object named `catalog_instance` which is referenced in the
# setup.py, and will be discovered by Databroker. How the instance is
# created, and what type of catalog it is, is completely up to the
# implementation. This is just one possible example.
import intake
# Look up a driver class by its name in the registry.
catalog_class = intake.registry['bluesky-mongo-normalized-catalog']
catalog_instance = catalog_class(
    metadatastore_db='mongodb://...', asset_registry_db='mongodb://...')
The entry_points parameter in the setup(...) call is a feature supported by Python packaging. When this package is installed, a special file inside the distribution, entry_points.txt, will advertise that it has catalogs. Databroker will discover these and add them to databroker.catalog. Note that Databroker does not need to actually import the package to discover its catalogs. The package will only be imported if and when the catalog is accessed. Thus, the overhead of this discovery process is low.
Important
Some critical details of Python’s entrypoints feature:
- Note the unusual syntax of the entrypoints. Each item is given as one long string, with the = as part of the string. Modules are separated by ., and the final object name is preceded by :.
- The right-hand side of the equals sign must point to where the object is actually defined. If catalog_instance is defined in foo/bar.py and imported into foo/__init__.py, you might expect foo:catalog_instance to work, but it does not. You must spell out foo.bar:catalog_instance.
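To illustrate the second point, here is a sketch of a setup.py for a hypothetical package foo whose catalog_instance is defined in foo/bar.py:

# setup.py
from setuptools import setup

setup(name='foo',
      packages=['foo'],
      entry_points={'intake.catalogs':
                    # Reference the defining module, foo.bar, not just foo.
                    ['MY_ENTRY = foo.bar:catalog_instance']})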
Legacy (v0-style) configuration¶
For backward-compatibility, configuration files specifying MongoDB storage are discovered and included in databroker.catalog. Other legacy formats (SQLite, HDF5) are only accessible via v0. See What are the API versions v0, v1, v2?.
Search path¶
The search path for legacy configuration files differs from the new standard search path. It is, in order of highest precedence to lowest:

- ~/.config/databroker (under the user's home directory)
- python/../etc/databroker, where python is the current Python binary reported by sys.executable (this allows config to be provided inside a virtual environment)
- /etc/databroker/

NOTE: For Windows, we only look in: %APPDATA%\databroker.
A configuration file must be located in one of these directories, and it must be named with the extension .yml. Configuration files are formatted as YAML files.
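For example, if one of the configurations below is saved as example.yml in one of these directories, it can be loaded by name. A minimal sketch, where 'example' is a hypothetical filename:

from databroker import Broker

# The name matches the filename example.yml on the legacy search path.
db = Broker.named('example')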
MongoDB Example¶
This configuration file sets up a databroker that connects to a MongoDB server. Because it requires a running MongoDB server, it takes more work to set up than the SQLite example below.
description: 'heavyweight shared database'
metadatastore:
  module: 'databroker.headersource.mongo'
  class: 'MDS'
  config:
    host: 'localhost'
    port: 27017
    database: 'some_example_database'
    timezone: 'US/Eastern'
assets:
  module: 'databroker.assets.mongo'
  class: 'Registry'
  config:
    host: 'localhost'
    port: 27017
    database: 'some_example_database'
SQLite Example¶
Warning
Storage in sqlite is deprecated, and not supported by the v2 or v1 interfaces. See Migrating sqlite or HDF5 storage to a format supported by v2 / v1.
This configuration file sets up a simple databroker backed by sqlite files. This can be used immediately with no extra setup or installation.
description: 'lightweight personal database'
metadatastore:
  module: 'databroker.headersource.sqlite'
  class: 'MDS'
  config:
    directory: 'some_directory'
    timezone: 'US/Eastern'
assets:
  module: 'databroker.assets.sqlite'
  class: 'Registry'
  config:
    dbpath: 'some_directory/assets.sqlite'
Optional Parameters¶
With reference to the Optional Parameters above, there are some differences in the legacy configuration files:

- The root_map parameter is the same, and should be given as a top-level key (a peer of assets); see the sketch after this list.
- The handler_registry parameter is spelled handlers and has its contents specified differently, splitting up the module from the class. It is, like root_map, a top-level key:

  handlers:
    FOO:
      module: 'databroker.assets.path_only_handlers'
      class: 'RawHandler'

- The transforms parameter is not supported in legacy configuration.
Migrating sqlite or HDF5 storage to a format supported by v2 / v1¶
The implementation in databroker.v0 interfaces with storage in MongoDB, sqlite, or HDF5. The implementations in databroker.v1 and databroker.v2 drop support for sqlite and HDF5 and add support for JSONL (newline-delimited JSON) and msgpack. For binary file-based storage, we recommend using msgpack. Data can be migrated from sqlite or HDF5 to msgpack like so:
from databroker import Broker
import suitcase.msgpack

# If the config file associated with YOUR_BROKER_NAME specifies sqlite or
# HDF5 storage, then this will return a databroker.v0.Broker instance.
db = Broker.named(YOUR_BROKER_NAME)

# Loop through every run in the old Broker.
for run in db():
    # Load all the documents out of this run from their existing format and
    # write them into one file located at
    # `<DESTINATION_DIRECTORY>/<uid>.msgpack`.
    suitcase.msgpack.export(run.documents(), DESTINATION_DIRECTORY)
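After the export completes, a new-style msgpack catalog entry pointing at DESTINATION_DIRECTORY (see the Msgpack Example above) can be used to confirm the runs are readable. A minimal sketch, assuming a hypothetical entry name:

import databroker

new_catalog = databroker.catalog['MIGRATED_ENTRY_NAME']
print(len(list(new_catalog)))  # should match the number of migrated runs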