Reference

Python API

databroker_pack.export_catalog(source_catalog, directory, *, strict=False, external=None, no_documents=False, handler_registry=None, serializer_class=None, salt=None, limit=None)[source]

Export all the Runs from a Catalog.

Parameters
source_catalog: Catalog
directory: Union[Str, Manager]

Where files containing documents will be written, or a Manager for writing to non-file buffers.

strict: Bool, optional

By default, swallow errors and return a list of them at the end. Set to True to raise errors immediately instead, which is useful for debugging.

external: {None, 'fill', 'ignore'}

If None, return the paths to external files. If ‘fill’, fill the external data into the Documents. If ‘ignore’, do not locate external files.

no_documents: Bool, optional

If True, do not serialize documents. False by default.

handler_registry: Union[Dict, None]

If None, automatic handler discovery is used.

serializer_class: Serializer

Expected to be a lossless serializer that encodes a format for which there is a corresponding databroker intake driver. Default (None) is currently suitcase.msgpack.Serializer, but this may change in the future. If you want suitcase.msgpack.Serializer specifically, pass it in explicitly.

salt: Union[bytes, None]

We want to make each hash unique to:

  • a root

  • a given batch of exported runs (i.e., a given call to this function)

so that we can use it as a key in root_map which is guaranteed not to collide with keys from other batches. Thus, we create a “salt” unless one is specified here. This does not need to be cryptographically secure, just unique.

limit: Union[Integer, None]

Stop after exporting some number of Runs. Useful for testing a subset before doing a lengthy export.

Returns
artifacts, files, failures, file_uids

Notes

  • artifacts maps a human-readable string (typically just 'all' in this case) to a list of buffers or filepaths where the documents were serialized.

  • files is the set of filepaths of all external files referenced by Resource documents, keyed on (root_in_document, root, unique_id).

  • failures is a list of uids of runs that raised Exceptions. (The relevant tracebacks are logged.)

  • file_uids is a dictionary of RunStart unique IDs mapped to a set of (root, filename) pairs.
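
For example, a minimal export of a whole catalog might look like the sketch below. The catalog name "xyz" and the directory ./my_pack are placeholders, and limit is used only to keep the example small:

    import databroker
    from databroker_pack import export_catalog

    source_catalog = databroker.catalog["xyz"]   # placeholder catalog name

    # Export up to 10 Runs into ./my_pack, collecting errors instead of raising.
    artifacts, files, failures, file_uids = export_catalog(
        source_catalog, "./my_pack", limit=10
    )
    if failures:
        print(f"{len(failures)} Runs failed to export; see the logged tracebacks.")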

databroker_pack.export_uids(source_catalog, uids, directory, *, strict=False, external=None, no_documents=False, handler_registry=None, serializer_class=None, salt=None)[source]

Export Runs from a Catalog, given a list of RunStart unique IDs.

Parameters
source_catalog: Catalog
uids: List[Str]

List of RunStart unique IDs

directory: Union[Str, Manager]

Where files containing documents will be written, or a Manager for writing to non-file buffers.

strict: Bool, optional

By default, swallow errors and return a list of them at the end. Set to True to raise errors immediately instead, which is useful for debugging.

external: {None, 'fill', 'ignore'}

If None, return the paths to external files. If ‘fill’, fill the external data into the Documents. If ‘ignore’, do not locate external files.

no_documents: Bool, optional

If True, do not write any files. False by default.

handler_registry: Union[Dict, None]

If None, automatic handler discovery is used.

serializer_class: Serializer

Expected to be a lossless serializer that encodes a format for which there is a corresponding databroker intake driver. Default (None) is currently suitcase.msgpack.Serializer, but this may change in the future. If you want suitcase.msgpack.Serializer specifically, pass it in explicitly.

salt: Union[bytes, None]

We want to make each hash unique to:

  • a root

  • a given batch of exported runs (i.e., a given call to this function)

so that we can use it as a key in root_map which is guaranteed not to collide with keys from other batches. Thus, we create a “salt” unless one is specified here. This does not need to be cryptographically secure, just unique.

Returns
artifacts, files, failures, file_uids

Notes

  • artifacts maps a human-readable string (typically just 'all' in this case) to a list of buffers or filepaths where the documents were serialized.

  • files is the set of filepaths of all external files referenced by Resource documents, keyed on (root_in_document, root, unique_id).

  • failures is a list of uids of runs that raised Exceptions. (The relevant tracebacks are logged.)

  • file_uids is a dictionary of RunStart unique IDs mapped to a set of (root, filename) pairs.
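
A sketch of exporting a known set of Runs; the catalog name, uids, and directory are placeholders, and external="fill" is used so the exported Documents carry the external data inline:

    import databroker
    from databroker_pack import export_uids

    source_catalog = databroker.catalog["xyz"]        # placeholder catalog name
    uids = ["<RunStart uid 1>", "<RunStart uid 2>"]   # placeholder unique IDs

    artifacts, files, failures, file_uids = export_uids(
        source_catalog, uids, "./my_pack", external="fill"
    )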

databroker_pack.export_run(run, directory, root_hash_func, *, external=None, no_documents=False, handler_registry=None, root_map=None, serializer_class=None)[source]

Export one Run.

Parameters
run: BlueskyRun
directory: Union[Str, Manager]

Where files containing documents will be written, or a Manager for writing to non-file buffers.

external: {None, 'fill', 'ignore'}

If None, return the paths to external files. If ‘fill’, fill the external data into the Documents. If ‘ignore’, do not locate external files.

no_documents: Bool, optional

If True, do not serialize documents. False by default.

handler_registry: Union[Dict, None]

If None, automatic handler discovery is used.

serializer_class: Serializer, optional

Expected to be a lossless serializer that encodes a format for which there is a corresponding databroker intake driver. Default (None) is currently suitcase.msgpack.Serializer, but this may change in the future. If you want suitcase.msgpack.Serializer specifically, pass it in explicitly.

Returns
artifacts, files

Notes

  • artifacts maps a human-readable string (typically just 'all' in this case) to a list of buffers or filepaths where the documents were serialized.

  • files is the set of filepaths of all external files referenced by Resource documents, keyed on (root_in_document, root, unique_id).
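
A sketch of exporting a single Run. The root_hash_func argument is not described above; it is assumed here to be any callable that maps a Resource "root" string to a short, unique token (compare the salt-based hashing described for export_catalog). The md5-based stand-in, the catalog name, and the uid are all placeholders:

    import hashlib

    import databroker
    from databroker_pack import export_run

    source_catalog = databroker.catalog["xyz"]       # placeholder catalog name
    run = source_catalog["<some RunStart uid>"]      # placeholder Run lookup

    def root_hash_func(root):
        # Illustrative only: any short, unique, deterministic token will do.
        return hashlib.md5(root.encode()).hexdigest()[:8]

    artifacts, files = export_run(run, "./my_pack", root_hash_func)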

databroker_pack.copy_external_files(target_directory, root, unique_id, files, strict=False)[source]

Make a filesystem copy of the external files.

A filesystem copy is not always applicable/desirable. Use the external_file_manifest_*.txt files to feed other file transfer mechanisms, such as rsync or globus.

This is a wrapper around shutil.copyfile.

Parameters
target_directory: Union[Str, Path]
root: Str
files: Iterable[Str]
strict: Bool, optional

By default, swallow errors and return a list of them at the end. Set to True to raise errors immediately instead, which is useful for debugging.

Returns
new_root, new_files, failures

Notes

  • new_root is a Path to the new root directory

  • new_files is the list of filepaths to the files that were created.

  • failures is a list of uids of runs that raised Exceptions. (The relevant tracebacks are logged.)
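
A sketch of copying external files after an export. It assumes files is the mapping returned by export_catalog or export_uids above, keyed on (root_in_document, root, unique_id); the target directory is a placeholder:

    from databroker_pack import copy_external_files

    for (root_in_document, root, unique_id), filepaths in files.items():
        new_root, new_files, failures = copy_external_files(
            "./my_pack_external_copies", root, unique_id, filepaths
        )
        if failures:
            print(f"Some files under {root} were not copied; see the logs.")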

databroker_pack.unpack_inplace(path, catalog_name, merge=False)[source]

Place a catalog configuration file in the user configuration area.

Parameters
path: Path

Path to output from pack

catalog_name: Str

A unique name for the catalog

merge: Boolean, optional

Unpack into an existing catalog

Returns
config_path: Path

Location of new catalog configuration file
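
For example, registering a packed directory as a locally readable catalog might look like this; the path and catalog name are placeholders:

    from databroker_pack import unpack_inplace

    config_path = unpack_inplace("./my_pack", "my_catalog")
    print("Catalog configuration written to", config_path)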

databroker_pack.unpack_mongo_normalized(path, uri, catalog_name, merge=False)[source]

Place a catalog configuration file in the user configuration area.

Parameters
path: Path

Path to output from pack

uri: Str

MongoDB URI. Must include a database name. Example: mongodb://localhost:27017/databroker_unpack_my_catalog

catalog_name: Str

A unique name for the catalog

merge: Boolean, optional

Unpack into an existing catalog

Returns
config_path: Path

Location of new catalog configuration file
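
A sketch of unpacking into MongoDB; the pack path, catalog name, and URI (which must include a database name, as noted above) are placeholders:

    from databroker_pack import unpack_mongo_normalized

    uri = "mongodb://localhost:27017/databroker_unpack_my_catalog"
    config_path = unpack_mongo_normalized("./my_pack", uri, "my_catalog")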

databroker_pack.write_documents_manifest(manager, directory, artifacts)[source]

Write the paths to all the files of Documents relative to the pack directory.

Parameters
manager: suitcase Manager object
directory: Str

Pack directory

artifacts: List[Str]

databroker_pack.write_external_files_manifest(manager, unique_id, files)[source]

Write a manifest of external files.

Parameters
manager: suitcase Manager object
unique_id: Str
files: Iterable[Union[Str, Path]]

databroker_pack.write_jsonl_catalog_file(manager, directory, paths, root_map)[source]

Write a YAML file with configuration for an intake catalog.

Parameters
manager: suitcase Manager object
directory: Str

Directory to which paths below are relative

paths: Union[Str, List[Str]]

Relative path(s) of JSONL files encoding Documents.

root_map: Dict

databroker_pack.write_msgpack_catalog_file(manager, directory, paths, root_map)[source]

Write a YAML file with configuration for an intake catalog.

Parameters
manager: suitcase Manager object
directory: Str

Directory to which paths below are relative

paths: Union[Str, List[Str]]

Relative path(s) of msgpack files encoding Documents.

root_map: Dict
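
These writer helpers are normally invoked for you by the packing machinery, but a direct-use sketch might look like the following. It assumes that artifacts and files are the mappings returned by the export_catalog sketch earlier in this section (with file-backed rather than in-memory artifacts) and that suitcase.utils.MultiFileManager is an acceptable Manager; those assumptions and all paths are illustrative:

    from suitcase.utils import MultiFileManager

    from databroker_pack import (
        write_documents_manifest,
        write_external_files_manifest,
    )

    manager = MultiFileManager("./my_pack")   # assumed to satisfy the Manager interface
    try:
        # List the files of Documents produced by the earlier export.
        write_documents_manifest(manager, "./my_pack", artifacts["all"])
        # Record the external files referenced under each unique_id.
        for (root_in_document, root, unique_id), filepaths in files.items():
            write_external_files_manifest(manager, unique_id, filepaths)
    finally:
        manager.close()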

What kinds of files are in the “pack”?

Data Broker is emphatically not a “data store”, but rather a Python library for interacting with potentially any data store from a unified Python interface that hands the user standard Python objects: dictionaries, arrays, and other data structures widely used in the scientific Python ecosystem. It aims to abstract over the necessary variety in file formats across different domains, techniques, and instruments.

That said, it is sometimes necessary to take a look under the hood. The pack directory always contains:

  • A directory named documents containing either msgpack (binary) or JSONL (plaintext) files containing the Bluesky Documents.

  • Text manifests listing the names of these files relative to the directory root. The manifests may be split over multiple files named like documents_manifest_N.txt to facilitate compressing and transferring in chunks.
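
For concreteness, a packed directory might look roughly like the layout below; the catalog file name, hashes, and counters are illustrative:

    my_pack/
        catalog.yml                            # intake catalog configuration
        documents/                             # msgpack or JSONL files of Documents
        documents_manifest_0.txt               # names of the files under documents/
        external_files_manifest_a1b2c3_0.txt   # external-files manifest, one per root hash
        external_files/
            a1b2c3/                            # bundled external files, if copied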

If the Documents reference external files—typically large array data written by detectors—these files may…

  • Have their contents filled directly into the Documents, and thus included in the msgpack or JSONL. This is blunt but simple.

  • Be listed in text manifests named like external_files_manifest_HASH_N.txt. These manifests are suitable for feeding to tools that transfer large files in bulk, such as rsync or globus transfer --batch (see the rsync sketch at the end of this section).

  • Be bundled into the pack directory in their original formats, in directories named like external_files/HASH/.

The advantage of the first approach is that the recipient does not need special I/O libraries installed to read the large array data. The advantage of the second and third approaches is that loading the large array data can be deferred.

The first and third approaches create self-contained directories, but the second approach facilitates more efficient means of transferring large amounts of data.
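
As an illustration of the second approach, an external-files manifest can be fed directly to a bulk transfer tool. A hedged sketch with rsync, assuming the manifest lists absolute paths on the source machine and that the destination is a placeholder:

    # --files-from reads the list of files to transfer from the manifest;
    # with "/" as the source, each entry is treated as an absolute path.
    rsync -av --files-from=my_pack/external_files_manifest_a1b2c3_0.txt / user@dest:/data/external_files/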