Reference¶
Python API¶
- databroker_pack.export_catalog(source_catalog, directory, *, strict=False, external=None, no_documents=False, handler_registry=None, serializer_class=None, salt=None, limit=None)[source]¶
Export all the Runs from a Catalog.
- Parameters
- source_catalog: Catalog
- directory: Union[Str, Manager]
Where files containing documents will be written, or a Manager for writing to non-file buffers.
- strict: Bool, optional
By default, swallow errors and return a list of them at the end. Set to True to debug errors.
- external: {None, ‘fill’, ‘ignore’}
If None, return the paths to external files. If ‘fill’, fill the external data into the Documents. If ‘ignore’, do not locate external files.
- no_documents: Bool, optional
If True, do not serialize documents. False by default.
- handler_registry: Union[Dict, None]
If None, automatic handler discovery is used.
- serializer_class: Serializer
Expected to be a lossless serializer that encodes a format for which there is a corresponding databroker intake driver. Default (None) is currently suitcase.msgpack.Serializer, but this may change in the future. If you want suitcase.msgpack.Serializer specifically, pass it in explicitly.
- salt: Union[bytes, None]
We want to make hashes unique to:
- a root
- a given batch of exported runs (i.e., a given call to this function)
so that we can use the hash as a key in root_map which is guaranteed not to collide with keys from other batches. Thus, we create a “salt” unless one is specified here. This does not need to be cryptographically secure, just unique.
- limit: Union[Integer, None]
Stop after exporting some number of Runs. Useful for testing a subset before doing a lengthy export.
- Returns
- artifacts, files, failures, file_uids
Notes
- artifacts maps a human-readable string (typically just 'all' in this case) to a list of buffers or filepaths where the documents were serialized.
- files is the set of filepaths of all external files referenced by Resource documents, keyed on (root_in_document, root, unique_id).
- failures is a list of uids of runs that raised Exceptions. (The relevant tracebacks are logged.)
- file_uids is a dictionary of RunStart unique IDs mapped to a set of (root, filename) pairs.
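For orientation, here is a minimal, hedged sketch of calling export_catalog; the catalog name "xyz", the output directory, and the limit are placeholders, and only arguments documented above are used.

```python
# Minimal sketch: export up to 10 Runs from a databroker catalog named "xyz"
# (a placeholder) into ./my_pack.
import databroker
import databroker_pack

source_catalog = databroker.catalog["xyz"]
artifacts, files, failures, file_uids = databroker_pack.export_catalog(
    source_catalog,
    "my_pack",      # directory for the serialized documents
    limit=10,       # stop after 10 Runs; useful for a trial export
)
if failures:
    print("Runs that raised exceptions:", failures)
```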
- databroker_pack.export_uids(source_catalog, uids, directory, *, strict=False, external=None, no_documents=False, handler_registry=None, serializer_class=None, salt=None)[source]¶
Export Runs from a Catalog, given a list of RunStart unique IDs.
- Parameters
- source_catalog: Catalog
- uids: List[Str]
List of RunStart unique IDs
- directory: Union[Str, Manager]
Where files containing documents will be written, or a Manager for writing to non-file buffers.
- strict: Bool, optional
By default, swallow errors and return a list of them at the end. Set to True to debug errors.
- external: {None, ‘fill’, ‘ignore’}
If None, return the paths to external files. If ‘fill’, fill the external data into the Documents. If ‘ignore’, do not locate external files.
- no_documents: Bool, optional
If True, do not write any files. False by default.
- handler_registry: Union[Dict, None]
If None, automatic handler discovery is used.
- serializer_class: Serializer
Expected to be a lossless serializer that encodes a format for which there is a corresponding databroker intake driver. Default (None) is currently suitcase.msgpack.Serializer, but this may change in the future. If you want suitcase.msgpack.Serializer specifically, pass it in explicitly.
- salt: Union[bytes, None]
We want to make hashes unique to:
- a root
- a given batch of exported runs (i.e., a given call to this function)
so that we can use the hash as a key in root_map which is guaranteed not to collide with keys from other batches. Thus, we create a “salt” unless one is specified here. This does not need to be cryptographically secure, just unique.
- Returns
- artifacts, files, failures, file_uids
Notes
- artifacts maps a human-readable string (typically just 'all' in this case) to a list of buffers or filepaths where the documents were serialized.
- files is the set of filepaths of all external files referenced by Resource documents, keyed on (root_in_document, root, unique_id).
- failures is a list of uids of runs that raised Exceptions. (The relevant tracebacks are logged.)
- file_uids is a dictionary of RunStart unique IDs mapped to a set of (root, filename) pairs.
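A similarly hedged sketch for export_uids, with placeholder uids; external="fill" inlines the external data into the Documents as described above.

```python
# Sketch: export two specific Runs by RunStart uid. The uids are placeholders.
import databroker
import databroker_pack

source_catalog = databroker.catalog["xyz"]          # placeholder catalog name
uids = ["<RunStart uid 1>", "<RunStart uid 2>"]     # placeholders
artifacts, files, failures, file_uids = databroker_pack.export_uids(
    source_catalog,
    uids,
    "my_pack",
    external="fill",   # fill external data directly into the Documents
)
```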
- databroker_pack.export_run(run, directory, root_hash_func, *, external=None, no_documents=False, handler_registry=None, root_map=None, serializer_class=None)[source]¶
Export one Run.
- Parameters
- run: BlueskyRun
- directory: Union[Str, Manager]
Where files containing documents will be written, or a Manager for writing to non-file buffers.
- external: {None, ‘fill’, ‘ignore’}
If None, return the paths to external files. If ‘fill’, fill the external data into the Documents. If ‘ignore’, do not locate external files.
- no_documents: Bool, optional
If True, do not serialize documents. False by default.
- handler_registry: Union[Dict, None]
If None, automatic handler discovery is used.
- serializer_class: Serializer, optional
Expected to be a lossless serializer that encodes a format for which there is a corresponding databroker intake driver. Default (None) is currently suitcase.msgpack.Serializer, but this may change in the future. If you want suitcase.msgpack.Serializer specifically, pass it in explicitly.
- Returns
- artifacts, files
Notes
- artifacts maps a human-readable string (typically just 'all' in this case) to a list of buffers or filepaths where the documents were serialized.
- files is the set of filepaths of all external files referenced by Resource documents, keyed on (root_in_document, root, unique_id).
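export_run is the lower-level building block used by the two functions above. The sketch below assumes root_hash_func is a callable mapping a root path to a short, batch-unique string (consistent with the salt discussion above); that assumed interface, the salt value, and the Run lookup are all placeholders.

```python
# Hedged sketch: export a single Run. The one-argument root_hash_func shown
# here is an assumption about the expected interface, not a documented contract.
import hashlib

import databroker
import databroker_pack

salt = b"example-batch-salt"  # needs to be unique per batch, not secret

def root_hash_func(root):
    # Any deterministic, collision-resistant mapping of root -> string.
    return hashlib.md5(salt + str(root).encode()).hexdigest()

run = databroker.catalog["xyz"]["<RunStart uid>"]   # placeholders
artifacts, files = databroker_pack.export_run(run, "my_pack", root_hash_func)
```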
- databroker_pack.copy_external_files(target_directory, root, unique_id, files, strict=False)[source]¶
Make a filesystem copy of the external files.
A filesystem copy is not always applicable/desirable. Use the external_files_manifest_*.txt files to feed other file transfer mechanisms, such as rsync or globus.
This is a wrapper around shutil.copyfile.
- Parameters
- target_directory: Union[Str, Path]
- root: Str
- files: Iterable[Str]
- strict: Bool, optional
By default, swallow errors and return a list of them at the end. Set to True to debug errors.
- Returns
- new_root, new_files, failures
Notes
- new_root is a Path to the new root directory.
- new_files is the list of filepaths to the files that were created.
- failures is a list of uids of runs that raised Exceptions. (The relevant tracebacks are logged.)
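The files mapping returned by export_catalog / export_uids can be fed to copy_external_files directly. A hedged sketch, with a placeholder target directory:

```python
# Sketch: copy every external file reported by an earlier export into the pack.
# `files` is the mapping returned by export_catalog / export_uids, keyed on
# (root_in_document, root, unique_id).
import databroker_pack

for (root_in_document, root, unique_id), filenames in files.items():
    new_root, new_files, failures = databroker_pack.copy_external_files(
        "my_pack/external_files",   # placeholder target directory
        root,
        unique_id,
        filenames,
    )
    if failures:
        print("Failed copies:", failures)
```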
- databroker_pack.unpack_inplace(path, catalog_name, merge=False)[source]¶
Place a catalog configuration file in the user configuration area.
- Parameters
- path: Path
Path to output from pack
- catalog_name: Str
A unique name for the catalog
- merge: Boolean, optional
Unpack into an existing catalog
- Returns
- config_path: Path
Location of new catalog configuration file
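A hedged sketch of registering a pack and opening it afterwards; the pack path and catalog name are placeholders, and force_reload() is ordinary intake/databroker behavior for picking up new configuration files.

```python
# Sketch: register ./my_pack under the name "my_catalog" (both placeholders),
# then list the Runs it contains through databroker.
import databroker
import databroker_pack

config_path = databroker_pack.unpack_inplace("my_pack", "my_catalog")
print("Wrote catalog configuration to", config_path)

databroker.catalog.force_reload()               # pick up the new configuration file
print(list(databroker.catalog["my_catalog"]))   # RunStart uids in the pack
```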
- databroker_pack.unpack_mongo_normalized(path, uri, catalog_name, merge=False)[source]¶
Place a catalog configuration file in the user configuration area.
- Parameters
- path: Path
Path to output from pack
- uri: Str
MongoDB URI. Must include a database name. Example:
mongodb://localhost:27017/databroker_unpack_my_catalog
- catalog_name: Str
A unique name for the catalog
- merge: Boolean, optional
Unpack into an existing catalog
- Returns
- config_path: Path
Location of new catalog configuration file
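And the MongoDB-backed equivalent, again with placeholder names; note that the URI must include a database name, as documented above.

```python
# Sketch: load the pack into MongoDB and register it as "my_catalog".
import databroker_pack

config_path = databroker_pack.unpack_mongo_normalized(
    "my_pack",                                                  # placeholder pack path
    "mongodb://localhost:27017/databroker_unpack_my_catalog",   # must name a database
    "my_catalog",
)
```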
- databroker_pack.write_documents_manifest(manager, directory, artifacts)[source]¶
Write the paths to all the files of Documents relative to the pack directory.
- Parameters
- manager: suitcase Manager object
- directory: Str
Pack directory
- artifacts: List[Str]
- databroker_pack.write_external_files_manifest(manager, unique_id, files)[source]¶
Write a manifest of external files.
- Parameters
- manager: suitcase Manager object
- unique_id: Str
- files: Iterable[Union[Str, Path]]
- databroker_pack.write_jsonl_catalog_file(manager, directory, paths, root_map)[source]¶
Write a YAML file with configuration for an intake catalog.
- Parameters
- manager: suitcase Manager object
- directory: Str
Directory to which paths below are relative
- paths: Union[Str, List[Str]]
Relative path(s) of JSONL files encoding Documents.
- root_map: Dict
- databroker_pack.write_msgpack_catalog_file(manager, directory, paths, root_map)[source]¶
Write a YAML file with configuration for an intake catalog.
- Parameters
- manager: suitcase Manager object
- directory: Str
Directory to which paths below are relative
- paths: Union[Str, List[Str]]
Relative path(s) of msgpack files encoding Documents.
- root_map: Dict
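The write_* helpers above all take a suitcase Manager as their first argument. The sketch below assumes suitcase.utils.MultiFileManager as that Manager and uses placeholder paths, unique_id, and root_map values; any other suitcase Manager should work the same way.

```python
# Hedged sketch of the lower-level manifest/catalog-file helpers.
import suitcase.utils

import databroker_pack

directory = "my_pack"   # placeholder pack directory
manager = suitcase.utils.MultiFileManager(directory)
try:
    # Record which document files the pack contains.
    databroker_pack.write_documents_manifest(
        manager, directory, ["documents/run1.msgpack"]
    )
    # Record the external files associated with one root hash ("abc123" is a placeholder).
    databroker_pack.write_external_files_manifest(
        manager, "abc123", ["/data/detector/img_000.h5"]
    )
    # Write the intake catalog configuration pointing at the msgpack documents.
    databroker_pack.write_msgpack_catalog_file(
        manager, directory, ["documents/*.msgpack"], {"abc123": "external_files/abc123"}
    )
finally:
    manager.close()   # assumption: the Manager releases its open handles on close()
```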
What kinds of files are in the “pack”?¶
Data Broker is emphatically not a “data store”, but rather a Python library for interacting with potentially any data store from a unified Python interface that hands the user standard Python objects: dictionaries, arrays, and other data structures widely used in the scientific Python ecosystem. It aims to abstract over the necessary variety in file formats across different domains, techniques, and instruments.
That said, it is sometimes necessary to take a look under the hood. The pack directory always contains:
- A directory named documents containing either msgpack (binary) or JSONL (plaintext) files containing the Bluesky Documents.
- Text manifests listing the names of these files relative to the directory root. The manifests may be split over multiple files named like documents_manifest_N.txt to facilitate compressing and transferring in chunks.
If the Documents reference external files (typically large array data written by detectors), these files may:
- Have their contents filled directly into the Documents, and thus included in the msgpack or JSONL. This is blunt but simple.
- Be listed in text manifests named like external_files_manifest_HASH_N.txt. These manifests are suitable for feeding to tools that transfer large files in bulk, such as rsync or globus transfer --batch.
- Be bundled into the pack directory in their original formats in directories named external_files/HASH/.
The advantage of the first approach is that the recipient does not need special I/O libraries installed to read the large array data. The advantage of the second and third approaches is that loading the large array data can be deferred.
The first and third approaches create self-contained directories, but the second approach facilitates more efficient means of transferring large amounts of data.
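As an illustration of how the manifests can drive a transfer tool, here is a hedged sketch that simply reads the external files manifests from a pack directory; the directory name is a placeholder and the manifest naming follows the description above.

```python
# Sketch: collect the external-file paths listed in a pack's manifests,
# e.g. to feed rsync's --files-from or a globus batch file.
from pathlib import Path

pack = Path("my_pack")   # placeholder pack directory
external_paths = []
for manifest in sorted(pack.glob("external_files_manifest_*.txt")):
    external_paths.extend(
        line for line in manifest.read_text().splitlines() if line.strip()
    )
print(f"{len(external_paths)} external files listed")
```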