How can it be used locally and remotely?
========================================

The bluesky ecosystem provides several modes for accessing data:

* Access Central DataBroker via a Generic Remote Client --- This includes
  Remote Desktop, Jupyter, and SSH.
* Portable DataBroker with Local Data --- Let users use ``databroker`` on
  their laptops and/or on servers at their home institutions, with all the
  relevant data copied locally and no need for a network connection.
* Portable DataBroker with Remote Data --- Let users use ``databroker`` on
  their laptops and/or on servers at their home institutions, pulling data
  from an HTTP server on demand, and optionally caching it locally.
* Traditional File Export --- Export data to files for existing software
  that expects files in a certain format named a certain way.

Access Central DataBroker via a Generic Remote Client
-----------------------------------------------------

In this mode, users do not install ``databroker`` locally. They use any
remote client---such as Remote Desktop, Jupyter, or SSH---to access a Python
environment on the source machine, and use ``databroker`` there, which
presumably has fast access to the data storage and some compute resources.

Portable DataBroker with Local Data
-----------------------------------

DataBroker is not itself a data store; it is a Python library for accessing
data across a variety of data stores. Therefore, it can be run on a laptop
without network connectivity, accessing data stored in ordinary files or in
a local database. Both are officially supported. The process involves:

#. Identify a subset of the data to be copied locally from the source
   institution, given as a query (e.g. a time range) or a list of unique
   identifiers. Export the documents into a file-based format (typically
   msgpack). Copy any of the large "external" files (e.g. TIFF or HDF5
   files generated by large detectors).
#. Transfer all of this to the target machine, perhaps via ``rsync`` or
   Globus. Place a configuration file discoverable by ``databroker`` that
   points to the location where the files were transferred.
#. Install the Python library ``databroker`` on the target machine using
   pip or conda. DataBroker can work on top of a directory of ordinary
   files just fine; it even supports the same queries that it would
   normally run on a database---just less efficiently. Optionally, ingest
   the documents into a local database to support more efficient queries.

The small utility `databroker-pack <https://blueskyproject.io/databroker-pack>`_
streamlines the process of "packing" some data from DataBroker into portable
files and "unpacking" them at their destination, as sketched below.
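For concreteness, the pack-and-unpack round trip might look like the
following. This is a sketch assuming a source catalog named ``xyz`` and
illustrative paths; the exact flags vary by version, so consult the
databroker-pack documentation.

.. code-block:: bash

   # On the source machine: pack the catalog (here, all of it) into a
   # portable directory, copying the external detector files alongside
   # the msgpack-formatted documents.
   databroker-pack xyz --all --copy-external ./packed_data

   # Transfer ./packed_data to the target machine via rsync, Globus, etc.
   # Then, on the target machine, register it under a new catalog name.
   databroker-unpack inplace ./packed_data xyz_portable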
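Once unpacked, the data is reachable through the ordinary ``databroker``
API. A minimal sketch, assuming the catalog name registered above:

.. code-block:: python

   import databroker

   # The unpacked data shows up as a regular, discoverable catalog.
   catalog = databroker.catalog["xyz_portable"]

   # The same query API that runs against a facility database works here
   # too, just less efficiently, since it scans files rather than indexes.
   results = catalog.search({"plan_name": "count"})
   run = catalog[-1]  # the most recent run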
Portable DataBroker with Remote Data
------------------------------------

In this mode, data copying happens invisibly to the user and only on demand.
The process involves:

#. Install the Python library ``databroker`` on the target machine using
   pip or conda.
#. Provide ``databroker`` with the URL of a remote data catalog running at
   the source facility.

The user experience from there is exactly the same whether the data happens
to be local or remote. Thus, users could write code in one mode and
seamlessly transition to the other. Data is downloaded on demand, and it may
be cached locally so that it need not be repeatedly downloaded. This
requires a stable URL and a reliable network connection.

There are *no instances of this mode* known at this time, but all the
software pieces to achieve it exist. It is on the project roadmap.

Traditional File Export
-----------------------

Export the data to files (e.g. TIFFs and/or CSVs) with the metadata of your
choice encoded in filenames. This mode forfeits much of the power of
databroker and the bluesky ecosystem generally, but it is important for
supporting existing workflows and software that expects files in a certain
format named a certain way. We expect this mode to become less useful as
data sizes increase and scientific software literacy grows over time. It is
a bridge.

Streaming Export
^^^^^^^^^^^^^^^^

This means exporting the data during data acquisition such that partial
results are available for reading. The bluesky
`suitcase <https://blueskyproject.io/suitcase>`_ project provides a pattern
for doing this and ready-to-use implementations for popular formats. The
streaming export tools may also be used after data acquisition, as sketched
at the end of this page.

Prompt Export
^^^^^^^^^^^^^

This means exporting the data at the end of data acquisition. (To be
precise, at the end of each "Bluesky Run". The scope of a "Run" is up to the
details of the data acquisition procedure.) This is typically much simpler
than streaming export and can be implemented *ad hoc* by accessing the data
from databroker and writing out a file using the relevant Python I/O
library.
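For example, an *ad hoc* exporter might read a run's primary data stream
into a table and write a CSV named after the run's unique identifier. A
minimal sketch, assuming databroker's v2-style API and the illustrative
catalog name used earlier on this page:

.. code-block:: python

   import databroker

   catalog = databroker.catalog["xyz_portable"]
   run = catalog[-1]  # the run to export, here the most recent one

   # Read the "primary" event stream into an xarray Dataset, then write a
   # CSV named using metadata from the run's start document.
   dataset = run.primary.read()
   uid = run.metadata["start"]["uid"]
   dataset.to_dataframe().to_csv(f"scan_{uid:.8}.csv")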
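Alternatively, as noted under Streaming Export above, the streaming export
tools work on completed runs as well. A sketch using the ``suitcase-csv``
package (other suitcase packages follow the same pattern; the catalog name
is again illustrative):

.. code-block:: python

   import databroker
   import suitcase.csv

   catalog = databroker.catalog["xyz_portable"]
   run = catalog[-1]

   # Replay the run's documents through the exporter, which writes one CSV
   # per event stream into the given directory and returns the file paths.
   artifacts = suitcase.csv.export(run.documents(fill="yes"),
                                   "exported_files/")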