Export Data

In this tutorial we will export data from a Run to files. We will do this in two ways:

  • The simple way, using methods like to_csv provided by standard scientific Python tools

  • The “streaming” way, using Bluesky’s Suitcases

Set up for Tutorial

Before you begin, install databroker and databroker-pack, following the Installation Tutorial.

Start your favorite interactive Python environment, such as ipython or jupyter lab.

For this tutorial, we’ll use a catalog of publicly available, openly licensed sample data. Specifically, it is high-quality transmission XAS data from all over the periodic table.

This utility downloads it and makes it discoverable to Databroker.

In [1]: import databroker.tutorial_utils

In [2]: databroker.tutorial_utils.fetch_BMM_example()

Access the catalog and assign it to a variable for convenience.

In [3]: import databroker

In [4]: catalog = databroker.catalog['bluesky-tutorial-BMM']

Let’s take a Run from this Catalog.

In [5]: run = catalog[23463]

What’s in the Run?

The Run’s “pretty display”, shown by IPython, Jupyter, and similar tools, gives us a summary.

In [6]: run
  2020-03-07 10:29:49.483 -- 2020-03-07 10:41:20.546
    * baseline
    * primary

Each run contains logical “tables” of data called streams. We can see them in the summary above, and we can iterate over them programmatically with a for loop or with list.

In [7]: list(run)
Out[7]: ['baseline', 'primary']
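
To inspect every stream programmatically, you can loop over the stream names and read each one into an xarray Dataset. This is a minimal sketch; getattr(run, stream_name) is equivalent to writing run.primary or run.baseline by hand.

for stream_name in run:
    ds = getattr(run, stream_name).read()  # same as run.primary.read(), etc.
    print(stream_name, list(ds))           # list(ds) gives the column (data variable) names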

Simple Export

Export to CSV or Excel

CSV can be suitable for small amounts of scalar data. It’s not fast, and it’s not a particularly good way to store numeric data or rich metadata, but it is universally understood and human-readable.

Here, we look at the columns in the primary stream and choose some to export to CSV.

In [8]: ds = run.primary.read()

In [9]: ds
Out[9]:
<xarray.Dataset>
Dimensions:                   (time: 411)
Coordinates:
  * time                      (time) float64 1.584e+09 1.584e+09 ... 1.584e+09
Data variables:
    I0                        (time) float64 135.2 135.0 134.8 ... 99.55 99.31
    It                        (time) float64 123.2 123.3 123.5 ... 38.25 38.27
    Ir                        (time) float64 55.74 55.98 56.2 ... 9.683 9.711
    dwti_dwell_time           (time) float64 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0
    dwti_dwell_time_setpoint  (time) float64 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0
    dcm_energy                (time) float64 8.133e+03 8.143e+03 ... 9.075e+03
    dcm_energy_setpoint       (time) float64 8.133e+03 8.143e+03 ... 9.075e+03

In [10]: columns = ["I0", "It", "Ir", "dcm_energy"]  # columns we want to export

In [11]: df = ds[columns].to_dataframe()

In [12]: df
                      I0          It         Ir   dcm_energy
1.583577e+09  135.166441  123.161731  55.739678  8132.815659
1.583577e+09  134.979163  123.342974  55.977293  8142.810017
1.583577e+09  134.785572  123.504594  56.202308  8152.849665
1.583577e+09  134.635738  123.709273  56.458782  8162.843543
1.583577e+09  134.702942  124.117268  56.789872  8172.831253
...                  ...         ...        ...          ...
1.583578e+09  100.425114   38.201779   9.591793  9053.740421
1.583578e+09  100.171185   38.236501   9.629766  9058.972209
1.583578e+09   98.829689   37.850528   9.557539  9064.259744
1.583578e+09   99.550269   38.245548   9.683483  9069.553609
1.583578e+09   99.306863   38.267523   9.711234  9074.853815

[411 rows x 4 columns]

# Setting index=False omits the "time" index on the left from the output.
In [13]: df.to_csv("data.csv", index=False)

If your target is to get data into Excel, note that you can write Excel files directly. This requires an additional dependency that you may not already have installed.

# Install Excel writer used by pandas using pip...
pip install openpyxl
# or conda...
conda install -c conda-forge openpyxl
In [14]: df.to_excel("data.xlsx", index=False)

Both of these methods have a large number of options to customize the output. Use df.to_csv? (IPython, Jupyter) or help(df.to_csv) to learn more. Likewise for df.to_excel.
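
For instance, here is a minimal sketch of a customized export; the separator, numeric precision, and sheet name shown are illustrative choices, not requirements.

# Tab-separated output with controlled numeric precision.
df.to_csv("data.tsv", sep="\t", float_format="%.6g", index=False)

# Excel output with a named sheet.
df.to_excel("data.xlsx", sheet_name="spectra", index=False)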

If you have many runs to export in batch, you can use the metadata to generate filenames automatically. It is strongly recommended to include part of the globally unique id, uid, at the end of the filename to ensure that names do not clash and overwrite one another.

In [15]: columns = ["I0", "It", "Ir", "dcm_energy"]

In [16]: results = catalog.search({"XDI.Element.symbol": "Mn"})

In [17]: for uid, run in results.items():
   ....:     ds = run.primary.read()
   ....:     df = ds[columns].to_dataframe()
   ....:     md = run.metadata["start"]
   ....:     filename = f'Mn-spectra-{md["scan_id"]}-{md["uid"]:.8}.csv'
   ....:     df.to_csv(filename, index=False)
   ....:     print(f"Saved {filename}")
Saved Mn-spectra-22668-3c3da7c8.csv
Saved Mn-spectra-23457-292e1bcb.csv
Saved Mn-spectra-22670-c353bdda.csv
Saved Mn-spectra-22671-776a03b3.csv
Saved Mn-spectra-23726-12a63104.csv
Saved Mn-spectra-22669-647b0065.csv

Export to HDF5

HDF5 is suitable for image data. It is understood by most data analysis software.


This example uses h5py.

conda install h5py

# or...

pip install h5py
In [18]: import h5py

In [19]: ds = run.primary.read()

In [20]: columns = ["I0", "It", "Ir", "dcm_energy"]  # columns we want to export

In [21]: with h5py.File("data.h5", "w") as file:
   ....:     for column in columns:
   ....:         file[column] = ds[column]
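
HDF5 files can also hold metadata alongside the arrays. As a minimal sketch, assuming you want to tag the file with a couple of entries from the Run’s start document (scan_id and uid are standard keys; pick whichever metadata you need):

import h5py

with h5py.File("data.h5", "a") as file:  # re-open the file written above in append mode
    md = run.metadata["start"]
    file.attrs["scan_id"] = md["scan_id"]  # store selected scalars as HDF5 attributes
    file.attrs["uid"] = md["uid"]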

Streaming Export

A tool built for streaming export can be used on both saved data (as we’ll do here) and on live-streaming data during data acquisition.


This example uses suitcase-csv.

conda install -c nsls2forge suitcase-csv

# or...

pip install suitcase-csv
In [22]: import suitcase.csv

In [23]: artifacts = suitcase.csv.export(run.documents(fill="yes"), "output_directory")

The return value, artifacts, is a dictionary mapping a label for each kind of output to the list of file paths that were written, so you can keep track of exactly which files were produced.

Note that this operates on the entire run and all of its streams. When a Run contains multiple streams, multiple CSV files will be created. This is why it accepts a path to a directory rather than a path to a single file. Any data that is not well-suited to the format (e.g. image data, in this case) is omitted from the export.
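
The same Serializer can also be wired up for live export during data acquisition, as mentioned at the start of this section. Here is a minimal sketch, assuming you have a bluesky RunEngine instance named RE; the directory name is illustrative.

from event_model import RunRouter
import suitcase.csv

def factory(name, start_doc):
    # Create a fresh Serializer for each Run so its files are opened and closed cleanly.
    serializer = suitcase.csv.Serializer("live_output_directory")
    return [serializer], []

RE.subscribe(RunRouter([factory]))  # documents are exported as they are emitted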

See Suitcase for a list of supported formats and more information.