Export Data¶
In this tutorial we will export data from a Run to files. We will do this in two ways:
* The simple way, using methods like to_csv provided by standard scientific Python tools
* The “streaming” way, using Bluesky’s Suitcases
Set up for Tutorial¶
Before you begin, install databroker and databroker-pack, following the Installation Tutorial.
Start your favorite interactive Python environment, such as ipython or jupyter lab.
For this tutorial, we’ll use a catalog of publicly available, openly licensed sample data. Specifically, it is high-quality transmission XAS data from all over the periodic table.
This utility downloads it and makes it discoverable to Databroker.
In [1]: import databroker.tutorial_utils
In [2]: databroker.tutorial_utils.fetch_BMM_example()
Access the catalog and assign it to a variable for convenience.
In [3]: import databroker
In [4]: catalog = databroker.catalog['bluesky-tutorial-BMM']
Let’s take a Run from this Catalog.
In [5]: run = catalog[23463]
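Here we looked the Run up by its scan_id. Databroker also supports lookup by unique ID. A minimal sketch of the equivalent alternatives, using the uid shown in the summary below:

# Equivalent ways to look up the same Run:
run = catalog[23463]                                   # by scan_id
run = catalog["4393404b-8986-4c75-9a64-d7f6949a9344"]  # by full uid
run = catalog["4393404b"]                              # by a prefix of the uid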
What’s in the Run?¶
The Run’s “pretty display”, shown by IPython, Jupyter, and similar tools, gives us a summary.
In [6]: run
Out[6]:
BlueskyRun
uid='4393404b-8986-4c75-9a64-d7f6949a9344'
exit_status='success'
2020-03-07 10:29:49.483 -- 2020-03-07 10:41:20.546
Streams:
* baseline
* primary
Each run contains logical “tables” of data called streams. We can see them in the summary above, and we can iterate over them programmatically with a for loop or with list.
In [7]: list(run)
Out[7]: ['baseline', 'primary']
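For example, a short sketch of looping over the streams and reading each one into an xarray Dataset (the same access pattern used below for primary):

# Loop over stream names; read each stream's data and show its dimensions.
for name in run:
    ds = run[name].read()
    print(name, dict(ds.sizes))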
Simple Export¶
Export to CSV or Excel¶
CSV can be suitable for small amounts of scalar data. It’s not fast, and it’s not a particularly good way to store numeric data or rich metadata, but it is universally understood and human-readable.
Here, we look at the columns in the primary stream and choose some to export to CSV.
In [8]: ds = run.primary.read()
In [9]: ds
Out[9]:
<xarray.Dataset>
Dimensions: (time: 411)
Coordinates:
* time (time) float64 1.584e+09 1.584e+09 ... 1.584e+09
Data variables:
I0 (time) float64 135.2 135.0 134.8 ... 99.55 99.31
It (time) float64 123.2 123.3 123.5 ... 38.25 38.27
Ir (time) float64 55.74 55.98 56.2 ... 9.683 9.711
dwti_dwell_time (time) float64 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0
dwti_dwell_time_setpoint (time) float64 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0
dcm_energy (time) float64 8.133e+03 8.143e+03 ... 9.075e+03
dcm_energy_setpoint (time) float64 8.133e+03 8.143e+03 ... 9.075e+03
In [10]: columns = ["I0", "It", "Ir", "dcm_energy"] # columns we want to export
In [11]: df = ds[columns].to_dataframe()
In [12]: df
Out[12]:
I0 It Ir dcm_energy
time
1.583577e+09 135.166441 123.161731 55.739678 8132.815659
1.583577e+09 134.979163 123.342974 55.977293 8142.810017
1.583577e+09 134.785572 123.504594 56.202308 8152.849665
1.583577e+09 134.635738 123.709273 56.458782 8162.843543
1.583577e+09 134.702942 124.117268 56.789872 8172.831253
... ... ... ... ...
1.583578e+09 100.425114 38.201779 9.591793 9053.740421
1.583578e+09 100.171185 38.236501 9.629766 9058.972209
1.583578e+09 98.829689 37.850528 9.557539 9064.259744
1.583578e+09 99.550269 38.245548 9.683483 9069.553609
1.583578e+09 99.306863 38.267523 9.711234 9074.853815
[411 rows x 4 columns]
# Setting index=False omits the "time" index on the left from the output.
In [13]: df.to_csv("data.csv", index=False)
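The time values are raw epoch seconds, as seen in the table above. If human-readable timestamps would be more useful in the exported file, one option is to convert the index with pandas before writing. A sketch:

import pandas as pd

# Build a copy whose index holds datetimes instead of epoch seconds,
# and keep that index in the output this time.
df_ts = df.set_index(pd.to_datetime(df.index, unit="s"))
df_ts.to_csv("data_with_timestamps.csv")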
If your target is to get data into Excel, note that you can write Excel files directly. This requires an additional dependency that you may not already have installed.
# Install Excel writer used by pandas using pip...
pip install openpyxl
# or conda...
conda install -c conda-forge openpyxl
In [14]: df.to_excel("data.xlsx", index=False)
Both of these methods have a large number of options to customize the output. Use df.to_csv? (IPython, Jupyter) or help(df.to_csv) to learn more. Likewise for df.to_excel.
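For instance, a sketch of a few commonly used options, all standard pandas keyword arguments:

# Tab-separated output, compact floats, and custom column headers.
df.to_csv(
    "data.tsv",
    sep="\t",
    float_format="%.6g",
    header=["I0", "It", "Ir", "energy (eV)"],
    index=False,
)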
If you have many runs to export in batch, you may use the metadata to automatically generate filenames. It is strongly recommended to include part of the globally unique id, uid, at the end to ensure that names do not clash and overwrite one another.
In [15]: columns = ["I0", "It", "Ir", "dcm_energy"]
In [16]: results = catalog.search({"XDI.Element.symbol": "Mn"})
In [17]: for uid, run in results.items():
   ....:     ds = run.primary.read()
   ....:     df = ds[columns].to_dataframe()
   ....:     md = run.metadata["start"]
   ....:     filename = f'Mn-spectra-{md["scan_id"]}-{md["uid"]:.8}.csv'
   ....:     df.to_csv(filename, index=False)
   ....:     print(f"Saved {filename}")
   ....:
Saved Mn-spectra-22668-3c3da7c8.csv
Saved Mn-spectra-23457-292e1bcb.csv
Saved Mn-spectra-22670-c353bdda.csv
Saved Mn-spectra-22671-776a03b3.csv
Saved Mn-spectra-23726-12a63104.csv
Saved Mn-spectra-22669-647b0065.csv
Export to HDF5¶
HDF5 is suitable for image data. It is understood by most data analysis software.
Note
This example uses h5py.
conda install h5py
# or...
pip install h5py
In [18]: import h5py
In [19]: ds = run.primary.read()
In [20]: columns = ["I0", "It", "Ir", "dcm_energy"] # columns we want to export
In [21]: with h5py.File("data.h5", "w") as file:
   ....:     for column in columns:
   ....:         file[column] = ds[column].values
   ....:
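To verify the export, a quick sketch of reading the file back with h5py:

# Open the file read-only and inspect each dataset.
with h5py.File("data.h5", "r") as file:
    for column in columns:
        print(column, file[column].shape)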
Streaming Export¶
A tool built for streaming export can be used on both saved data (as we’ll do here) and on live-streaming data during data acquisition.
Note
This example uses suitcase-csv.
conda install -c nsls2forge suitcase-csv
# or...
pip install suitcase-csv
In [22]: import suitcase.csv
In [23]: artifacts = suitcase.csv.export(run.documents(fill="yes"), "output_directory")
In [24]: artifacts
The export function returns the artifacts it created: a dictionary mapping a label for each kind of output to the list of files written. Note that this operates on the entire run and all of its streams. When a Run contains multiple streams, multiple CSV files will be created. This is why it accepts a path to a directory rather than a path to a single file. Any data that is not well-suited to the format (e.g. image data in this case) is omitted from the export.
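Because a Suitcase Serializer consumes documents one at a time, the same machinery works during acquisition. A minimal sketch, assuming a configured Bluesky RunEngine RE and a detector det (neither is part of this tutorial’s saved dataset):

from bluesky.plans import count
from suitcase.csv import Serializer

# Hypothetical live export: documents are written to CSV as the run executes.
with Serializer("live_output_directory") as serializer:
    RE(count([det], num=5), serializer)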
See Suitcase for a list of supported formats and more information.