Export Data
===========

In this tutorial we will export data from a Run to files. We will do this in
two ways:

* The simple way, using methods like ``to_csv`` provided by standard scientific
  Python tools
* The "streaming" way, using Bluesky's Suitcases

Set up for Tutorial
-------------------

Before you begin, install ``databroker`` and ``databroker-pack``, following the
:doc:`install`.

Start your favorite interactive Python environment, such as ``ipython`` or
``jupyter lab``.

For this tutorial, we'll use a catalog of publicly available, openly licensed
sample data. Specifically, it is high-quality transmission XAS data from all
over the periodical table.

This utility downloads it and makes it discoverable to Databroker.

.. ipython:: python

   import databroker.tutorial_utils
   databroker.tutorial_utils.fetch_BMM_example()

Access the catalog as assign it to a variable for convenience.

.. ipython:: python

   import databroker
   catalog = databroker.catalog['bluesky-tutorial-BMM']

Let's take a Run from this Catalog.

.. ipython:: python

   run = catalog[23463]

What's in the Run?
------------------

The Run's "pretty display", shown by IPython and Jupyter and some other
similar tools, shows us a summary.

.. ipython:: python

   run

Each run contains logical "tables" of data called *streams*. We can see them in
the summary above, and we iterate over them programmatically with a ``for``
loop or with ``list``.

.. ipython:: python

   list(run)

Simple Export
-------------

Export to CSV or Excel
^^^^^^^^^^^^^^^^^^^^^^

CSV can be suitable small amounts of scalar data. It's not fast and it's not
particularly good way to store numeric data or rich metadata---but it is
universally understood and human-readable.

Here, we look at the columns in the primary stream and choose some to export to
CSV.

.. ipython:: python

   ds = run.primary.read()
   ds
   columns = ["I0", "It", "Ir", "dcm_energy"]  # columns we want to export
   df = ds[columns].to_dataframe()
   df
   # Setting index=False omits the "time" index on the left from the output.
   df.to_csv("data.csv", index=False)

If you target is to get data into Excel, note that you can write Excel files
directly. This requires an additional dependency that you may not already have
installed.

.. code:: python

   # Install Excel writer used by pandas using pip...
   pip install openpyxl
   # or conda...
   conda install -c conda-forge openpyxl

.. ipython:: python

   df.to_excel("data.xlsx", index=False)

Both of these methods have a large number of options to customize the output.
Use ``df.to_csv?`` (IPython, Jupyter) or ``help(df.to_csv)`` to learn more.
Likesie for ``df.to_excel``.

If you have many runs to do in batch, you may use the metadata to automatically
generate filenames. It is strongly recommended to include part of the globally
unique id, ``uid``, at the end to ensure that names do not clash and overwrite.

.. ipython:: python

   columns = ["I0", "It", "Ir", "dcm_energy"]
   results = catalog.search({"XDI.Element.symbol": "Mn"})
   for uid, run in results.items():
       ds = run.primary.read()
       df = ds[columns].to_dataframe()
       # Generate filename from metadata.
       md = run.metadata["start"]
       filename = f'Mn-spectra-{md["scan_id"]}-{md["uid"]:.8}.csv'
       df.to_csv(filename, index=False)
       print(f"Saved {filename}")

Export to HDF5
^^^^^^^^^^^^^^

HDF5 is suitable for image data. It is understood by most data analysis
software.

.. note::

   This example uses h5py.

   .. code::
   
      conda install h5py
   
      # or...
   
      pip install h5py

.. ipython:: python

   import h5py

   ds = run.primary.read()
   columns = ["I0", "It", "Ir", "dcm_energy"]  # columns we want to export
   with h5py.File("data.h5", "w") as file:
       for column in columns:
           file[column] = df[column]

Streaming Export
----------------

A tool built for streaming export can be used on both saved data (as we'll do
here) and on live-streaming data during data acquisition.

.. note::

   This example uses suitcase-csv.

   .. code::
   
      conda install -c nsls2forge suitcase-csv
   
      # or...
   
      pip install suitcase-csv

.. ipython:: python
   :okexcept:

   import suitcase.csv
   artifacts = suitcase.csv.export(run.documents(fill="yes"), "output_directory")
   artifacts

Note that this operates on the entire `run` and all of its streams. When a Run
contains multiple streams, multiple CSV files will be created. This is why it
acceps a path to a *directory* rather than a path to a single file. Any data
that does well-suited to the format (e.g. image data in this case) is omitted
for the export.

See `Suitcase`_ for a list of supported formats and more information.

.. _Suitcase: https://blueskyproject.io/suitcase

.. ipython:: python
   :suppress:

   # Clean up
   !rm data.csv
   !rm -rf Mn-spectra*
   !rm data.xlsx
   !rm data.h5
   !rm -rf output_directory