Bluesky Slides

The Bluesky Project Contributors

Bluesky originated at National Synchrotron Light Source II

("Giant X-ray beam")

NSLS-II is a "User Facility"

Scientific Staff spend 20% time on experiments, the rest on user support
Users mail samples or visit for ~1–10 days
28 active instruments ("beamlines"), 60-70 planned

Motivation

Build software that enables collaboration and specialization

Each beamline is one of a kind by design, so no software can be a complete solution for everyone.
Aim to enable beamlines to share yet also support their unique needs.
Use software design patterns that encourage building on a shared core.

Bluesky is designed in service to data analysis

When analyzing data we want....

To easily find the data we're looking for.
Access to that data, not particularly caring where it's stored or in what file format.
Well-structured data marked up with relevant context, to support easy and sometimes automated batch analysis.
Seamless integration with popular data analysis tools.

Bluesky may be used from IPython or from graphical user interfaces

At first we targeted command-line usage in IPython, thinking of users coming from SPEC.
Staff at our facility and at other facilities have built their own graphical applications that interface to parts of Bluesky. (See following slides.)
We are developing a toolbox of reusable components for building graphical interfaces for data acquisition, search, access, and visualization, supporting:
- Desktop applications
- Jupyter
- Web applications

Bluesky is a bridge to the open-source ecosystem

Figure Credit: "The Unexpected Effectiveness of Python in Science", PyCon 2017

Bluesky is written in Python, which is very popular

Figure Credit: Stack Overflow Blog https://stackoverflow.blog/2017/09/06/incredible-growth-python/

Bluesky is designed for the long term

Make it easy to keep file-reading and -writing code separate from scientific code.
Support streaming (live-updating) visualization and processing, and adaptive experiment steering.
Integrate well with web technologies and be cloud-friendly.
But also meet users where they are!

Bluesky has individually useful core components

Bluesky Run Engine: experiment orchestration and data acquisition
- Emits data and metadata in a streaming fashion as it's available during acquisition
- Centrally manages any interruptions (pause/resume), failure modes, etc.
Bluesky Plans: experiment sequences (e.g. scans)
- Designed with adaptive sequences in mind from the start
- Leaves room for hardware-triggered scanning — interesting new work happening here
Ophyd: interface to hardware
Suitcases
- streaming export to popular formats (e.g. CSV, TIFF, …; NeXus support in progress)
- or streaming save to lossless storage (e.g. MongoDB, msgpack, plus external large arrays)
Tiled and Databroker: search and access saved data

Other facilities have adopted Bluesky piecemeal, adapting, extending, or replacing components to meet their requirements.

List of facilities known to use Bluesky (1 of 2)

NSLS-II (used at 26/28 beamlines)
LCLS (widespread use) and SSRL (one or two instruments)
APS (scaling up from a couple beamlines to dozens)
Australian Synchrotron (several beamlines)
ALS (at least one beamline, also scaling up)
Diamond (evaluating, but has made significant development investments)

List of facilities known to use Bluesky (2 of 2)

Canadian Light Source (at least one)
PSI (evaluating, not yet committed to adoption)
Pohang Light Source II
Various academic labs
BESSY II

User Facilities Have a Data Problem

We can learn a lot from particle physics, astronomy, and climate science... but we have some unique problems too.

What changed to make data problems harder?

Sources got brighter; detectors got larger and faster: greater data velocity and volume.
This exposes the variety problem we have at user facilities:
- Large and changing collection of instruments
- Wide span of data rates, structures, and access patterns
- Mix of well-established data processing procedures and original, improvised techniques
Multi-modal analysis makes this an N^2, ... problem.

"Big data is whatever is larger
than your field is used to."

A spot check for data volume at an NSLS-II Project Beamline so far...

What changed to make data problems easier?

HPC is becoming more accessible.

One inviting example: jupyter.nersc.gov

Jupyter as a familiar, user-friendly portal
Dask for familiar numpy/pandas idioms distributed over many nodes

Also: Commodity cloud-based tools

Lately it's become more practical to work openly and collaboratively....

across instruments within a facility
between facilities
with outside communities with similar data problems (e.g. climate science)

...which is not a new idea, but ease-of-use matters.

Status Quo:
Data and Metadata are Scattered

Some critical context is only in people's heads
Many file formats (tif, cbf, NeXus, other HDF5, proprietary, ...)
meta_data_in_37K_fname_005_NaCl_cal.tif
"Magic numbers" buried in analysis tools
Notes in paper notebooks

What's the problem?

Not machine-readable or searchable
Relationship between any two pieces of data unclear
Inhibits multi-modal work
Inhibits code reuse
Not streaming friendly

What do we need to systematically track?

Experimental Data

Analysis needs more than "primary" data stream:

Timestamps
Secondary measurements
"Fixed" experimental values
Calibration / beam-line configuration data
Hardware settings
Hardware diagnostics
Physical details of the hardware

Sample Data

What is the sample?
What is the contrast mechanism?
Why are we looking at it?
How was it prepared?

Bureaucratic & Management Information

Where is the data and how to get it?
Who took the data?
Who owns or can access the data?
How long will we keep the data?

Design Goals

both technical and sociological

for an end-to-end data acquisition and analysis solution that leverages data science libraries

Technical Goals

Generic across science domains
Lightweight
Put metadata in a predictable place
Handle asynchronous data streams
Support multi-modal: simultaneous, cross-beamline, cross-facility
Support streaming
Cloud friendly
Integrate with third-party (meta)data sources

Sociological Goals

Overcome "not-invented-here"-ism.
Make co-developed but separately useful components with well-defined boundaries which can be adopted piecemeal by other facilities.
Drawing inspiration from the numpy project, embrace protocols and interfaces for interoperability.

Bluesky is designed for Distributed Collaboration

Facilities and instruments within a facility can share common components and benefit from a shared knowledge base and a shared code base
While also having room to innovate and specialize to suit their own priorities and timelines.
The scientific Python community is an example of how this can work well.
Use the parts of Bluesky that work for you a la carte, extend them, or replace them.

Bluesky is designed for Distributed Collaboration (cont.)

This is not an all-or-nothing framework that you have to buy into; it's a mini-ecosystem of co-developed but individually useful tools that you can build on.
It's all in Python. Some beamline staff and partner users have built on it.

Bluesky Architecture

Layered design of Python libraries that are:

co-developed and compatible...
...but individually usable and useful
with well-defined programmatic interfaces

Looking at each component, from the bottom up....

Device Drivers and Underlying Control Layer(s)

You might have a pile of hardware that communicates over one or more of:

Experimental Physics and Industrial Control System (EPICS)
LabView
Some other standard
Some vendor-specific, one-off serial or socket protocol

Ophyd abstracts over the specific control layer.

Ophyd: a hardware abstraction layer

Put the control layer behind a high-level interface with methods like trigger(), read(), and set(...).
Group individual signals into logical "Devices" to be configured and used as one unit.
Assign signals and devices human-friendly names that propagate into metadata.
Categorize signals by "kind" (primary reading, configuration, engineering/debugging).

Bluesky abstracts over hardware.

Bluesky: an experiment specification and orchestration engine

Specify the logic of an experiment in a hardware-abstracted way. Bluesky says a detector should be triggered; ophyd sorts out how.
First-class support for adaptive feedback between analysis and acquisition.
Data is emitted in a streaming fashion in standard Python data structures.
Pause/resume, robust error handling, and rich metadata capture are built in.
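
A rough sketch of "Bluesky says a detector should be triggered; ophyd sorts out how", in plain Python. The `Message` tuple, `step_scan`, `execute`, and the toy devices are all hypothetical simplifications, not the Bluesky API. The point is only that the experiment logic is a generator of instructions that can be inspected or executed.

```python
from collections import namedtuple

# Simplified stand-in for a plan message: what to do, to which object, with what arguments.
Message = namedtuple("Message", ["command", "obj", "args"])


class ToyMotor:
    def __init__(self, name):
        self.name = name
        self.position = 0.0

    def set(self, position):
        self.position = position


class ToyDetector:
    def __init__(self, name, motor):
        self.name = name
        self._motor = motor
        self._value = None

    def trigger(self):
        self._value = 2 * self._motor.position

    def read(self):
        return {self.name: self._value}


def step_scan(detector, motor, positions):
    """A plan is a generator: it describes the experiment but touches no hardware."""
    for position in positions:
        yield Message("set", motor, (position,))
        yield Message("trigger", detector, ())
        yield Message("read", detector, ())


def execute(plan):
    """Toy engine: carry out each message and gather the readings."""
    readings = []
    for message in plan:
        result = getattr(message.obj, message.command)(*message.args)
        if message.command == "read":
            readings.append(result)
    return readings


motor = ToyMotor("sample_x")
detector = ToyDetector("intensity", motor)

# Inspection: list the commands without moving anything.
commands = [m.command for m in step_scan(detector, motor, [1, 2])]
position_after_inspection = motor.position

# Execution: the same kind of plan, now actually carried out.
readings = execute(step_scan(detector, motor, [1, 2, 3]))
print(commands)
print(readings)
```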

Mix and match (or create your own) plans...

...and streaming-friendly viz...

...and streaming-friendly analysis

High Throughput

Bluesky can process 30k messages/second ("message" = trigger, read, save, ...).
Typically, the vast majority of its time is spent waiting for hardware to move or acquire.
To go faster than that, use kickoff ("Go!") — complete ("Call me when you're done.") — collect ("Read out data asynchronously.").

Suitcase encodes documents for storage or export.

Suitcase: store in any database or file format

Lossless storage: MongoDB, msgpack, JSONL
Lossy export: TIFF, CSV, specfile, ....
Documentation on how to write a suitcase for your own format
Use any transport you like. Write to disk (ordinary files), memory buffer, network socket, ....
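
A sketch of a lossless serializer, assuming documents arrive as (name, document) pairs of plain Python data. `JSONLSerializer` is a hypothetical class written for this example, not one of the suitcase packages, and the document contents are made up.

```python
import io
import json


class JSONLSerializer:
    """Hypothetical serializer: one JSON object per line, nothing dropped."""

    def __init__(self, buffer):
        # Any object with a write() method will do: file, memory buffer, socket wrapper.
        self._buffer = buffer

    def __call__(self, name, doc):
        self._buffer.write(json.dumps({"name": name, "doc": doc}) + "\n")


documents = [
    ("start", {"uid": "abc", "time": 0.0, "sample": "NaCl"}),
    ("event", {"uid": "def", "time": 1.0, "data": {"intensity": 5}}),
    ("stop", {"uid": "ghi", "time": 2.0, "exit_status": "success"}),
]

buffer = io.StringIO()
serializer = JSONLSerializer(buffer)
for name, doc in documents:
    serializer(name, doc)

# Lossless means the original stream can be rebuilt from what was written.
restored = [
    (item["name"], item["doc"])
    for item in map(json.loads, buffer.getvalue().splitlines())
]
print(restored == documents)
```

A lossy exporter (CSV, TIFF) would have the same call signature but keep only part of each document.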

DataBroker provides search and access to stored data.

DataBroker takes the hassle out of data access.

An API on top of a database and/or file.
Search user-provided and automatically-captured metadata.
Exactly the same layout originally emitted by Bluesky, so consumer code does not distinguish between "online" and saved data
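
A toy illustration of searching over both kinds of metadata. The `catalog` list and the `search` function are hypothetical and are not the DataBroker API; in this made-up data, `sample` and `operator` play the role of user-provided metadata while `uid`, `time`, and `plan_name` play the role of automatically captured metadata.

```python
catalog = [
    {"uid": "a1", "time": 100.0, "sample": "NaCl", "operator": "alice", "plan_name": "scan"},
    {"uid": "b2", "time": 200.0, "sample": "NaCl", "operator": "bob", "plan_name": "count"},
    {"uid": "c3", "time": 300.0, "sample": "Si", "operator": "alice", "plan_name": "scan"},
]


def search(runs, since=None, **metadata):
    """Return runs matching every key=value pair, optionally restricted by time."""
    matches = []
    for run in runs:
        if since is not None and run["time"] < since:
            continue
        if all(run.get(key) == value for key, value in metadata.items()):
            matches.append(run)
    return matches


nacl_scans = search(catalog, sample="NaCl", plan_name="scan")
recent_alice = search(catalog, since=150.0, operator="alice")
print([run["uid"] for run in nacl_scans], [run["uid"] for run in recent_alice])
```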

Keep I/O Separate from Science Logic!

Interfaces, not File Formats

The system is unopinionated about data formats.
Can change storage with no change to consumer code.
Any file I/O happens transparently: the user never sees files, just gets data in memory (e.g. a numpy array, a mapping with labeled metadata).
Your detector writes in a special format?
Register a custom reader.
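
One way to picture "register a custom reader", as a sketch. The `READERS` registry, `register_reader`, `load`, and the two spec names are hypothetical, invented for this example. The consumer asks for data and gets a plain list back, whichever format is on disk.

```python
import json
import os
import tempfile

READERS = {}  # maps a format label to a function that returns in-memory data


def register_reader(spec, func):
    READERS[spec] = func


def load(resource):
    """Consumer-facing call: no file format appears in the calling code."""
    reader = READERS[resource["spec"]]
    return reader(resource["path"])


def read_csv_row(path):
    with open(path) as f:
        return [float(x) for x in f.read().strip().split(",")]


def read_json_list(path):
    with open(path) as f:
        return [float(x) for x in json.load(f)]


register_reader("CSV_ROW", read_csv_row)
register_reader("JSON_LIST", read_json_list)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "frame.csv")
    json_path = os.path.join(tmp, "frame.json")
    with open(csv_path, "w") as f:
        f.write("1,2,3")
    with open(json_path, "w") as f:
        json.dump([1, 2, 3], f)

    from_csv = load({"spec": "CSV_ROW", "path": csv_path})
    from_json = load({"spec": "JSON_LIST", "path": json_path})

# Same data in memory, regardless of how it was stored.
print(from_csv == from_json)
```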

Embrace Interfaces

The most important aspect of the Bluesky architecture is its well-defined protocols and interfaces.

Interfaces enable:

Interoperable tools without explicit coordination
Unforeseen applications

Interface Example: Iteration in Python

for x in range(10):
    ...

class MyObject:
    def __iter__(self):
        # Any object whose __iter__ returns an iterator works in a for loop.
        return iter(range(10))

for x in MyObject():
    ...

Interface Example: numpy array protocol

import pandas
import numpy

df = pandas.DataFrame({'intensity': [1,1,2,3]})
numpy.sum(df)

Interfaces in Bluesky

Event Model — connects data producers to consumers
Message protocol — connects experiment sequencing with inspection and execution
Ophyd hardware abstraction — connects what you want to do to how to do it

Embrace Layered extendable code

Embrace Community Open-Source Processes

Work openly

Use version control.
Make new work public from the start.
Put ideas and roadmaps on GitHub issues where others can search, read, comment.

Build a lasting collaboration

Governance model (in process)
Maintainers: per repo, make day-to-day decisions and set processes as appropriate to the repo
Technical Steering Committee: arbitrate when maintainers cannot reach rough consensus
Project Advisory Board: management-level stakeholders, oversee big-picture priorities
Currently in process of assembling these groups

Automated tests are essential

They enable people to try new ideas with confidence.

Ensure that we don't accidentally break our ability to recreate important results.
Ensure that my "improvement" won't accidentally break your research code by protecting it with tests that verify key results.
Continuous Integration services ensure the tests always get run on every proposed change.

Good, current documentation is essential.

It convinces people that it might be easier to learn your thing than to write their own.

Complete installation instructions
Fully worked examples
Tools for simulating data or public links to example data sets

Event Model

Minimalist and Extensible

Every document has a unique ID and a timestamp.
Specific domains, facilities, collaborations, research groups can overlay schemas implementing their own standards (e.g. SciData, PIF).
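
A simplified sketch of what such a document stream could look like. The helper names `new_doc` and `emit_run` are hypothetical, and the documents are pared down for illustration (for example, `data_keys` is a plain list of names here); this is not the full Event Model schema. Extra keys such as `sample` show how a group could overlay its own metadata.

```python
import time
import uuid


def new_doc(**fields):
    """Every document gets a unique ID and a timestamp."""
    return {"uid": str(uuid.uuid4()), "time": time.time(), **fields}


def emit_run(readings, **metadata):
    """Yield (name, document) pairs for one run, in order."""
    start = new_doc(**metadata)
    yield "start", start
    descriptor = new_doc(run_start=start["uid"], data_keys=sorted(readings[0]))
    yield "descriptor", descriptor
    for seq_num, data in enumerate(readings, start=1):
        yield "event", new_doc(descriptor=descriptor["uid"], seq_num=seq_num, data=data)
    yield "stop", new_doc(run_start=start["uid"], exit_status="success")


docs = list(emit_run([{"intensity": 1}, {"intensity": 2}], sample="NaCl"))
names = [name for name, _ in docs]
uids = {doc["uid"] for _, doc in docs}
print(names, len(uids))
```

Documents refer to one another by unique ID, so a consumer can reassemble a run even if the documents arrive one at a time.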

Bluesky emits documents, streamed or in batches

Bluesky is responsible for organizing metadata and readings from hardware into valid documents.
Sometimes the readings come one at a time and Events are emitted steadily during an experiment.
In special applications (commonly, fly scans) the readings come from the hardware in bulk and Events are emitted in batch(es).

Fly Scans

For high performance fly scanning, coordination is needed "below" Bluesky in hardware.
Bluesky simply provides a way to:
- Configure
- Start ("Kickoff")
- Incrementally collect data ("Collect")
- Initiate or await completion ("Complete")
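
The kickoff / complete / collect steps above can be sketched with a toy device. `FakeFlyer` is hypothetical and much simpler than a real flyer: a background thread stands in for the hardware-coordinated acquisition, and complete() here simply blocks until it finishes.

```python
import threading


class FakeFlyer:
    """Hypothetical flyer: acquisition runs on its own, data is read out in bulk."""

    def __init__(self, n_points):
        self._n_points = n_points
        self._buffer = []
        self._thread = None

    def kickoff(self):
        # "Go!": start acquisition in the background and return immediately.
        self._thread = threading.Thread(target=self._acquire)
        self._thread.start()

    def _acquire(self):
        # Stand-in for the hardware filling its own buffer during the fly scan.
        for i in range(self._n_points):
            self._buffer.append({"seq_num": i + 1, "data": {"index": i}})

    def complete(self):
        # "Call me when you're done": wait for the acquisition to finish.
        self._thread.join()

    def collect(self):
        # Hand over the buffered readings as a batch of events.
        while self._buffer:
            yield self._buffer.pop(0)


flyer = FakeFlyer(n_points=5)
flyer.kickoff()
flyer.complete()
events = list(flyer.collect())
print(len(events))
```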

The status quo in Bluesky is very coarse.
Highly flexible (good place to start...)
But each fly scan application is built from scratch (leads to duplicated efforts)

This is an area of very active development in Bluesky.

Coordinated efforts underway at:

Diamond Light Source
Australian Synchrotron
NSLS-II

Diamond has invested a decade of research into fly-scanning in previous Python projects.
Prototype from Diamond applying this expertise in a Bluesky-compatible way: Bluesky
Work in progress to integrate this with Bluesky itself: bluesky PR#1502

Adaptive experiments

Feedback Paths

prompt / real-time analysis to steer experiment
"human-in-the-loop"
"computer-in-the-loop"
data quality checks

Scales of Adaptive-ness

below bluesky & ophyd
in bluesky plans, but without generating an event
providing feedback on a per-event basis
providing feedback on a per-run / start basis
providing feedback across many runs
asynchronous and decoupled feedback

below bluesky & ophyd

timescale: ≫ 10Hz
very limited time budget for analysis
very limited access to data
tightly coupled to hardware (PID loop, FPGA)
expensive to develop

in bluesky plans, but without generating an event

timescale: 1-10Hz
limited time budget for analysis
limited access to data
logic implemented in Python in acquisition process
coupled to hardware
can be used for filtering

providing feedback on a per-event basis

timescale: 1-5s
modest time budget for analysis
access to "single point" of data (& cache)
run in or out of acquisition process
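
A toy example of per-event feedback, assuming the only input is the latest single point (plus the previous one, as a cache). `adaptive_scan` and `measure` are hypothetical names for this sketch and are not the bluesky-adaptive API: the step size shrinks where the signal changes quickly and grows back where it is flat.

```python
def adaptive_scan(measure, start, stop, min_step, max_step, target_delta):
    """Choose each next position from the most recent reading."""
    position = start
    step = max_step
    previous = None
    points = []
    while position <= stop:
        value = measure(position)
        points.append((position, value))
        if previous is not None:
            if abs(value - previous) > target_delta:
                step = max(min_step, step / 2)  # signal is changing: look closer
            else:
                step = min(max_step, step * 2)  # signal is flat: move faster
        previous = value
        position += step
    return points


def measure(x):
    """Made-up signal: flat, then a ramp between x=4 and x=6, then flat again."""
    if x < 4:
        return 0.0
    if x <= 6:
        return 10.0 * (x - 4)
    return 20.0


points = adaptive_scan(measure, start=0.0, stop=10.0,
                       min_step=0.25, max_step=1.0, target_delta=1.0)
positions = [p for p, _ in points]
print(positions)
```

The points bunch up on the ramp and spread out again on the flat regions. The same logic could run inside or outside the acquisition process, as long as it sees each reading promptly.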

providing feedback on a per-run / start basis

timescale: 5-60s
modest time budget for analysis
access to "full scan" data (& cache)
run in or out of acquisition process

providing feedback across many runs

timescale: ∞
arbitrary compute budget
access to all historical data
multi-modal

asynchronous and decoupled feedback

Is the beam up?
Is the shutter open?
Is the sample still in the beam?
Do we have enough data on this sample?
Is the sample toast?

Docs with theory and examples:

bluesky/bluesky-adaptive

"Queue server": An Editable Control Queue

Bluesky Queue Server

Support remote and multi-tenant data acquisition
Documentation: bluesky-queueserver
Has been used for user experiments
Still under rapid development

Bluesky's first target was users coming from SPEC

Primarily an interactive, human-in-the-loop workflow
REPL (Terminal) based
User has exclusive control, enforced by physical presence

New Capability: Editable Control Queue

Provides an editable queue of Bluesky plans to run
All the same Bluesky plans (experiment procedures) work
All the same Ophyd devices work
Can be easily populated from a user's Excel spreadsheet
You can safely mutate—rearrange and edit items—during acquisition
Well suited to graphical interfaces
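
A sketch of what "editable" could mean in practice. `EditableQueue` is a hypothetical class written for this example, not the bluesky-queueserver API; the lock is one simple way to let a client rearrange or edit pending items while another thread is taking items off the front.

```python
import threading


class EditableQueue:
    """Hypothetical queue of plan items that can be changed while running."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def add(self, plan_name, **kwargs):
        with self._lock:
            self._items.append({"plan": plan_name, "kwargs": kwargs})

    def move(self, old_index, new_index):
        with self._lock:
            self._items.insert(new_index, self._items.pop(old_index))

    def edit(self, index, **kwargs):
        with self._lock:
            self._items[index]["kwargs"].update(kwargs)

    def pop_next(self):
        with self._lock:
            return self._items.pop(0) if self._items else None

    def pending(self):
        with self._lock:
            return [item["plan"] for item in self._items]


queue = EditableQueue()
queue.add("count", num=5)
queue.add("scan", start=0, stop=10)
queue.add("count", num=1)

running = queue.pop_next()  # the first item is now "being acquired"
queue.move(1, 0)            # meanwhile, promote the last item...
queue.edit(0, num=3)        # ...and change its parameters
pending = queue.pending()
next_item = queue.pop_next()
print(running["plan"], pending, next_item)
```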

Separation between user app and queue server

If app is closed or crashes, acquisition continues. Just restart app to reconnect.
The app can provide access controls
- Guiderails (avoid too many options)
- Security
App can run on different machine from queue
Many client programs can be used simultaneously to monitor or control the queue (web, desktop GUI, commandline)