Important

You can run this notebook in a live session Binder or view it on nbviewer or GitHub.

Process Tabular Data with Pandas

In this notebook you will:

  • Load a data from a scan.

  • Visualize it.

  • Explore the dataset, performing basic filtering and statistic operations using pandas.

Recommended Prerequisites:

Configuration

Below, we will connect to EPICS IOC(s) controlling simulated hardware in lieu of actual motors, detectors. The IOCs should already be running in the background. Run this command to verify that they are running: it should produce output with RUNNING on each line. In the event of a problem, edit this command to replace status with restart all and run again.

[1]:
!supervisorctl -c supervisor/supervisord.conf status
decay                            RUNNING   pid 4872, uptime 0:03:39
mini_beamline                    RUNNING   pid 4873, uptime 0:03:39
random_walk                      RUNNING   pid 4874, uptime 0:03:39
random_walk_horiz                RUNNING   pid 4875, uptime 0:03:39
random_walk_vert                 RUNNING   pid 4876, uptime 0:03:39
simple                           RUNNING   pid 4877, uptime 0:03:39
thermo_sim                       RUNNING   pid 4878, uptime 0:03:39
trigger_with_pc                  RUNNING   pid 4879, uptime 0:03:39
[2]:
%run scripts/beamline_configuration.py

Quickly access and explore the data

Let’s take sample data.

[3]:
# Creating a figure explicitly in advance helps with the
# top-to-bottom flow of this notebook, but it is not necessary.
# If this is omitted, bluesky will cause a figure to appear
# during the RE(...) execution below.
plt.figure('I slit_det vs motor_slit')
[3]:
<Figure size 640x480 with 0 Axes>
[4]:
RE(scan([slit, I], motor_slit, -10, 10, 15))


Transient Scan ID: 26     Time: 2020-06-05 15:46:56
Persistent Unique Scan ID: '73c069d5-d686-47b0-9603-9ac7f503c3d0'
New stream: 'primary'
+-----------+------------+------------+------------+------------+
|   seq_num |       time | motor_slit |          I |   slit_det |
+-----------+------------+------------+------------+------------+
|         1 | 15:46:56.8 |    -10.000 |        501 |       7827 |
|         2 | 15:46:56.9 |     -8.571 |        509 |      27787 |
/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/bluesky/utils.py:1502: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations.
  f_mgr.canvas.draw()
|         3 | 15:46:57.0 |     -7.143 |        509 |      58986 |
|         4 | 15:46:57.0 |     -5.714 |        512 |      86939 |
|         5 | 15:46:57.0 |     -4.286 |        512 |      98383 |
|         6 | 15:46:57.1 |     -2.857 |        512 |     102131 |
|         7 | 15:46:57.1 |     -1.429 |        512 |     102556 |
|         8 | 15:46:57.1 |      0.000 |        512 |     101566 |
|         9 | 15:46:57.1 |      1.429 |        512 |     102500 |
|        10 | 15:46:57.1 |      2.857 |        516 |     102003 |
|        11 | 15:46:57.1 |      4.286 |        516 |     100031 |
|        12 | 15:46:57.1 |      5.714 |        516 |      86773 |
|        13 | 15:46:57.1 |      7.143 |        516 |      59664 |
|        14 | 15:46:57.1 |      8.571 |        516 |      27770 |
|        15 | 15:46:57.1 |     10.000 |        516 |       7975 |
+-----------+------------+------------+------------+------------+
generator scan ['73c069d5'] (scan num: 26)



[4]:
('73c069d5-d686-47b0-9603-9ac7f503c3d0',)
[5]:
plt.gcf()  # Display a snapshot of the current state of the figure.
[5]:
_images/Process_Tabular_Data_with_Pandas_6_0.png

Previously we used the syntax db(...) to search for Headers. We looped through each Header and extracted some of the metadata encapsulated in it. We use the shorthand db[-N] to retrieve exactly one header, the one corresponding to N runs ago.

[6]:
header = db[-1]  # Header for the most recent one (i.e. 1 run ago)

We can access the data as a table. This object is called a pandas.DataFrame, a high-performance, spreadsheet-like object in Python.

[7]:
data = header.table()

It display prettily, truncating if needed.

[8]:
data
[8]:
time I motor_slit slit_det
seq_num
1 2020-06-05 15:46:56.899140358 501.044141 -10.000000 7827.0
2 2020-06-05 15:46:56.956779242 508.758907 -8.571429 27787.0
3 2020-06-05 15:46:57.009317636 508.758907 -7.142857 58986.0
4 2020-06-05 15:46:57.053654432 512.337328 -5.714286 86939.0
5 2020-06-05 15:46:57.092668772 512.337328 -4.285714 98383.0
6 2020-06-05 15:46:57.102864981 512.337328 -2.857143 102131.0
7 2020-06-05 15:46:57.112307549 512.337328 -1.428571 102556.0
8 2020-06-05 15:46:57.122255802 512.337328 0.000000 101566.0
9 2020-06-05 15:46:57.131747723 512.337328 1.428571 102500.0
10 2020-06-05 15:46:57.141118526 515.595164 2.857143 102003.0
11 2020-06-05 15:46:57.150413990 515.595164 4.285714 100031.0
12 2020-06-05 15:46:57.159682989 515.595164 5.714286 86773.0
13 2020-06-05 15:46:57.168794870 515.595164 7.142857 59664.0
14 2020-06-05 15:46:57.177879095 515.595164 8.571429 27770.0
15 2020-06-05 15:46:57.191147327 515.595164 10.000000 7975.0

For large tables, it’s handy to know how to ask for just the first or last couple rows.

It is easy to visualize.

[9]:
data.plot(x='motor_slit', y='slit_det')
[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa6a5938dd8>
[10]:
plt.gcf()  # Display a snapshot of the current state of the figure.
[10]:
_images/Process_Tabular_Data_with_Pandas_16_0.png
[11]:
data.head()
[11]:
time I motor_slit slit_det
seq_num
1 2020-06-05 15:46:56.899140358 501.044141 -10.000000 7827.0
2 2020-06-05 15:46:56.956779242 508.758907 -8.571429 27787.0
3 2020-06-05 15:46:57.009317636 508.758907 -7.142857 58986.0
4 2020-06-05 15:46:57.053654432 512.337328 -5.714286 86939.0
5 2020-06-05 15:46:57.092668772 512.337328 -4.285714 98383.0
[12]:
data.tail()
[12]:
time I motor_slit slit_det
seq_num
11 2020-06-05 15:46:57.150413990 515.595164 4.285714 100031.0
12 2020-06-05 15:46:57.159682989 515.595164 5.714286 86773.0
13 2020-06-05 15:46:57.168794870 515.595164 7.142857 59664.0
14 2020-06-05 15:46:57.177879095 515.595164 8.571429 27770.0
15 2020-06-05 15:46:57.191147327 515.595164 10.000000 7975.0

Statistics on columns are simple to compute.

[13]:
data.mean()
[13]:
I             5.124105e+02
motor_slit    4.736952e-16
slit_det      7.152607e+04
dtype: float64
[14]:
data.min()
[14]:
time          2020-06-05 15:46:56.899140358
I                                   501.044
motor_slit                              -10
slit_det                               7827
dtype: object

Or just ask for all the common stats at once:

[15]:
data.describe()
[15]:
I motor_slit slit_det
count 15.000000 1.500000e+01 15.000000
mean 512.410460 4.736952e-16 71526.066667
std 3.934708 6.388766e+00 36757.864353
min 501.044141 -1.000000e+01 7827.000000
25% 512.337328 -5.000000e+00 43386.500000
50% 512.337328 0.000000e+00 86939.000000
75% 515.595164 5.000000e+00 101784.500000
max 515.595164 1.000000e+01 102556.000000

Do analysis, and plot

Normalize by beam current

We can do math on the columns, and we can add new columns.

[16]:
data['normalized'] = data['slit_det'] / data['I'] * data['I'].mean()

We can plot them conveniently, addressing the columns by name.

[17]:
plt.figure()
[17]:
<Figure size 640x480 with 0 Axes>
[18]:
plt.plot('motor_slit', 'slit_det', data=data, label='raw')
plt.plot('motor_slit', 'normalized', data=data, label='normalized')
plt.legend()
plt.gcf()
[18]:
_images/Process_Tabular_Data_with_Pandas_29_0.png

Select only rows with high beam current

Select rows where the current is above 500.

[19]:
data[data['I'] > 500]
[19]:
time I motor_slit slit_det normalized
seq_num
1 2020-06-05 15:46:56.899140358 501.044141 -10.000000 7827.0 8004.557569
2 2020-06-05 15:46:56.956779242 508.758907 -8.571429 27787.0 27986.437716
3 2020-06-05 15:46:57.009317636 508.758907 -7.142857 58986.0 59409.364636
4 2020-06-05 15:46:57.053654432 512.337328 -5.714286 86939.0 86951.409882
5 2020-06-05 15:46:57.092668772 512.337328 -4.285714 98383.0 98397.043426
6 2020-06-05 15:46:57.102864981 512.337328 -2.857143 102131.0 102145.578425
7 2020-06-05 15:46:57.112307549 512.337328 -1.428571 102556.0 102570.639090
8 2020-06-05 15:46:57.122255802 512.337328 0.000000 101566.0 101580.497775
9 2020-06-05 15:46:57.131747723 512.337328 1.428571 102500.0 102514.631097
10 2020-06-05 15:46:57.141118526 515.595164 2.857143 102003.0 101372.952807
11 2020-06-05 15:46:57.150413990 515.595164 4.285714 100031.0 99413.133361
12 2020-06-05 15:46:57.159682989 515.595164 5.714286 86773.0 86237.024734
13 2020-06-05 15:46:57.168794870 515.595164 7.142857 59664.0 59295.470293
14 2020-06-05 15:46:57.177879095 515.595164 8.571429 27770.0 27598.471608
15 2020-06-05 15:46:57.191147327 515.595164 10.000000 7975.0 7925.740406

…or, where the the current is above the 30th percentile.

[20]:
data[data['I'] > data['I'].quantile(0.30)]
[20]:
time I motor_slit slit_det normalized
seq_num
10 2020-06-05 15:46:57.141118526 515.595164 2.857143 102003.0 101372.952807
11 2020-06-05 15:46:57.150413990 515.595164 4.285714 100031.0 99413.133361
12 2020-06-05 15:46:57.159682989 515.595164 5.714286 86773.0 86237.024734
13 2020-06-05 15:46:57.168794870 515.595164 7.142857 59664.0 59295.470293
14 2020-06-05 15:46:57.177879095 515.595164 8.571429 27770.0 27598.471608
15 2020-06-05 15:46:57.191147327 515.595164 10.000000 7975.0 7925.740406

…and plot just those rows, using a marker so it is easy to see which points have been disqualified due to low current.

[21]:
data[data['I'] > data['I'].quantile(0.30)].plot(x='motor_slit', y='slit_det', marker='o')
[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa65ffe37b8>

Histogram the current

[22]:
plt.figure()
[22]:
<Figure size 640x480 with 0 Axes>
[23]:
plt.hist(data['I'], bins='auto')
plt.xlabel('current')
plt.ylabel('count')
plt.gcf()
[23]:
_images/Process_Tabular_Data_with_Pandas_39_0.png

Not enough data to be interesting. Try collecting a bunch of current data by itself and then try this again.

[24]:
# Creating a figure explicitly in advance helps with the
# top-to-bottom flow of this notebook, but it is not necessary.
# If this is omitted, bluesky will cause a figure to appear
# during the RE(...) execution below.
plt.figure("I vs time")
[24]:
<Figure size 640x480 with 0 Axes>
[25]:
RE(count([I], num=200, delay=0.1))


Transient Scan ID: 27     Time: 2020-06-05 15:46:59
Persistent Unique Scan ID: 'f8606814-09a4-4f3e-80ec-5f2b44554eac'
New stream: 'primary'
+-----------+------------+------------+
|   seq_num |       time |          I |
+-----------+------------+------------+
|         1 | 15:46:59.2 |        484 |
|         2 | 15:46:59.3 |        481 |
/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/bluesky/utils.py:1502: UserWarning: Tight layout not applied. The bottom and top margins cannot be made large enough to accommodate all axes decorations.
  f_mgr.canvas.draw()
|         3 | 15:46:59.4 |        479 |
|         4 | 15:46:59.5 |        477 |
|         5 | 15:46:59.6 |        476 |
|         6 | 15:46:59.7 |        475 |
|         7 | 15:46:59.8 |        475 |
|         8 | 15:46:59.9 |        477 |
|         9 | 15:47:00.0 |        477 |
|        10 | 15:47:00.1 |        481 |
|        11 | 15:47:00.2 |        484 |
|        12 | 15:47:00.3 |        487 |
|        13 | 15:47:00.4 |        490 |
|        14 | 15:47:00.5 |        494 |
|        15 | 15:47:00.6 |        498 |
|        16 | 15:47:00.7 |        502 |
|        17 | 15:47:00.8 |        506 |
|        18 | 15:47:00.9 |        510 |
|        19 | 15:47:01.0 |        513 |
|        20 | 15:47:01.1 |        516 |
|        21 | 15:47:01.2 |        519 |
|        22 | 15:47:01.3 |        521 |
|        23 | 15:47:01.4 |        523 |
|        24 | 15:47:01.5 |        524 |
|        25 | 15:47:01.6 |        525 |
|        26 | 15:47:01.7 |        525 |
|        27 | 15:47:01.8 |        524 |
|        28 | 15:47:01.9 |        523 |
|        29 | 15:47:02.1 |        521 |
|        30 | 15:47:02.1 |        519 |
|        31 | 15:47:02.2 |        516 |
|        32 | 15:47:02.4 |        513 |
|        33 | 15:47:02.4 |        509 |
|        34 | 15:47:02.6 |        505 |
|        35 | 15:47:02.7 |        501 |
|        36 | 15:47:02.8 |        497 |
|        37 | 15:47:02.9 |        494 |
|        38 | 15:47:03.0 |        490 |
|        39 | 15:47:03.1 |        486 |
|        40 | 15:47:03.2 |        483 |
|        41 | 15:47:03.3 |        480 |
|        42 | 15:47:03.4 |        478 |
|        43 | 15:47:03.5 |        477 |
|        44 | 15:47:03.6 |        476 |
|        45 | 15:47:03.7 |        475 |
|        46 | 15:47:03.8 |        475 |
|        47 | 15:47:03.9 |        476 |
|        48 | 15:47:04.0 |        477 |
|        49 | 15:47:04.1 |        479 |
+-----------+------------+------------+
|   seq_num |       time |          I |
+-----------+------------+------------+
|        50 | 15:47:04.2 |        482 |
|        51 | 15:47:04.3 |        485 |
|        52 | 15:47:04.4 |        488 |
|        53 | 15:47:04.5 |        492 |
|        54 | 15:47:04.6 |        495 |
|        55 | 15:47:04.7 |        499 |
|        56 | 15:47:04.8 |        503 |
|        57 | 15:47:04.9 |        507 |
|        58 | 15:47:05.0 |        511 |
|        59 | 15:47:05.1 |        514 |
|        60 | 15:47:05.2 |        517 |
|        61 | 15:47:05.3 |        520 |
|        62 | 15:47:05.4 |        522 |
|        63 | 15:47:05.5 |        524 |
|        64 | 15:47:05.6 |        525 |
|        65 | 15:47:05.7 |        525 |
|        66 | 15:47:05.8 |        525 |
|        67 | 15:47:05.9 |        524 |
|        68 | 15:47:06.0 |        522 |
|        69 | 15:47:06.1 |        520 |
|        70 | 15:47:06.2 |        518 |
|        71 | 15:47:06.3 |        515 |
|        72 | 15:47:06.4 |        512 |
|        73 | 15:47:06.5 |        508 |
|        74 | 15:47:06.6 |        504 |
|        75 | 15:47:06.7 |        500 |
|        76 | 15:47:06.8 |        496 |
|        77 | 15:47:07.0 |        492 |
|        78 | 15:47:07.0 |        489 |
|        79 | 15:47:07.1 |        485 |
|        80 | 15:47:07.2 |        482 |
|        81 | 15:47:07.3 |        480 |
|        82 | 15:47:07.4 |        478 |
|        83 | 15:47:07.6 |        476 |
|        84 | 15:47:07.6 |        475 |
|        85 | 15:47:07.7 |        475 |
|        86 | 15:47:07.9 |        475 |
|        87 | 15:47:07.9 |        476 |
|        88 | 15:47:08.0 |        478 |
|        89 | 15:47:08.2 |        480 |
|        90 | 15:47:08.3 |        483 |
|        91 | 15:47:08.4 |        486 |
|        92 | 15:47:08.5 |        489 |
|        93 | 15:47:08.6 |        493 |
|        94 | 15:47:08.7 |        497 |
|        95 | 15:47:08.8 |        501 |
|        96 | 15:47:08.9 |        505 |
|        97 | 15:47:09.0 |        508 |
|        98 | 15:47:09.1 |        512 |
|        99 | 15:47:09.2 |        515 |
+-----------+------------+------------+
|   seq_num |       time |          I |
+-----------+------------+------------+
|       100 | 15:47:09.3 |        518 |
|       101 | 15:47:09.4 |        523 |
|       102 | 15:47:09.5 |        523 |
|       103 | 15:47:09.6 |        524 |
|       104 | 15:47:09.7 |        525 |
|       105 | 15:47:09.8 |        525 |
|       106 | 15:47:09.9 |        525 |
|       107 | 15:47:10.0 |        523 |
|       108 | 15:47:10.1 |        522 |
|       109 | 15:47:10.2 |        520 |
|       110 | 15:47:10.3 |        514 |
|       111 | 15:47:10.4 |        514 |
|       112 | 15:47:10.5 |        510 |
|       113 | 15:47:10.6 |        503 |
|       114 | 15:47:10.7 |        499 |
|       115 | 15:47:10.8 |        495 |
|       116 | 15:47:10.9 |        491 |
|       117 | 15:47:11.0 |        488 |
|       118 | 15:47:11.1 |        484 |
|       119 | 15:47:11.2 |        482 |
|       120 | 15:47:11.3 |        479 |
|       121 | 15:47:11.4 |        477 |
|       122 | 15:47:11.5 |        476 |
|       123 | 15:47:11.6 |        475 |
|       124 | 15:47:11.7 |        475 |
|       125 | 15:47:11.8 |        476 |
|       126 | 15:47:11.9 |        477 |
|       127 | 15:47:12.0 |        478 |
|       128 | 15:47:12.1 |        481 |
|       129 | 15:47:12.2 |        483 |
|       130 | 15:47:12.3 |        487 |
|       131 | 15:47:12.4 |        490 |
|       132 | 15:47:12.5 |        494 |
|       133 | 15:47:12.6 |        498 |
|       134 | 15:47:12.7 |        502 |
|       135 | 15:47:12.8 |        506 |
|       136 | 15:47:12.9 |        509 |
|       137 | 15:47:13.0 |        513 |
|       138 | 15:47:13.1 |        516 |
|       139 | 15:47:13.2 |        519 |
|       140 | 15:47:13.3 |        521 |
|       141 | 15:47:13.4 |        523 |
|       142 | 15:47:13.5 |        524 |
|       143 | 15:47:13.7 |        525 |
|       144 | 15:47:13.7 |        525 |
|       145 | 15:47:13.8 |        524 |
|       146 | 15:47:14.0 |        523 |
|       147 | 15:47:14.0 |        521 |
|       148 | 15:47:14.2 |        519 |
|       149 | 15:47:14.3 |        516 |
+-----------+------------+------------+
|   seq_num |       time |          I |
+-----------+------------+------------+
|       150 | 15:47:14.4 |        513 |
|       151 | 15:47:14.5 |        509 |
|       152 | 15:47:14.6 |        505 |
|       153 | 15:47:14.7 |        501 |
|       154 | 15:47:14.8 |        497 |
|       155 | 15:47:14.9 |        494 |
|       156 | 15:47:15.0 |        490 |
|       157 | 15:47:15.1 |        486 |
|       158 | 15:47:15.2 |        483 |
|       159 | 15:47:15.3 |        481 |
|       160 | 15:47:15.4 |        478 |
|       161 | 15:47:15.5 |        477 |
|       162 | 15:47:15.6 |        476 |
|       163 | 15:47:15.7 |        475 |
|       164 | 15:47:15.8 |        475 |
|       165 | 15:47:15.9 |        476 |
|       166 | 15:47:16.0 |        477 |
|       167 | 15:47:16.1 |        479 |
|       168 | 15:47:16.2 |        482 |
|       169 | 15:47:16.3 |        485 |
|       170 | 15:47:16.4 |        488 |
|       171 | 15:47:16.5 |        491 |
|       172 | 15:47:16.6 |        495 |
|       173 | 15:47:16.7 |        499 |
|       174 | 15:47:16.8 |        503 |
|       175 | 15:47:16.9 |        507 |
|       176 | 15:47:17.0 |        511 |
|       177 | 15:47:17.1 |        514 |
|       178 | 15:47:17.2 |        517 |
|       179 | 15:47:17.3 |        520 |
|       180 | 15:47:17.4 |        522 |
|       181 | 15:47:17.5 |        524 |
|       182 | 15:47:17.6 |        525 |
|       183 | 15:47:17.7 |        525 |
|       184 | 15:47:17.8 |        525 |
|       185 | 15:47:17.9 |        524 |
|       186 | 15:47:18.0 |        522 |
|       187 | 15:47:18.1 |        520 |
|       188 | 15:47:18.3 |        515 |
|       189 | 15:47:18.3 |        515 |
|       190 | 15:47:18.5 |        508 |
|       191 | 15:47:18.5 |        508 |
|       192 | 15:47:18.6 |        504 |
|       193 | 15:47:18.8 |        500 |
|       194 | 15:47:18.9 |        492 |
|       195 | 15:47:19.0 |        492 |
|       196 | 15:47:19.1 |        485 |
|       197 | 15:47:19.2 |        482 |
|       198 | 15:47:19.3 |        480 |
|       199 | 15:47:19.4 |        478 |
+-----------+------------+------------+
|   seq_num |       time |          I |
+-----------+------------+------------+
|       200 | 15:47:19.5 |        476 |
+-----------+------------+------------+
generator count ['f8606814'] (scan num: 27)



[25]:
('f8606814-09a4-4f3e-80ec-5f2b44554eac',)
[26]:
plt.gcf()  # Display a snapshot of the current state of the figure.
[26]:
_images/Process_Tabular_Data_with_Pandas_43_0.png
[27]:
plt.figure()
[27]:
<Figure size 640x480 with 0 Axes>
[28]:
more_current_data = db[-1].table()
plt.hist(more_current_data['I'], bins='auto')
plt.xlabel('current')
plt.ylabel('count')
plt.gcf()
[28]:
_images/Process_Tabular_Data_with_Pandas_45_0.png

Exercises

Spend some time reading 10 minutes to pandas, and use this notebook to try some of the examples there.

[29]:
# Hack away....