Important

You can run this notebook in a live session Binder or view it on nbviewer or GitHub.

Process Tabular Data with Pandas

In this notebook you will:

  • Load a data from a scan.

  • Visualize it.

  • Explore the dataset, performing basic filtering and statistic operations using pandas.

Recommended Prerequisites:

Configuration

Below, we will connect to EPICS IOC(s) controlling simulated hardware in lieu of actual motors, detectors. The IOCs should already be running in the background. Run this command to verify that they are running: it should produce output with RUNNING on each line. In the event of a problem, edit this command to replace status with restart all and run again.

[1]:
!supervisorctl -c supervisor/supervisord.conf status
decay                            RUNNING   pid 4977, uptime 0:03:46
mini_beamline                    RUNNING   pid 4978, uptime 0:03:46
random_walk                      RUNNING   pid 4979, uptime 0:03:46
random_walk_horiz                RUNNING   pid 4980, uptime 0:03:46
random_walk_vert                 RUNNING   pid 4981, uptime 0:03:46
simple                           RUNNING   pid 4982, uptime 0:03:46
thermo_sim                       RUNNING   pid 4983, uptime 0:03:46
trigger_with_pc                  RUNNING   pid 4984, uptime 0:03:46
[2]:
%run scripts/beamline_configuration.py
/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/pims/image_reader.py:26: RuntimeWarning: PIMS image_reader.py could not find scikit-image. Falling back to matplotlib's imread(), which uses floats instead of integers. This may break your scripts.
(To ignore this warning, include the line "warnings.simplefilter("ignore", RuntimeWarning)" in your script.)
  warnings.warn(RuntimeWarning(ski_preferred))

Quickly access and explore the data

Let’s take sample data.

[3]:
# Creating a figure explicitly in advance helps with the
# top-to-bottom flow of this notebook, but it is not necessary.
# If this is omitted, bluesky will cause a figure to appear
# during the RE(...) execution below.
plt.figure('I slit_det vs motor_slit')
[3]:
<Figure size 640x480 with 0 Axes>
[4]:
RE(scan([slit, I], motor_slit, -10, 10, 15))


Transient Scan ID: 26     Time: 2020-07-15 20:28:53
Persistent Unique Scan ID: 'e24a362a-e00a-40a6-a48a-ce23c7bbe585'
New stream: 'primary'
+-----------+------------+------------+------------+------------+
|   seq_num |       time | motor_slit |   slit_det |          I |
+-----------+------------+------------+------------+------------+
|         1 | 20:28:54.6 |    -10.000 |       8290 |        523 |
|         2 | 20:28:54.6 |     -8.571 |      28656 |        525 |
|         3 | 20:28:54.6 |     -7.143 |      61382 |        525 |
|         4 | 20:28:54.6 |     -5.714 |      88631 |        524 |
|         5 | 20:28:54.6 |     -4.286 |     101607 |        524 |
|         6 | 20:28:54.6 |     -2.857 |     104580 |        524 |
|         7 | 20:28:54.6 |     -1.429 |     105238 |        524 |
|         8 | 20:28:54.6 |      0.000 |     104476 |        524 |
|         9 | 20:28:54.6 |      1.429 |     104986 |        524 |
|        10 | 20:28:54.6 |      2.857 |     104319 |        524 |
|        11 | 20:28:54.7 |      4.286 |     101189 |        524 |
|        12 | 20:28:54.7 |      5.714 |      88033 |        523 |
|        13 | 20:28:54.8 |      7.143 |      60750 |        521 |
|        14 | 20:28:54.9 |      8.571 |      28335 |        521 |
|        15 | 20:28:54.9 |     10.000 |       7951 |        519 |
+-----------+------------+------------+------------+------------+
generator scan ['e24a362a'] (scan num: 26)



[4]:
('e24a362a-e00a-40a6-a48a-ce23c7bbe585',)
[5]:
plt.gcf()  # Display a snapshot of the current state of the figure.
[5]:
_images/Process_Tabular_Data_with_Pandas_6_0.png

Previously we used the syntax db(...) to search for Headers. We looped through each Header and extracted some of the metadata encapsulated in it. We use the shorthand db[-N] to retrieve exactly one header, the one corresponding to N runs ago.

[6]:
header = db[-1]  # Header for the most recent one (i.e. 1 run ago)

We can access the data as a table. This object is called a pandas.DataFrame, a high-performance, spreadsheet-like object in Python.

[7]:
data = header.table()

It display prettily, truncating if needed.

[8]:
data
[8]:
time slit_det I motor_slit
seq_num
1 2020-07-15 20:28:54.601291895 8290.0 522.920501 -10.000000
2 2020-07-15 20:28:54.614007473 28656.0 524.951248 -8.571429
3 2020-07-15 20:28:54.625183344 61382.0 524.951248 -7.142857
4 2020-07-15 20:28:54.635529757 88631.0 524.396035 -5.714286
5 2020-07-15 20:28:54.645897865 101607.0 524.396035 -4.285714
6 2020-07-15 20:28:54.655539036 104580.0 524.396035 -2.857143
7 2020-07-15 20:28:54.665376186 105238.0 524.396035 -1.428571
8 2020-07-15 20:28:54.674939632 104476.0 524.396035 0.000000
9 2020-07-15 20:28:54.685200453 104986.0 524.396035 1.428571
10 2020-07-15 20:28:54.695705175 104319.0 524.396035 2.857143
11 2020-07-15 20:28:54.719940901 101189.0 524.396035 4.285714
12 2020-07-15 20:28:54.793751240 88033.0 523.210619 5.714286
13 2020-07-15 20:28:54.866239309 60750.0 521.451295 7.142857
14 2020-07-15 20:28:54.942331314 28335.0 521.451295 8.571429
15 2020-07-15 20:28:54.955762863 7951.0 519.172368 10.000000

For large tables, it’s handy to know how to ask for just the first or last couple rows.

It is easy to visualize.

[9]:
data.plot(x='motor_slit', y='slit_det')
[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f04e13054a8>
[10]:
plt.gcf()  # Display a snapshot of the current state of the figure.
[10]:
_images/Process_Tabular_Data_with_Pandas_16_0.png
[11]:
data.head()
[11]:
time slit_det I motor_slit
seq_num
1 2020-07-15 20:28:54.601291895 8290.0 522.920501 -10.000000
2 2020-07-15 20:28:54.614007473 28656.0 524.951248 -8.571429
3 2020-07-15 20:28:54.625183344 61382.0 524.951248 -7.142857
4 2020-07-15 20:28:54.635529757 88631.0 524.396035 -5.714286
5 2020-07-15 20:28:54.645897865 101607.0 524.396035 -4.285714
[12]:
data.tail()
[12]:
time slit_det I motor_slit
seq_num
11 2020-07-15 20:28:54.719940901 101189.0 524.396035 4.285714
12 2020-07-15 20:28:54.793751240 88033.0 523.210619 5.714286
13 2020-07-15 20:28:54.866239309 60750.0 521.451295 7.142857
14 2020-07-15 20:28:54.942331314 28335.0 521.451295 8.571429
15 2020-07-15 20:28:54.955762863 7951.0 519.172368 10.000000

Statistics on columns are simple to compute.

[13]:
data.mean()
[13]:
slit_det      7.322820e+04
I             5.235518e+02
motor_slit    4.736952e-16
dtype: float64
[14]:
data.min()
[14]:
time          2020-07-15 20:28:54.601291895
slit_det                               7951
I                                   519.172
motor_slit                              -10
dtype: object

Or just ask for all the common stats at once:

[15]:
data.describe()
[15]:
slit_det I motor_slit
count 15.000000 15.000000 1.500000e+01
mean 73228.200000 523.551790 4.736952e-16
std 37587.018422 1.650484 6.388766e+00
min 7951.000000 519.172368 -1.000000e+01
25% 44703.000000 523.065560 -5.000000e+00
50% 88631.000000 524.396035 0.000000e+00
75% 104397.500000 524.396035 5.000000e+00
max 105238.000000 524.951248 1.000000e+01

Do analysis, and plot

Normalize by beam current

We can do math on the columns, and we can add new columns.

[16]:
data['normalized'] = data['slit_det'] / data['I'] * data['I'].mean()

We can plot them conveniently, addressing the columns by name.

[17]:
plt.figure()
[17]:
<Figure size 640x480 with 0 Axes>
[18]:
plt.plot('motor_slit', 'slit_det', data=data, label='raw')
plt.plot('motor_slit', 'normalized', data=data, label='normalized')
plt.legend()
plt.gcf()
[18]:
_images/Process_Tabular_Data_with_Pandas_29_0.png

Select only rows with high beam current

Select rows where the current is above 500.

[19]:
data[data['I'] > 500]
[19]:
time slit_det I motor_slit normalized
seq_num
1 2020-07-15 20:28:54.601291895 8290.0 522.920501 -10.000000 8300.008003
2 2020-07-15 20:28:54.614007473 28656.0 524.951248 -8.571429 28579.606532
3 2020-07-15 20:28:54.625183344 61382.0 524.951248 -7.142857 61218.362931
4 2020-07-15 20:28:54.635529757 88631.0 524.396035 -5.714286 88488.309619
5 2020-07-15 20:28:54.645897865 101607.0 524.396035 -4.285714 101443.419068
6 2020-07-15 20:28:54.655539036 104580.0 524.396035 -2.857143 104411.632724
7 2020-07-15 20:28:54.665376186 105238.0 524.396035 -1.428571 105068.573385
8 2020-07-15 20:28:54.674939632 104476.0 524.396035 0.000000 104307.800157
9 2020-07-15 20:28:54.685200453 104986.0 524.396035 1.428571 104816.979089
10 2020-07-15 20:28:54.695705175 104319.0 524.396035 2.857143 104151.052917
11 2020-07-15 20:28:54.719940901 101189.0 524.396035 4.285714 101026.092022
12 2020-07-15 20:28:54.793751240 88033.0 523.210619 5.714286 88090.403934
13 2020-07-15 20:28:54.866239309 60750.0 521.451295 7.142857 60994.711388
14 2020-07-15 20:28:54.942331314 28335.0 521.451295 8.571429 28449.138225
15 2020-07-15 20:28:54.955762863 7951.0 519.172368 10.000000 8018.069807

…or, where the the current is above the 30th percentile.

[20]:
data[data['I'] > data['I'].quantile(0.30)]
[20]:
time slit_det I motor_slit normalized
seq_num
2 2020-07-15 20:28:54.614007473 28656.0 524.951248 -8.571429 28579.606532
3 2020-07-15 20:28:54.625183344 61382.0 524.951248 -7.142857 61218.362931
4 2020-07-15 20:28:54.635529757 88631.0 524.396035 -5.714286 88488.309619
5 2020-07-15 20:28:54.645897865 101607.0 524.396035 -4.285714 101443.419068
6 2020-07-15 20:28:54.655539036 104580.0 524.396035 -2.857143 104411.632724
7 2020-07-15 20:28:54.665376186 105238.0 524.396035 -1.428571 105068.573385
8 2020-07-15 20:28:54.674939632 104476.0 524.396035 0.000000 104307.800157
9 2020-07-15 20:28:54.685200453 104986.0 524.396035 1.428571 104816.979089
10 2020-07-15 20:28:54.695705175 104319.0 524.396035 2.857143 104151.052917
11 2020-07-15 20:28:54.719940901 101189.0 524.396035 4.285714 101026.092022

…and plot just those rows, using a marker so it is easy to see which points have been disqualified due to low current.

[21]:
data[data['I'] > data['I'].quantile(0.30)].plot(x='motor_slit', y='slit_det', marker='o')
[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f04e12379b0>

Histogram the current

[22]:
plt.figure()
[22]:
<Figure size 640x480 with 0 Axes>
[23]:
plt.hist(data['I'], bins='auto')
plt.xlabel('current')
plt.ylabel('count')
plt.gcf()
[23]:
_images/Process_Tabular_Data_with_Pandas_39_0.png

Not enough data to be interesting. Try collecting a bunch of current data by itself and then try this again.

[24]:
# Creating a figure explicitly in advance helps with the
# top-to-bottom flow of this notebook, but it is not necessary.
# If this is omitted, bluesky will cause a figure to appear
# during the RE(...) execution below.
plt.figure("I vs time")
[24]:
<Figure size 640x480 with 0 Axes>
[25]:
RE(count([I], num=200, delay=0.1))


Transient Scan ID: 27     Time: 2020-07-15 20:28:57
Persistent Unique Scan ID: 'a698903d-522c-48a6-9a86-8b1f294d084e'
New stream: 'primary'
+-----------+------------+------------+
|   seq_num |       time |          I |
+-----------+------------+------------+
|         1 | 20:28:57.2 |        487 |
|         2 | 20:28:57.3 |        491 |
|         3 | 20:28:57.4 |        494 |
|         4 | 20:28:57.5 |        498 |
|         5 | 20:28:57.6 |        502 |
|         6 | 20:28:57.7 |        506 |
|         7 | 20:28:57.8 |        510 |
|         8 | 20:28:57.9 |        513 |
|         9 | 20:28:58.0 |        519 |
|        10 | 20:28:58.1 |        519 |
|        11 | 20:28:58.2 |        523 |
|        12 | 20:28:58.3 |        523 |
|        13 | 20:28:58.4 |        525 |
|        14 | 20:28:58.5 |        525 |
|        15 | 20:28:58.6 |        524 |
|        16 | 20:28:58.7 |        523 |
|        17 | 20:28:58.8 |        521 |
|        18 | 20:28:58.9 |        519 |
|        19 | 20:28:59.0 |        516 |
|        20 | 20:28:59.1 |        512 |
|        21 | 20:28:59.2 |        509 |
|        22 | 20:28:59.3 |        505 |
|        23 | 20:28:59.4 |        501 |
|        24 | 20:28:59.5 |        497 |
|        25 | 20:28:59.6 |        493 |
|        26 | 20:28:59.7 |        489 |
|        27 | 20:28:59.8 |        486 |
|        28 | 20:28:59.9 |        483 |
|        29 | 20:29:00.1 |        480 |
|        30 | 20:29:00.1 |        478 |
|        31 | 20:29:00.2 |        476 |
|        32 | 20:29:00.3 |        475 |
|        33 | 20:29:00.4 |        475 |
|        34 | 20:29:00.5 |        475 |
|        35 | 20:29:00.6 |        476 |
|        36 | 20:29:00.7 |        477 |
|        37 | 20:29:00.8 |        479 |
|        38 | 20:29:00.9 |        482 |
|        39 | 20:29:01.1 |        485 |
|        40 | 20:29:01.1 |        488 |
|        41 | 20:29:01.2 |        492 |
|        42 | 20:29:01.4 |        496 |
|        43 | 20:29:01.4 |        500 |
|        44 | 20:29:01.5 |        504 |
|        45 | 20:29:01.7 |        507 |
|        46 | 20:29:01.8 |        511 |
|        47 | 20:29:01.9 |        515 |
|        48 | 20:29:02.0 |        518 |
|        49 | 20:29:02.1 |        520 |
+-----------+------------+------------+
|   seq_num |       time |          I |
+-----------+------------+------------+
|        50 | 20:29:02.2 |        522 |
|        51 | 20:29:02.3 |        524 |
|        52 | 20:29:02.4 |        525 |
|        53 | 20:29:02.5 |        525 |
|        54 | 20:29:02.6 |        525 |
|        55 | 20:29:02.7 |        524 |
|        56 | 20:29:02.8 |        522 |
|        57 | 20:29:02.9 |        520 |
|        58 | 20:29:03.0 |        518 |
|        59 | 20:29:03.1 |        515 |
|        60 | 20:29:03.2 |        511 |
|        61 | 20:29:03.3 |        507 |
|        62 | 20:29:03.4 |        504 |
|        63 | 20:29:03.5 |        500 |
|        64 | 20:29:03.6 |        496 |
|        65 | 20:29:03.7 |        492 |
|        66 | 20:29:03.8 |        488 |
|        67 | 20:29:03.9 |        485 |
|        68 | 20:29:04.0 |        482 |
|        69 | 20:29:04.1 |        479 |
|        70 | 20:29:04.2 |        477 |
|        71 | 20:29:04.3 |        476 |
|        72 | 20:29:04.4 |        475 |
|        73 | 20:29:04.5 |        475 |
|        74 | 20:29:04.6 |        475 |
|        75 | 20:29:04.7 |        476 |
|        76 | 20:29:04.8 |        478 |
|        77 | 20:29:04.9 |        480 |
|        78 | 20:29:05.0 |        483 |
|        79 | 20:29:05.1 |        486 |
|        80 | 20:29:05.2 |        489 |
|        81 | 20:29:05.3 |        493 |
|        82 | 20:29:05.4 |        497 |
|        83 | 20:29:05.5 |        501 |
|        84 | 20:29:05.6 |        505 |
|        85 | 20:29:05.7 |        509 |
|        86 | 20:29:05.8 |        512 |
|        87 | 20:29:05.9 |        516 |
|        88 | 20:29:06.0 |        518 |
|        89 | 20:29:06.1 |        521 |
|        90 | 20:29:06.2 |        523 |
|        91 | 20:29:06.3 |        524 |
|        92 | 20:29:06.4 |        525 |
|        93 | 20:29:06.5 |        525 |
|        94 | 20:29:06.6 |        524 |
|        95 | 20:29:06.7 |        523 |
|        96 | 20:29:06.8 |        522 |
|        97 | 20:29:06.9 |        519 |
|        98 | 20:29:07.0 |        517 |
|        99 | 20:29:07.2 |        514 |
+-----------+------------+------------+
|   seq_num |       time |          I |
+-----------+------------+------------+
|       100 | 20:29:07.2 |        510 |
|       101 | 20:29:07.3 |        506 |
|       102 | 20:29:07.5 |        502 |
|       103 | 20:29:07.5 |        498 |
|       104 | 20:29:07.7 |        495 |
|       105 | 20:29:07.8 |        491 |
|       106 | 20:29:07.9 |        487 |
|       107 | 20:29:08.0 |        484 |
|       108 | 20:29:08.1 |        481 |
|       109 | 20:29:08.2 |        479 |
|       110 | 20:29:08.3 |        477 |
|       111 | 20:29:08.4 |        476 |
|       112 | 20:29:08.5 |        475 |
|       113 | 20:29:08.6 |        475 |
|       114 | 20:29:08.7 |        476 |
|       115 | 20:29:08.8 |        479 |
|       116 | 20:29:08.9 |        479 |
|       117 | 20:29:09.0 |        481 |
|       118 | 20:29:09.1 |        484 |
|       119 | 20:29:09.2 |        487 |
|       120 | 20:29:09.3 |        491 |
|       121 | 20:29:09.4 |        498 |
|       122 | 20:29:09.5 |        498 |
|       123 | 20:29:09.6 |        506 |
|       124 | 20:29:09.7 |        510 |
|       125 | 20:29:09.8 |        513 |
|       126 | 20:29:09.9 |        516 |
|       127 | 20:29:10.0 |        519 |
|       128 | 20:29:10.1 |        521 |
|       129 | 20:29:10.2 |        523 |
|       130 | 20:29:10.3 |        524 |
|       131 | 20:29:10.4 |        525 |
|       132 | 20:29:10.5 |        525 |
|       133 | 20:29:10.6 |        524 |
|       134 | 20:29:10.7 |        523 |
|       135 | 20:29:10.8 |        521 |
|       136 | 20:29:10.9 |        519 |
|       137 | 20:29:11.0 |        516 |
|       138 | 20:29:11.1 |        513 |
|       139 | 20:29:11.2 |        509 |
|       140 | 20:29:11.3 |        505 |
|       141 | 20:29:11.4 |        501 |
|       142 | 20:29:11.5 |        497 |
|       143 | 20:29:11.6 |        493 |
|       144 | 20:29:11.7 |        490 |
|       145 | 20:29:11.8 |        486 |
|       146 | 20:29:11.9 |        483 |
|       147 | 20:29:12.1 |        480 |
|       148 | 20:29:12.1 |        478 |
|       149 | 20:29:12.2 |        477 |
+-----------+------------+------------+
|   seq_num |       time |          I |
+-----------+------------+------------+
|       150 | 20:29:12.3 |        475 |
|       151 | 20:29:12.4 |        475 |
|       152 | 20:29:12.5 |        475 |
|       153 | 20:29:12.6 |        476 |
|       154 | 20:29:12.8 |        477 |
|       155 | 20:29:12.8 |        479 |
|       156 | 20:29:12.9 |        482 |
|       157 | 20:29:13.0 |        485 |
|       158 | 20:29:13.1 |        488 |
|       159 | 20:29:13.3 |        492 |
|       160 | 20:29:13.3 |        495 |
|       161 | 20:29:13.4 |        499 |
|       162 | 20:29:13.6 |        503 |
|       163 | 20:29:13.6 |        507 |
|       164 | 20:29:13.8 |        511 |
|       165 | 20:29:13.9 |        514 |
|       166 | 20:29:14.0 |        517 |
|       167 | 20:29:14.1 |        520 |
|       168 | 20:29:14.2 |        522 |
|       169 | 20:29:14.3 |        524 |
|       170 | 20:29:14.4 |        525 |
|       171 | 20:29:14.5 |        525 |
|       172 | 20:29:14.6 |        525 |
|       173 | 20:29:14.7 |        524 |
|       174 | 20:29:14.8 |        522 |
|       175 | 20:29:14.9 |        520 |
|       176 | 20:29:15.0 |        518 |
|       177 | 20:29:15.1 |        515 |
|       178 | 20:29:15.2 |        511 |
|       179 | 20:29:15.3 |        508 |
|       180 | 20:29:15.4 |        504 |
|       181 | 20:29:15.5 |        500 |
|       182 | 20:29:15.6 |        496 |
|       183 | 20:29:15.7 |        492 |
|       184 | 20:29:15.8 |        488 |
|       185 | 20:29:15.9 |        485 |
|       186 | 20:29:16.0 |        482 |
|       187 | 20:29:16.1 |        480 |
|       188 | 20:29:16.2 |        478 |
|       189 | 20:29:16.3 |        476 |
|       190 | 20:29:16.4 |        475 |
|       191 | 20:29:16.5 |        475 |
|       192 | 20:29:16.6 |        475 |
|       193 | 20:29:16.7 |        476 |
|       194 | 20:29:16.8 |        478 |
|       195 | 20:29:16.9 |        480 |
|       196 | 20:29:17.0 |        483 |
|       197 | 20:29:17.1 |        486 |
|       198 | 20:29:17.2 |        489 |
|       199 | 20:29:17.3 |        493 |
+-----------+------------+------------+
|   seq_num |       time |          I |
+-----------+------------+------------+
|       200 | 20:29:17.4 |        497 |
+-----------+------------+------------+
generator count ['a698903d'] (scan num: 27)



[25]:
('a698903d-522c-48a6-9a86-8b1f294d084e',)
[26]:
plt.gcf()  # Display a snapshot of the current state of the figure.
[26]:
_images/Process_Tabular_Data_with_Pandas_43_0.png
[27]:
plt.figure()
[27]:
<Figure size 640x480 with 0 Axes>
[28]:
more_current_data = db[-1].table()
plt.hist(more_current_data['I'], bins='auto')
plt.xlabel('current')
plt.ylabel('count')
plt.gcf()
[28]:
_images/Process_Tabular_Data_with_Pandas_45_0.png

Exercises

Spend some time reading 10 minutes to pandas, and use this notebook to try some of the examples there.

[29]:
# Hack away....