Structures

Tiled Adapters provide data in one of a fixed group of standard structure families. These are not Python-specific structures. They can be encoded in standard, language-agnostic formats and transferred from the service to a client in potentially any language.

Supported structure families

The most commonly used structure families are:

  • array — a strided array, like a numpy array

  • dataframe — tabular data, as in Apache Arrow or pandas

  • node — a grouping of other structures, akin to a directory

Additional structures come from xarray. They may be considered containers for one or more strided arrays.

  • xarray_data_array

  • xarray_dataset

Support for Awkward Array is planned.

How structure is encoded

Tiled can describe a structure—its shape, chunking, labels, and so on—for the client so that the client can intelligently request the pieces that it wants.

The structure encodings are designed to be as unoriginal as possible, using established standards and, where some invention is required, using established names from numpy, pandas/Arrow, xarray, and dask.

The structures are encoded in two parts:

  • Macrostructure — This is the high-level structure including things like shape, chunk shape, number of partitions, and column names. This structure has meaning to the server and shows up in the HTTP API.

  • Microstructure — This is low-level structure including things like machine data type(s) and partition boundary locations. It enables the service-side Adapter to communicate to the client how to interpret the bytes that represent a given “tile” of data.

Examples

These examples were generated by serving the demo tree

tiled serve pyobject --public tiled.examples.generated:tree

and then making an HTTP request with httpie and extracting the portion of interest with jq, as shown below.

Array (single chunk)

An array is described with a shape, chunk sizes, and a data type. The parameterization and spelling of the data type follow the numpy __array_interface__ protocol. Both built-in data types and structured data types are supported.

An optional field, dims (“dimensions”) may contain a list with a string label for each dimension.

This (100, 100)-shaped array fits in a single (100, 100)-shaped chunk.

$ http :8000/node/metadata/small_image | jq .data.attributes.structure
{
  "macro": {
    "chunks": [
      [
        100
      ],
      [
        100
      ]
    ],
    "shape": [
      100,
      100
    ],
    "dims": null,
    "resizable": false
  },
  "micro": {
    "endianness": "little",
    "kind": "f",
    "itemsize": 8
  }
}
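As a rough illustration of how a client might use the microstructure shown above, the following sketch turns the three fields into a numpy-style type string. The function name `to_typestr` is hypothetical, the "big" entry is an assumed spelling that does not appear in these examples, and the handling of the "U" kind assumes that `itemsize` is given in bytes.

```python
# Byte-order characters used in numpy type strings.
# "little" appears in the examples; "big" is an assumed spelling.
BYTE_ORDER = {"little": "<", "big": ">"}


def to_typestr(micro):
    """Build a numpy-style type string such as '<f8' from a microstructure dict."""
    size = micro["itemsize"]
    if micro["kind"] == "U":
        # numpy spells unicode lengths in characters, at 4 bytes per character.
        size //= 4
    return f"{BYTE_ORDER[micro['endianness']]}{micro['kind']}{size}"


micro = {"endianness": "little", "kind": "f", "itemsize": 8}
print(to_typestr(micro))  # <f8
```

A string of this form is the kind of value that numpy accepts as a data type specification.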

Array (multiple chunks)

This (10000, 10000)-shaped array is subdivided into 4 × 4 = 16 chunks, each of shape (2500, 2500). Chunks do not, in general, have to be equally sized, which is why the size of each chunk is given explicitly.

$ http :8000/node/metadata/big_image | jq .data.attributes.structure
{
  "macro": {
    "chunks": [
      [
        2500,
        2500,
        2500,
        2500
      ],
      [
        2500,
        2500,
        2500,
        2500
      ]
    ],
    "shape": [
      10000,
      10000
    ],
    "dims": null,
    "resizable": false
  },
  "micro": {
    "endianness": "little",
    "kind": "f",
    "itemsize": 8
  }
}
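Because the size of every chunk is listed explicitly, a client can work out which region of the full array each chunk covers before requesting anything. The sketch below does this with cumulative sums; the function name `chunk_slices` is hypothetical and is not part of Tiled.

```python
from itertools import accumulate, product


def chunk_slices(chunks):
    """Yield (block_index, slices) for every chunk described by `chunks`."""
    per_dim = []
    for sizes in chunks:
        stops = list(accumulate(sizes))
        starts = [stop - size for stop, size in zip(stops, sizes)]
        per_dim.append([slice(a, b) for a, b in zip(starts, stops)])
    # Visit every combination of chunk indices across the dimensions.
    for block in product(*(range(len(s)) for s in per_dim)):
        yield block, tuple(per_dim[d][i] for d, i in enumerate(block))


chunks = [[2500, 2500, 2500, 2500], [2500, 2500, 2500, 2500]]
blocks = dict(chunk_slices(chunks))
print(len(blocks))     # 16
print(blocks[(1, 3)])  # (slice(2500, 5000, None), slice(7500, 10000, None))
```

The same function works for unequal chunk sizes, since it never assumes that the entries in a dimension are identical.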

Array (with a structured data type)

This is a 1D array where each item has internal structure, as in numpy’s structured data types.

$ http :8000/node/metadata/structured_data/pets | jq .data.attributes.structure
{
  "macro": {
    "chunks": [
      [
        2
      ]
    ],
    "shape": [
      2
    ],
    "dims": null,
    "resizable": false
  },
  "micro": {
    "itemsize": 48,
    "fields": [
      {
        "name": "name",
        "dtype": {
          "endianness": "little",
          "kind": "U",
          "itemsize": 40
        },
        "shape": null
      },
      {
        "name": "age",
        "dtype": {
          "endianness": "little",
          "kind": "i",
          "itemsize": 4
        },
        "shape": null
      },
      {
        "name": "weight",
        "dtype": {
          "endianness": "little",
          "kind": "f",
          "itemsize": 4
        },
        "shape": null
      }
    ]
  }
}
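To show how the pieces of this microstructure relate, the sketch below computes the byte offset of each field within one item and checks that the fields add up to the overall `itemsize`. It assumes the fields are packed in order with no padding, which is consistent with this example (40 + 4 + 4 = 48) but is not stated as a guarantee. The function name `field_offsets` is hypothetical.

```python
from math import prod


def field_offsets(micro):
    """Return {field name: byte offset}, assuming packed fields in order."""
    offsets = {}
    position = 0
    for field in micro["fields"]:
        offsets[field["name"]] = position
        # A field with a sub-array shape holds several elements of its dtype.
        count = prod(field["shape"]) if field["shape"] is not None else 1
        position += count * field["dtype"]["itemsize"]
    if position != micro["itemsize"]:
        raise ValueError("fields do not account for the full item size")
    return offsets


micro = {
    "itemsize": 48,
    "fields": [
        {"name": "name", "shape": None,
         "dtype": {"endianness": "little", "kind": "U", "itemsize": 40}},
        {"name": "age", "shape": None,
         "dtype": {"endianness": "little", "kind": "i", "itemsize": 4}},
        {"name": "weight", "shape": None,
         "dtype": {"endianness": "little", "kind": "f", "itemsize": 4}},
    ],
}
print(field_offsets(micro))  # {'name': 0, 'age': 40, 'weight': 44}
```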

DataFrame

With dataframes, we speak of “partitions” instead of “chunks”. There are a couple of important distinctions. We always know the size of a chunk before we ask for it, but we do not know the number of rows in a partition until we actually read it and count them. Therefore, we cannot slice into dataframes the same way that we can slice into arrays. We can ask for a subset of the columns, and we can fetch partitions one at a time in any order, but we cannot make requests like “rows 100-200”. (Dask has the same limitation, for the same reason.)

$ http :8000/node/metadata/long_table | jq .data.attributes.structure
{
  "macro": {
    "npartitions": 5,
    "columns": [
      "A",
      "B",
      "C"
    ],
    "resizable": false
  },
  "micro": {
    "meta": "data:application/vnd.apache.arrow.file;base64,...",
    "divisions": "data:application/vnd.apache.arrow.file;base64,..."
  }
}

Notice that the microstructure contains base64-encoded data. The correct way to encode dataframes and their data types in a cross-language way is with Apache Arrow. Apache Arrow is a binary format. It explicitly does not support JSON. (There is a JSON implementation, but the documentation states that it is intended only for integration testing and should not be used by external code.) Therefore, when JSON is requested, we base64-encode the Arrow data. When binary msgpack is requested instead of JSON, we pack the binary data directly.
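The base64-encoded values in the JSON response follow the usual data URI layout, so a client can recover the raw bytes with the standard library alone. The sketch below is one way to do that; the function name `decode_data_uri` is hypothetical, and the payload is a made-up stand-in rather than real Arrow data.

```python
import base64


def decode_data_uri(uri):
    """Split 'data:<media type>;base64,<payload>' into (media_type, bytes)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URI")
    media_type = header[len("data:"):-len(";base64")]
    return media_type, base64.b64decode(payload)


# Hypothetical payload for illustration only; a real one holds an Arrow file.
payload = base64.b64encode(b"not real arrow bytes").decode()
example = "data:application/vnd.apache.arrow.file;base64," + payload

media_type, raw = decode_data_uri(example)
print(media_type)  # application/vnd.apache.arrow.file
print(raw)         # b'not real arrow bytes'
```

With a real response, the resulting bytes would then be handed to an Arrow file reader in the client's language.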

The microstructure has two parts:

  • meta — This contains the names and data types of the columns and index. To generate this we build a dataframe with zero rows in it but the same columns and indexes as the original, and then serialize that with Arrow.

  • divisions — This contains the index values that delineate each partition. We generate this in a similar way.

Both of the concepts (and their names) are borrowed directly from dask.dataframe. They should enable any client, including in languages other than Python, to perform the same function.
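As an illustration of what divisions make possible, the sketch below looks up which partition should hold a given index value. It assumes the dask.dataframe convention: a sorted index, one more division than there are partitions, and a final division that is the inclusive upper bound of the last partition. The function name `partition_for` and the division values are hypothetical.

```python
from bisect import bisect_right


def partition_for(divisions, value):
    """Return the number of the partition expected to hold index `value`."""
    npartitions = len(divisions) - 1
    if not divisions[0] <= value <= divisions[-1]:
        raise KeyError(value)
    # The last division is inclusive, so clamp to the final partition.
    return min(bisect_right(divisions, value) - 1, npartitions - 1)


# Hypothetical divisions for a table with 5 partitions.
divisions = [0, 20, 40, 60, 80, 99]
print(partition_for(divisions, 0))   # 0
print(partition_for(divisions, 40))  # 2
print(partition_for(divisions, 99))  # 4
```

Note that this locates partitions by index value, not by row position, so it does not lift the restriction on requests like “rows 100-200”.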

Data Array (xarray)

A DataArray is an array with labeled dimensions, grouped with optional “coordinates”, which are tick labels for the dimensions.

Here is an example DataArray that holds only an array, without coordinates.

$ http :8000/node/metadata/structured_data/xarray_data_array | jq .data.attributes.structure
{
  "macro": {
    "variable": {
      "macro": {
        "chunks": [
          [
            1000
          ],
          [
            1000
          ]
        ],
        "shape": [
          1000,
          1000
        ],
        "dims": [
          "x",
          "y"
        ],
        "resizable": false
      },
      "micro": {
        "endianness": "little",
        "kind": "f",
        "itemsize": 8
      }
    },
    "coords": {},
    "coord_names": [],
    "name": null,
    "resizable": false
  }
}

And here is an example DataArray with an array and coordinates:

$ http :8000/node/metadata/structured_data/image_with_coords | jq .data.attributes.structure
{
  "macro": {
    "variable": {
      "macro": {
        "chunks": [
          [
            1000
          ],
          [
            1000
          ]
        ],
        "shape": [
          1000,
          1000
        ],
        "dims": [
          "x",
          "y"
        ],
        "resizable": false
      },
      "micro": {
        "endianness": "little",
        "kind": "f",
        "itemsize": 8
      }
    },
    "coords": {
      "x": {
        "macro": {
          "variable": {
            "macro": {
              "chunks": [
                [
                  1000
                ]
              ],
              "shape": [
                1000
              ],
              "dims": [
                "x"
              ],
              "resizable": false
            },
            "micro": {
              "endianness": "little",
              "kind": "f",
              "itemsize": 8
            }
          },
          "coords": null,
          "coord_names": [
            "x"
          ],
          "name": "x",
          "resizable": false
        },
        "micro": null
      },
      "y": {
        "macro": {
          "variable": {
            "macro": {
              "chunks": [
                [
                  1000
                ]
              ],
              "shape": [
                1000
              ],
              "dims": [
                "y"
              ],
              "resizable": false
            },
            "micro": {
              "endianness": "little",
              "kind": "f",
              "itemsize": 8
            }
          },
          "coords": null,
          "coord_names": [
            "y"
          ],
          "name": "y",
          "resizable": false
        },
        "micro": null
      }
    },
    "coord_names": [
      "x",
      "y"
    ],
    "name": null,
    "resizable": false
  }
}
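The nesting in this structure can be hard to follow by eye. The sketch below walks it to map each dimension name to its length and to confirm that every coordinate has the same length as the dimension it labels. The helper names are hypothetical, the example dict is trimmed to the keys the code reads, and coordinates are assumed to be 1D as they are here.

```python
def array_macro(node):
    """Return the array macrostructure nested inside a DataArray structure."""
    return node["macro"]["variable"]["macro"]


def check_coords(structure):
    """Map each dimension to its length and confirm the coordinates match."""
    variable = array_macro(structure)
    sizes = dict(zip(variable["dims"], variable["shape"]))
    for name, coord in (structure["macro"]["coords"] or {}).items():
        (length,) = array_macro(coord)["shape"]  # 1D coordinates only
        if sizes.get(name) != length:
            raise ValueError(f"coordinate {name!r} does not match its dimension")
    return sizes


def data_array(dims, shape, coords=None):
    """Build a trimmed DataArray structure for illustration."""
    variable = {"macro": {"dims": dims, "shape": shape}}
    return {"macro": {"variable": variable, "coords": coords}}


structure = data_array(
    ["x", "y"],
    [1000, 1000],
    coords={"x": data_array(["x"], [1000]), "y": data_array(["y"], [1000])},
)
print(check_coords(structure))  # {'x': 1000, 'y': 1000}
```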

Dataset (xarray)

A Dataset is a dict-like collection of DataArrays that may share coordinates.

$ http :8000/node/metadata/structured_data/xarray_dataset | jq .data.attributes.structure
{
  "macro": {
    "data_vars": {
      "image": {
        "macro": {
          "variable": {
            "macro": {
              "chunks": [
                [
                  1000
                ],
                [
                  1000
                ]
              ],
              "shape": [
                1000,
                1000
              ],
              "dims": [
                "x",
                "y"
              ],
              "resizable": false
            },
            "micro": {
              "endianness": "little",
              "kind": "f",
              "itemsize": 8
            }
          },
          "coords": null,
          "coord_names": [
            "x",
            "y"
          ],
          "name": "image",
          "resizable": false
        },
        "micro": null
      },
      "z": {
        "macro": {
          "variable": {
            "macro": {
              "chunks": [
                [
                  1000
                ]
              ],
              "shape": [
                1000
              ],
              "dims": [
                "dim_0"
              ],
              "resizable": false
            },
            "micro": {
              "endianness": "little",
              "kind": "f",
              "itemsize": 8
            }
          },
          "coords": null,
          "coord_names": [],
          "name": "z",
          "resizable": false
        },
        "micro": null
      }
    },
    "coords": {
      "x": {
        "macro": {
          "variable": {
            "macro": {
              "chunks": [
                [
                  1000
                ]
              ],
              "shape": [
                1000
              ],
              "dims": [
                "x"
              ],
              "resizable": false
            },
            "micro": {
              "endianness": "little",
              "kind": "f",
              "itemsize": 8
            }
          },
          "coords": null,
          "coord_names": [
            "x"
          ],
          "name": "x",
          "resizable": false
        },
        "micro": null
      },
      "y": {
        "macro": {
          "variable": {
            "macro": {
              "chunks": [
                [
                  1000
                ]
              ],
              "shape": [
                1000
              ],
              "dims": [
                "y"
              ],
              "resizable": false
            },
            "micro": {
              "endianness": "little",
              "kind": "f",
              "itemsize": 8
            }
          },
          "coords": null,
          "coord_names": [
            "y"
          ],
          "name": "y",
          "resizable": false
        },
        "micro": null
      }
    },
    "resizable": false
  }
}
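Since the variables in a Dataset may share dimensions, a client may want a single table of dimension lengths. The sketch below collects one from both data_vars and coords and raises an error if two variables disagree. The function names are hypothetical, and the example dict is trimmed to the keys the code reads.

```python
def dimension_sizes(dataset):
    """Collect {dimension name: length} across data_vars and coords."""
    sizes = {}
    for group in ("data_vars", "coords"):
        for name, item in dataset["macro"][group].items():
            variable = item["macro"]["variable"]["macro"]
            for dim, length in zip(variable["dims"], variable["shape"]):
                # setdefault records the first length seen for each dimension.
                if sizes.setdefault(dim, length) != length:
                    raise ValueError(f"{name!r} disagrees on the size of {dim!r}")
    return sizes


def data_array(dims, shape):
    """Build a trimmed DataArray structure for illustration."""
    return {"macro": {"variable": {"macro": {"dims": dims, "shape": shape}}}}


dataset = {
    "macro": {
        "data_vars": {
            "image": data_array(["x", "y"], [1000, 1000]),
            "z": data_array(["dim_0"], [1000]),
        },
        "coords": {
            "x": data_array(["x"], [1000]),
            "y": data_array(["y"], [1000]),
        },
    }
}
print(dimension_sizes(dataset))  # {'x': 1000, 'y': 1000, 'dim_0': 1000}
```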