Control Speed, Memory Usage, and Disk Usage

There are no solutions, only trade-offs.

We will build Trees with a variety of trade-offs in speed and resource usage. Any of these may be a good solution depending on the specifics of your situation. They are presented in order of increasing complexity.

Everything in memory

import numpy
from tiled.readers.array import ArrayAdapter
from tiled.trees.in_memory import Tree

# Generate data and store it in memory.
a = numpy.random.random((100, 100))
b = numpy.random.random((100, 100))
tree = Tree(
    {
        "a": ArrayAdapter.from_array(a),
        "b": ArrayAdapter.from_array(b),
    }
)
  • Server startup is slow because all data is generated or read up front.

  • Data access is fast because all the data is ready in memory.

  • The machine running the server must have sufficient RAM for all the entries in the Tree. This is not a scalable solution for large Trees with large data.
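
To get a rough sense of the RAM requirement, sum the in-memory size of the entries. Each 100 x 100 array of float64 above occupies 80 kB:

import numpy

a = numpy.random.random((100, 100))
print(a.nbytes)  # 80000 bytes: 100 * 100 entries * 8 bytes per float64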

Load on first read

For the next example we’ll use a neat dictionary-like object. It is created by mapping keys to functions. The first time a given item is accessed, the function is called to generate the value, and the result is stashed internally for next time.

>>> from tiled.utils import OneShotCachedMap
>>> m = OneShotCachedMap({"a": lambda: 1, "b": lambda: 2})
>>> m["a"]  # value is computed on demand, so this could be slow
1
>>> m["a"]  # value is returned immediately, remembered from last time
1
>>> dict(m)  # can be converted to an ordinary dict
{'a': 1, 'b': 2}

Notice that once the object is created, it acts just like a normal Python mapping. Downstream code does not have to do anything special to work with it.
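
For intuition, here is a minimal sketch of the idea. This is not the actual tiled.utils implementation, just an illustration of the pattern: a Mapping whose factory functions are each called at most once.

from collections.abc import Mapping

class SketchOneShotCachedMap(Mapping):
    # Hypothetical illustration, not the real OneShotCachedMap.
    def __init__(self, factories):
        self._factories = dict(factories)
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            # First access: call the factory and stash the result.
            self._cache[key] = self._factories[key]()
        return self._cache[key]

    def __iter__(self):
        return iter(self._factories)

    def __len__(self):
        return len(self._factories)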

It can be integrated with Tree directly, just by replacing the dictionary in the first example with a OneShotCachedMap.

import numpy
from tiled.utils import OneShotCachedMap
from tiled.readers.array import ArrayAdapter
from tiled.trees.in_memory import Tree

# Use OneShotCachedMap which maps keys to *functions* that are
# run when the data is first accessed.
tree = Tree(
    OneShotCachedMap(
        {
            "a": lambda: ArrayAdapter.from_array(numpy.random.random((100, 100))),
            "b": lambda: ArrayAdapter.from_array(numpy.random.random((100, 100))),
        }
    )
)
  • Server startup is fast because nothing is generated or read up front.

  • The first access for each item is slow because the data is generated or read on demand. Subsequent access is fast.

  • The machine running the server must have sufficient RAM for all the entries in the Tree. The memory usage will grow monotonically as items are accessed and stashed internally by OneShotCachedMap. This is not a scalable solution for large Trees with large data.

Load on first read and stash for a while (but not forever)

We’ll use another neat dictionary-like object very much like the previous one. These two objects behave identically…

from tiled.utils import CachingMap, OneShotCachedMap

OneShotCachedMap({"a": lambda: 1, "b": lambda: 2})
CachingMap({"a": lambda: 1, "b": lambda: 2}, cache={})

…except that CachingMap can regenerate values if need be and allows us to control which values it stashes and for how long. For example, using the third-party library cachetools we can keep up to N items, discarding the least recently used item when the cache is full.

pip install cachetools

from cachetools import LRUCache

CachingMap({"a": lambda: 1, "b": lambda: 2}, cache=LRUCache(1))
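
To see the eviction at work, access the items in sequence. This small illustration follows the regeneration behavior described above:

from cachetools import LRUCache
from tiled.utils import CachingMap

m = CachingMap({"a": lambda: 1, "b": lambda: 2}, cache=LRUCache(1))
m["a"]  # computed and stashed
m["b"]  # computed; "a" is evicted because the cache holds only one item
m["a"]  # recomputed on demand, since it was evicted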

As another example, we can keep items for up to some time limit. This enforces a maximum size as well, so that the cache cannot grow too large if it is receiving many requests within the specified time window.

from cachetools import TTLCache

# "TTL" stands for "time to live", measured in seconds.
CachingMap({"a": lambda: 1, "b": lambda: 2}, cache=TTLCache(maxsize=100, ttl=10))
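
Likewise, an entry older than ttl seconds is discarded and regenerated on its next access. A quick way to observe this, using a deliberately short time-to-live:

import time
from cachetools import TTLCache
from tiled.utils import CachingMap

m = CachingMap({"a": lambda: 1}, cache=TTLCache(maxsize=100, ttl=1))
m["a"]           # computed and stashed
time.sleep(1.5)  # wait longer than the 1-second time-to-live
m["a"]           # the cached entry has expired, so the value is regenerated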

If we keep a reference to our cache before passing it in, then other parts of the program can explicitly evict items to force an update.

cache = TTLCache(maxsize=100, ttl=10)
CachingMap({"a": lambda: 1, "b": lambda: 2}, cache=cache)

# Meanwhile, elsewhere in the code...
cache.pop("a")  # Force a refresh of this item next time it is accessed from CachingMap.

CachingMap integrates with Tree in exactly the same way as the others.

from cachetools import TTLCache
import numpy
from tiled.utils import CachingMap
from tiled.readers.array import ArrayAdapter
from tiled.trees.in_memory import Tree

# Use CachingMap which again maps keys to *functions* that are
# run when the data is first accessed. The values may be cached
# for a time.
tree = Tree(
    CachingMap(
        {
            "a": lambda: ArrayAdapter.from_array(numpy.random.random((100, 100))),
            "b": lambda: ArrayAdapter.from_array(numpy.random.random((100, 100))),
        },
        cache=TTLCache(maxsize=100, ttl=10),
    ),
)
  • Server startup is fast because nothing is generated or read up front.

  • The first access for each item is slow because the data is generated or read on demand. Later access may be fast or slow depending on whether the item has been evicted from the cache.

  • With an appropriately scaled cache, this is a scalable solution for large Trees with large data.
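
For the arrays in this example, maxsize=100 bounds the cached array data at roughly 100 × 80 kB = 8 MB; tune maxsize and ttl to fit your data sizes and access patterns.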

Proxy data from a network service and keep an on-disk cache

TO DO: Add fsspec example — How to put an upper bound on the disk cache?
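
As a rough sketch of the general shape, fsspec's filecache protocol keeps on-disk copies of remote files and reuses them on later reads. The S3 path and cache directory below are hypothetical placeholders:

import fsspec

# Read a remote file through fsspec's "filecache" protocol, which stores
# a local copy on disk and reuses it on later reads.
with fsspec.open(
    "filecache://example-bucket/data.npy",  # hypothetical remote object
    target_protocol="s3",
    cache_storage="/tmp/tiled-disk-cache",  # hypothetical local cache directory
) as file:
    payload = file.read()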

Load keys dynamically as well as values

In all the previous examples, the keys of the Tree were held in memory, in Python, and the values were sometimes generated on demand. For Trees at the scale of thousands of entries, backed by a database or a web service, we’ll need to generate the keys on demand too. At that point, we escalate to writing a custom class satisfying the Tree specification, which fetches keys and values on demand by making queries specific to the underlying storage.
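
A minimal sketch of the shape such a class might take, where db stands in for whatever storage client is being wrapped (its list_keys, fetch, and count methods are hypothetical):

from collections.abc import Mapping

class StorageBackedMapping(Mapping):
    # Hypothetical sketch: fetch keys and values on demand from an
    # external store instead of holding them in Python memory.
    def __init__(self, db):
        self._db = db

    def __getitem__(self, key):
        return self._db.fetch(key)  # query the store for one value

    def __iter__(self):
        yield from self._db.list_keys()  # stream keys from the store

    def __len__(self):
        return self._db.count()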

TO DO: Simplify Bluesky MongoDB example into MVP and add it here.