Prep

Threads and Concurrency

Normally when we run a program, it runs one instruction at a time. It starts at the beginning of a main function and goes top-to-bottom (sometimes jumping into functions, and back out of them).

There are limits to how fast any one instruction can be run. Processors have stopped getting faster. Instead, we now add more processors to a computer. This allows a computer to do more than one thing at a time. But it’s much harder to tell a computer in what order things need to happen, or what things must/mustn’t happen at the same time.

Most practical projects, and almost all servers, use concurrency - running more than one thing at a time - to improve their performance. They start multiple threads which can independently perform work.

But this concurrency doesn’t come for free. It’s really easy to write incorrect concurrent code, and this often results in serious (and hard to debug!) bugs, or bottlenecks which fall back to single-threaded performance.

These kinds of bugs are called race conditions. A race condition happens when two (or more) pieces of code are trying to handle the same data at the same time, and some parts happen in an invalid order. Race conditions are often complicated to work with: for instance, they may only show up when a computer is very busy (or very idle), on certain computers, on certain operating systems, or just when you get very unlucky. Using proper concurrency controls can prevent these bugs.

Memory Models

As you will have learnt in Chapter 7 of How Computers Really Work, some memory is cheap but slow, and other memory is fast but expensive. CPU manufacturers add fast per-core caches in front of cheap system-wide memory. This causes problems when multiple cores try to share the same item in memory. Each core’s cache may get its own copy of that memory, and if one core changes the contents of that data, the other core won’t see that change.

You can see estimated speeds of different memory (and other) read operations in “Latency Numbers Every Programmer Should Know”. Read this and make sure you understand it. This suggests that reading from shared main memory is 100-1000x slower than reading from a per-core L1 cache.

Throughout this track, we will introduce different ways we can handle this problem.

Intro Reading

Reading

If you are doing this track in Go, read the concurrency section of golangbot.com (i.e. chapters 20-25). Run the examples and make sure you understand their output.

If you are doing this track in Java, read the Java Concurrency Tutorial. Run the examples and make sure you understand their output.

Single-variable Concurrency

Let’s look at this example of buggy code (taken from the Mutexes chapter of golangbot, where there are tabs with the same code in different languages):

package main

import (
	"fmt"
	"sync"
)

var x = 0

func increment(wg *sync.WaitGroup) {
	x = x + 1
	wg.Done()
}

func main() {
	var w sync.WaitGroup
	for i := 0; i < 1000; i++ {
		w.Add(1)
		go increment(&w)
	}
	w.Wait()
	fmt.Println("final value of x", x)
}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class Main {
  static final class IntegerHolder {
    public int value = 0;
  }

  public static void main(String[] args) throws Exception {
    final IntegerHolder holder = new IntegerHolder();

    List<Future<?>> futures = new ArrayList<>();

    // Make a threadpool of 100 threads.
    ExecutorService executorService = Executors.newFixedThreadPool(100);

    // Submit 1000 tasks to run on those 100 threads.
    for (int i = 0; i < 1000; i++) {
      futures.add(executorService.submit(() -> {
        holder.value = holder.value + 1;
      }));
    }

    // Wait for all of the tasks to finish.
    for (Future<?> future : futures) {
      future.get();
    }

    // Stop the threadpool.
    executorService.shutdown();

    // Print out the value of the integer at the end - naively, we expect this to be 1000.
    System.out.println(holder.value);
  }
}

In an ideal world, this program would always output 1000.

Exercise

Extract the main function into a new function, and change main to call that function 5 times. Observe the results - different runs will likely show different numbers.
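
In Go, that restructuring might look something like the following sketch (reusing the package, imports, x, and increment from the Go example above; the run name and the reset of x are just one way to do it):

func run() {
	x = 0 // reset the shared counter so each run starts from zero
	var w sync.WaitGroup
	for i := 0; i < 1000; i++ {
		w.Add(1)
		go increment(&w)
	}
	w.Wait()
	fmt.Println("final value of x", x)
}

func main() {
	for i := 0; i < 5; i++ {
		run()
	}
}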

The root cause is that multiple threads are handling the same memory location, and nothing is making sure they don’t get in each other’s way.

The statement x = x + 1 (or holder.value = holder.value + 1 in Java) does three operations:

  1. Look up the value at the memory location of x
  2. Compute (that looked-up value) + 1
  3. Assign (that computed value) to the memory location of x

A problem here is that other operations from other threads may happen between these operations.

Two threads may both read 0 in step 1, compute 1 in step 2, and try to write 1 in step 3 (they did the same work).

Or one thread may read 0, all other 999 threads may do their work as we hope (incrementing x one at a time in turn to 999), then the first thread may do its step 3 and write 1 back to x, undoing the work of the other threads.

sequenceDiagram
    Thread0->>x: What's your value?
    x->>Thread0: 0
    Note right of Thread0: Other threads happen to get scheduled<br/>while Thread0 does its addition...
    Thread1->>x: What's your value?
    x->>Thread1: 0
    Thread1->>Thread1: 0 + 1 = 1
    Thread1->>x: Now you're 1
    Thread2->>x: What's your value?
    x->>Thread2: 1
    Thread2->>Thread2: 1 + 1 = 2
    Thread2->>x: Now you're 2
    Thread3->>x: What's your value?
    x->>Thread3: 2
    Thread3->>Thread3: 2 + 1 = 3
    Thread3->>x: Now you're 3
    Note right of x: And so on 999 times...
    Thread0->>Thread0: 0 + 1 = 1
    Thread0->>x: Now you're 1
    Note left of x: Oh no! All that work was undone!

There are three families of solutions to this problem:

  1. We can make sure no other thread is allowed to start doing these operations until one thread has completed all three. For this we would generally use a lock (also called a mutex).
  2. We can make sure all the operations actually happen from one thread, sending messages of work to do to that one thread to actually perform.
  3. We can ask the CPU to do one operation which will achieve all three effects without allowing any gaps between them.

These general approaches are the same in all languages, but how we do them is a bit different:

golangbot shows the locking solution by using a sync.Mutex.

golangbot also shows the single-thread solution by using a channel.

golangbot doesn’t show the “single operation” approach (using atomics) - we will explore this soon.

In Java, we would use the synchronized keyword to acquire a lock.

In Java, we’d typically use a MPSC (“Multi-Producer, Single-Consumer”) queue to send messages to a single thread. There isn’t one built into Java, but there are libraries which provide these (or we could build one ourselves on top of a thread-safe queue collections such as ConcurrentLinkedQueue).

In Java, we would use atomic number types for the “single operation” approach.
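
For concreteness, here is a minimal Go sketch of approach 1 (a lock) applied to the buggy counter above - it is essentially the solution the golangbot chapter walks through:

package main

import (
	"fmt"
	"sync"
)

var (
	mu sync.Mutex
	x  = 0
)

func increment(wg *sync.WaitGroup) {
	defer wg.Done()
	// Only one goroutine at a time can be between Lock and Unlock, so the
	// read-add-write sequence cannot interleave with another goroutine's.
	mu.Lock()
	x = x + 1
	mu.Unlock()
}

func main() {
	var w sync.WaitGroup
	for i := 0; i < 1000; i++ {
		w.Add(1)
		go increment(&w)
	}
	w.Wait()
	fmt.Println("final value of x", x)
}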

Atomics

The problem we’ve identified is that we may end up using a stale version of the x variable. After we look it up, it may change before we write back our new value.

Atomic operations allow us to avoid this situation.

“Atomic” means “as one unit which can’t be broken down”. An atomic operation can’t be broken down into smaller operations.

Go exposes atomic operations in the sync/atomic package. The atomic.Int32 type exposes an Add method which does the same as += does to an int, but looks a little different. Let’s compare:

// Non-atomic:
var x int
x += 1

// Atomic:
var x atomic.Int32
newX := x.Add(1)

Java exposes atomic operations in the java.util.concurrent.atomic package. The AtomicInteger type exposes an addAndGet function (as well as getAndAdd, incrementAndGet, and getAndIncrement, among others). addAndGet does the same as += does to an int, but looks a little different. Let’s compare:

// Non-atomic:
int x = 0;
x += 1;

// Atomic:
AtomicInteger x = new AtomicInteger(0);
x.addAndGet(1);

Both of these do the same thing. They start x with value 0, they add one to it, and at the end x will have the value 1.

The atomic version, though, does so as one operation: it’s impossible for another thread to do something in the middle of the steps. The non-atomic version gives no such guarantee.

This means that if there are other threads involved, the atomic version will always store back to x exactly 1 more than its value at the time of execution. It also returns a value - the value that was stored back. It does this because if we’re using atomics, we know there are probably other threads involved, so we know we can’t just read x and know what the result was - it may have been changed since we stored it.

Most atomic operations are implemented using Compare And Swap (“CAS”) operations. In fact, a lot of other synchronisation primitives (e.g. Mutexes) are also built on top of CAS operations.
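
To give a flavour of how this works, here is a minimal sketch of building an increment out of a CAS loop, assuming Go 1.19+'s atomic.Int32 type (for illustration only - in real code you would just call Add):

package main

import (
	"fmt"
	"sync/atomic"
)

// incrementWithCAS adds 1 to x using a Compare-And-Swap loop: read the
// current value, then only store the new value if x still holds the value
// we read. If another goroutine changed x in the meantime, retry.
func incrementWithCAS(x *atomic.Int32) int32 {
	for {
		old := x.Load()
		if x.CompareAndSwap(old, old+1) {
			return old + 1
		}
		// x changed between Load and CompareAndSwap - try again.
	}
}

func main() {
	var x atomic.Int32
	fmt.Println(incrementWithCAS(&x)) // prints 1
}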

Exercise

Try to modify the buggy code above to use atomics. Make sure that if you run it a few times, you always end up with 1000.

Memory barriers / fences

Under the covers, something involved in running our program (possibly your language’s standard library, possibly something in the operating system, possibly even something in the CPU) is inserting something called a memory barrier (also known as a fence). This is an instruction to say “it’s important that this operation is visible to other cores”, which forces different cores to synchronise their views of a piece of memory before continuing.

This can be slow. Recall that we have these per-core caches because they’re so much faster than sharing memory across cores. Every time we need to insert one of these memory barriers, we slow down our program. This is why memory access isn’t all atomic by default. But it can be worth it.

When are atomics good/bad to use?

Atomics can be really useful when you need to operate on a single item of small data (e.g. a number). They are simple and easy to reason about. But they typically operate on one item in memory. If you need to operate on larger data (e.g. a string), multiple pieces of memory (e.g. two different numbers which must be changed together), or perform an operation not supported by atomics (e.g. multiplication), you probably can’t use an atomic for that.

When using atomics, your threads are never blocked, and never need to wait for each other. Mutexes, on the other hand, block threads, which means that even though you may have started 100 threads to do work concurrently, only one will actually be doing anything (if they all need the same lock). This means that atomics can provide better throughput than other synchronisation primitives.

Mutexes

In our reading, we’ve seen locks (also called mutexes) used instead of an atomic to solve the same problem. Mutexes are more flexible - you choose when you lock and unlock them - but also more complicated - you need to make sure you unlock them in the right place. They are also prone to deadlock: blocking your program from making progress because two threads are waiting on locks held by each other.

If you have multiple operations which need to appear as if they were atomic (i.e. you must prevent other threads from taking action between the operations), a mutex is often a good way to achieve this.
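
As a minimal sketch (in Go, with two hypothetical counters that must always change together):

var (
	mu    sync.Mutex
	total int
	evens int
)

func record(n int) {
	mu.Lock()
	defer mu.Unlock()
	// Both updates happen under a single lock acquisition, so no other
	// thread can observe total updated without evens also being updated.
	total++
	if n%2 == 0 {
		evens++
	}
}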

Critical Sections

When using a mutex, it’s crucial to work out where the mutex needs to be acquired (“locked”) and where it needs to be released (“unlocked”).

In general, any time that some other thread changing something between operations would cause a problem, those operations need to be performed under the same lock acquisition.

Let’s take an example of tracking some statistics. Imagine we have a counter for total number of requests, and counters for successful and failed requests. We may want the invariant to always hold: (successful requests + failed requests) = total requests.

This code would be buggy:

type customerStats struct {
	failedReqs  int
	successReqs int
}

var (
	lock            sync.Mutex
	statsByCustomer = make(map[string]*customerStats)
)

func updateCustomerStats(customerId string, ok bool) {
	// Does customer stats object exist in map?
	lock.Lock()
	_, exists := statsByCustomer[customerId]
	lock.Unlock()

	// Create stats obj if necessary
	if !exists {
		lock.Lock()
		statsByCustomer[customerId] = &customerStats{}
		lock.Unlock()
	}

	lock.Lock()
	if ok {
		statsByCustomer[customerId].successReqs++
	} else {
		statsByCustomer[customerId].failedReqs++
	}
	lock.Unlock()
}
import java.util.HashMap;

class AllCustomerStats {
  static final class IndividualCustomerStats {
    public int failedReqs = 0;
    public int successReqs = 0;
  }

  // Guards access to statsByCustomer.
  Object lock = new Object();

  // Guarded by lock.
  HashMap<String, IndividualCustomerStats> statsByCustomer = new HashMap<>();

  public void updateCustomerStats(String customerId, boolean success) {
    boolean alreadyExists;
    synchronized(lock) {
      alreadyExists = statsByCustomer.containsKey(customerId);
    }

    if (!alreadyExists) {
      synchronized(lock) {
        statsByCustomer.put(customerId, new IndividualCustomerStats());
      }
    }

    synchronized(lock) {
      if (success) {
        statsByCustomer.get(customerId).successReqs++;
      } else {
        statsByCustomer.get(customerId).failedReqs++;
      }
    }
  }
}

This is because we release the lock between two things we want to appear as atomic. In this code:

go func() {
    updateCustomerStats("gina", true)
}()

go func() {
    updateCustomerStats("gina", false)
}()
class Main {
  public static void main(String[] args) throws Exception {
    AllCustomerStats stats = new AllCustomerStats();
    ArrayList<Thread> threads = new ArrayList<>();

    threads.add(new Thread(() -> { stats.updateCustomerStats("gina", true); }));
    threads.add(new Thread(() -> { stats.updateCustomerStats("gina", false); }));

    for (Thread thread : threads) {
      thread.start();
    }
    for (Thread thread : threads) {
      thread.join();
    }

    AllCustomerStats.IndividualCustomerStats ginaStats = stats.statsByCustomer.get("gina");
    System.out.printf("Total requests: %d%n", ginaStats.failedReqs + ginaStats.successReqs);
  }
}

both threads may do the “exists” check, see there’s no stats for that customer, and create a new empty stats object. Both will write the new stats object into the map, and one of them will overwrite the other, discarding the other thread’s data.

Instead we need to hold the lock across all of the operations that need to appear as atomic:

    lock.Lock()
    // Does customer stats object exist in map?
    _, exists := statsByCustomer[customerId]

    // Create stats obj if necessary
    if !exists {
        statsByCustomer[customerId] = &customerStats{}
    }
    lock.Unlock()
    synchronized(lock) {
      boolean alreadyExists = statsByCustomer.containsKey(customerId);
      if (!alreadyExists) {
        statsByCustomer.put(customerId, new IndividualCustomerStats());
      }
    }

The ++ operation is also a thread-safety risk, as we’ve seen before. But it doesn’t need to be part of the same critical section as the “check if it exists, and maybe write to the map” sequence: because the increment doesn’t need to be guarded by the same lock acquisition as the existence check, we could solve the ++ race condition with an atomic, a different lock, or the same lock - all of these are fine solutions.
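
Putting that together, a corrected Go version of the whole function might look like this sketch (holding one lock across the whole body for simplicity):

func updateCustomerStats(customerId string, ok bool) {
	lock.Lock()
	defer lock.Unlock()

	// The existence check and the creation happen under the same lock
	// acquisition, so no other goroutine can create (and later overwrite)
	// the stats object in between.
	stats, exists := statsByCustomer[customerId]
	if !exists {
		stats = &customerStats{}
		statsByCustomer[customerId] = stats
	}

	// The increment is guarded by the same lock here for simplicity; an
	// atomic or a different lock would also be fine for this part.
	if ok {
		stats.successReqs++
	} else {
		stats.failedReqs++
	}
}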

This is particularly important to consider when code is spread across functions. The following code has a similar bug to the one above:

type server struct {
    mu sync.Mutex
    totalRequests int
    successfulRequests int
    failedRequests int
}

func (s *server) recordResult(isSuccess bool) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if isSuccess {
        s.successfulRequests += 1
    } else {
        s.failedRequests += 1
    }
}

func (s *server) handleRequest() {
    s.mu.Lock()
    s.totalRequests += 1
    s.mu.Unlock()
    s.recordResult(true)
}
class Server {
  Object lock = new Object();

  public int totalRequests = 0;
  public int successfulRequests = 0;
  public int failedRequests = 0;

  void recordResult(boolean isSuccess) {
    synchronized(lock) {
      if (isSuccess) {
        successfulRequests++;
      } else {
        failedRequests++;
      }
    }
  }

  void handleRequest() {
    synchronized(lock) {
      totalRequests++;
    }
    recordResult(true);
  }
}

In the above case, the number of total requests isn’t modified under the same lock as the success/failed requests. This means if some other code computed a success rate by dividing the number of successful requests by the total number of requests, it may show <100% success when 100% of requests were successful!

πŸ“Note

In Java, this is also one of the reasons many people always use synchronized(lock) blocks, rather than adding the synchronized keyword to a method definition.

In Java you can write synchronized void recordResult(boolean isSuccess) and synchronized void handleRequest(), which automatically acquire the lock on this for the duration of the method. But this often leads to bugs, because people may need to hold the same lock over multiple method calls, which this pattern doesn’t allow.

Deadlock

A deadlock occurs when two or more threads are blocked forever, each waiting for each other.

One way this can happen is if two threads need to acquire two locks, but they acquire them in different orders. For instance:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu1 sync.Mutex
	var mu2 sync.Mutex

	var wg sync.WaitGroup
	wg.Add(2)

	go func() {
		mu1.Lock()
		defer mu1.Unlock()

		time.Sleep(time.Second)

		mu2.Lock()
		defer mu2.Unlock()

		fmt.Println("Goroutine 1 succeeded")
		wg.Done()
	}()

	go func() {
		mu2.Lock()
		defer mu2.Unlock()

		time.Sleep(time.Second)

		mu1.Lock()
		defer mu1.Unlock()

		fmt.Println("Goroutine 2 succeeded")
		wg.Done()
	}()

	wg.Wait()
}
import java.util.ArrayList;

class Main {
  public static void main(String[] args) throws Exception {
    Object lock1 = new Object();
    Object lock2 = new Object();

    ArrayList<Thread> threads = new ArrayList<>();

    threads.add(new Thread(() -> {
      synchronized(lock1) {
        try { Thread.sleep(1000); } catch (InterruptedException e) {}
        synchronized(lock2) {
          System.out.println("Thread 1 succeeded");
        }
      }
    }));

    threads.add(new Thread(() -> {
      synchronized(lock2) {
        try { Thread.sleep(1000); } catch (InterruptedException e) {}
        synchronized(lock1) {
          System.out.println("Thread 2 succeeded");
        }
      }
    }));

    for (Thread thread : threads) {
      thread.start();
    }
    for (Thread thread : threads) {
      thread.join();
    }
  }
}

Each thread acquires its first lock, then waits for its second lock - which the other thread has already acquired - so neither can ever make progress.

These examples are artificial, but they illustrate what can happen in real life.
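
A common way to avoid this particular kind of deadlock is to make every thread acquire the locks in the same order. Here is a minimal sketch of the Go example rewritten so that both goroutines take mu1 before mu2:

package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu1, mu2 sync.Mutex

	var wg sync.WaitGroup
	wg.Add(2)

	for i := 1; i <= 2; i++ {
		go func(n int) {
			defer wg.Done()

			// Both goroutines acquire mu1 first, then mu2, so neither can
			// end up holding one lock while waiting for the other.
			mu1.Lock()
			defer mu1.Unlock()

			mu2.Lock()
			defer mu2.Unlock()

			fmt.Println("Goroutine", n, "succeeded")
		}(i)
	}

	wg.Wait()
}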

Re-entrance

In some languages/libraries, locks are re-entrant. This means that if your thread already holds a particular lock, and tries to acquire it again, it will be allowed to do so.

In other languages/libraries, locks are not re-entrant. This means that if a thread tries to acquire a lock which is already acquired, it will block, no matter what thread already holds it.

Go locks are not re-entrant

In Go, this code would cause a deadlock, because one thread would be trying to acquire a lock it already holds. Notice that handleRequest calls recordResult while already holding the lock that recordResult will also try to acquire.

type server struct {
    mu sync.Mutex
    totalRequests int
    successfulRequests int
    failedRequests int
}

func (s *server) recordResult(isSuccess bool) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if isSuccess {
        s.successfulRequests += 1
    } else {
        s.failedRequests += 1
    }
}

func (s *server) handleRequest() {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.totalRequests += 1
    s.recordResult(true)
}

One pattern that’s often used is to have some functions require that a lock is already held in order to call them - something like:

type server struct {
    mu sync.Mutex
    totalRequests int
    successfulRequests int
    failedRequests int
}

// s.mu must be held to call this function.
// The compiler won't guarantee this is the case, but hopefully developers will see the name and the comment and do the right thing.
func (s *server) recordResult_locked(isSuccess bool) {
    if isSuccess {
        s.successfulRequests += 1
    } else {
        s.failedRequests += 1
    }
}

func (s *server) handleRequest() {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.totalRequests += 1
    s.recordResult_locked(true)
}

Java locks are re-entrant

The equivalent Java code would not deadlock, because its locks are re-entrant. When handleRequest calls recordResult, it will be allowed to go inside the synchronized block because the same thread already holds that lock.

class Server {
  Object lock = new Object();

  public int totalRequests = 0;
  public int successfulRequests = 0;
  public int failedRequests = 0;

  void recordResult(boolean isSuccess) {
    synchronized(lock) {
      if (isSuccess) {
        successfulRequests++;
      } else {
        failedRequests++;
      }
    }
  }

  void handleRequest() {
    synchronized(lock) {
      totalRequests++;
      recordResult(true);
    }
  }
}

Java can still deadlock, however - for instance if some code needs to acquire more than one lock and different threads acquire those locks in different orders, as in the earlier example.

Project: Cache with Stats

You are going to implement a cache.

A cache is used to store some result (probably because it was in some way expensive to compute/fetch) so that it can be looked up more cheaply in the future.

Caching (like many techniques) involves trade-offs. We are using some memory (by storing results) to save some other resource in the future (often compute time).

Note that caches can get very sophisticated - we’re going to write quite a simple one.

Often, when we use a cache, we want to limit how much memory it will use. This means we sometimes need to delete things from the cache. There are different policies for cache eviction; the one we are going to pick is “least recently used” (or LRU).

Imagine we have a limit of 3 items in our cache, and we already have three items in it. If we try to add an additional item, we first need to remove one. We order the entries by when they were most recently added or accessed, and remove the oldest.

We also want to keep statistics about our cache. We want to be able to ask the following questions:

  1. What’s the hit rate of our cache? i.e. when we tried to look up a value, how many times was it in the cache vs missing?
  2. How many entries were written to the cache and have never been read after being written (including for things which have been evicted)?
  3. What is the average number of times a cache entry is read (just for things currently in the cache)?
  4. How many reads and writes have been performed in the cache across all time (including for things which have been evicted)?

We want these statistics to be strongly consistent (e.g. if we only ever write to the cache and never read from it, the answers to 1 and 3 should be the same).

Note that this desire for strong consistency will shape our API. It means we probably want our statistics function to return a struct with all of the values, rather than individual methods to return each one, so that we can collect them all in one critical section.
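
For example, the statistics part of the Go API might look something like this sketch (the type, field, and method names here are only suggestions, not a required interface):

// Stats returns a consistent snapshot of all the statistics, collected
// under a single critical section.
func (c *Cache[K, V]) Stats() CacheStats { ... }

type CacheStats struct {
	Hits                 int     // reads that found a value in the cache
	Misses               int     // reads that found nothing
	NeverReadEntries     int     // entries written but never read (including evicted ones)
	AverageReadsPerEntry float64 // average reads per entry currently in the cache
	TotalReads           int     // all reads ever performed
	TotalWrites          int     // all writes ever performed
}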

Part of the interface for the cache should look like this:

func NewCache[K comparable, V any](entryLimit int) Cache[K, V] { ... }

// Put adds the value to the cache, and returns a boolean to indicate whether a value already existed in the cache for that key.
// If there was previously a value, it replaces that value with this one.
// Any Put counts as a refresh in terms of LRU tracking.
func (c *Cache[K, V]) Put(key K, value V) bool { ... }

// Get returns the value associated with the passed key, and a boolean to indicate whether a value was known or not. If not, nil is returned as the value.
// Any Get counts as a refresh in terms of LRU tracking.
func (c *Cache[K, V]) Get(key K) (*V, bool) { ... }
import java.util.Optional;

public class Cache<K, V> {
  public Cache(int entryLimit) {}

  /** Adds the value to the cache, and returns a boolean to indicate whether a value already existed in the cache for that key.
   *
   * If there was previously a value, it replaces that value with this one.
   * Any Put counts as a refresh in terms of LRU tracking.
   */
  public boolean put(K key, V value) {}

  /**
   * Gets a value from the cache.
   *
   * If the value was present, returns an Optional.of that value.
   * If the value was not present, returns an Optional.empty.
   *
   * Any get counts as a refresh in terms of LRU tracking.
   */
  public Optional<V> get(K key) {}
}

Implement this cache and write tests that show it is safe for concurrent use. (Note that these tests are hard to exhaustively write, but see what you can do).

πŸ“Note

There is tooling to help detect race conditions, for instance Go’s built-in race detector and Infer’s RacerD for Java. You may find these useful to help you find bugs.
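
As a starting point, a concurrency smoke test in Go might look something like this sketch (assuming the interface above); it makes no assertions itself - run it with go test -race and let the race detector report unsynchronised access:

import (
	"fmt"
	"sync"
	"testing"
)

func TestConcurrentPutAndGet(t *testing.T) {
	cache := NewCache[string, int](100)

	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Hammer a small set of keys from many goroutines at once.
			key := fmt.Sprintf("key-%d", i%10)
			cache.Put(key, i)
			cache.Get(key)
		}(i)
	}
	wg.Wait()
}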

Optimising Locks

By design, using locks limits how much we can do in parallel. Sometimes this slows things down too much!

After you have successfully written your cache, see if you can speed it up. Here are a few ideas you may want to research and experiment with (all of which are likely to depend on the usage pattern people have of your cache):

  • Using an RWMutex (Go) or ReadWriteLock (Java) - see the sketch after this list. When to use a Mutex vs an RWMutex is a trade-off! Make sure you can answer the question “When should I prefer an RWMutex and when should I prefer a Mutex?”. Earlier we talked about why all memory operations aren’t atomic. Think about the question “Why does Mutex exist at all if RWMutex exists?”
  • How many locks do you have? If you just have one for the whole cache, that means that any operation on the cache may be blocked by another. Could you have more than one lock to isolate changes so you’re less likely to be blocked by an unrelated operation? Remember the importance of considering the critical section when thinking about this.
  • Are there some guarantees we can loosen to require less locking (or locking at less busy times)? As an example, when do you remove old entries? Could you remove them in batches, or in a background goroutine, to avoid doing extra work on the write path? Could copying data under a read lock and then manipulating a copy allow you to do less work under a write lock? What trade-offs and risks do different approaches introduce?
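
For the first idea, here is a minimal sketch of how sync.RWMutex differs from sync.Mutex (the map-backed store and the get/put functions are just for illustration):

package main

import (
	"fmt"
	"sync"
)

var (
	mu   sync.RWMutex
	data = make(map[string]string)
)

func get(key string) (string, bool) {
	mu.RLock() // read lock: many readers may hold it concurrently
	defer mu.RUnlock()
	v, ok := data[key]
	return v, ok
}

func put(key, value string) {
	mu.Lock() // write lock: exclusive, blocks both readers and writers
	defer mu.Unlock()
	data[key] = value
}

func main() {
	put("greeting", "hello")
	fmt.Println(get("greeting"))
}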

Computing Cache

Some caches take control over the computation of values from keys. e.g. they may have an interface like:

func NewCache[K comparable, V any](entryLimit int, creator func(K) V) Cache[K, V] { ... }

func (c *Cache[K, V]) Get(key K) V { ... }
import java.util.function.Function;

public class Cache<K, V> {
  public Cache(int entryLimit, Function<K, V> creator) {}

  public V get(K key) {}
}

Here there is no explicit Put function; instead, if you Get a missing element, the cache will compute it for you, write it into the cache, and return the stored value.

This interface has some interesting differences in terms of concurrency. Some things to consider:

  1. How long may creator take to run? If a while, we probably don’t want to hold a lock during the entire call.
  2. We probably don’t want to run creator more than once for any particular key. How can we avoid this?

Try implementing this. Note: You don’t need to use exactly this interface, just the idea that Get may Put as a side-effect. Hint: Channels/MPSC queues may be useful!

Comparing Implementations

It’s important to be able to read code, and to consider the trade-offs when reading and writing code.

When you’ve implemented the cache projects, have a read of the sample solutions (note: these are currently only available in Go). There are four different implementations. With your own implementation, that’s at least five. Each is valid and works. Each has different trade-offs and is better suited for different constraints or use-cases.

Try to answer at least the following questions:

  • What are the differences between each implementation? For example:
    • How much work needs to be done to evict an entry? (What’s the big-O of evicting an entry?)
    • How much contention is there for locks when putting in the cache?
    • How much contention is there for locks when reading from the cache?
  • For each implementation, what use-cases is this implementation well-suited to? What use-cases would this implementation be particularly bad at? When would one of the other solutions be better? Some things to generally think about when considering trade-offs:
    • What is the: average memory consumption, peak memory consumption, average-case latency, worst-case latency, consistency of latency, for each implementation?
    • Think about usage patterns: which is better if callers are mostly reading? Mostly writing? Always reading? Alternating between reads and writes? Other usage patterns?
    • What guarantees can we offer with some implementations but not others? e.g. If we need to never use more than 10 MB of RAM, which implementations can guarantee that?