You're probably wrong about caching
There are only two hard things in Computer Science: cache invalidation and naming things. – Phil Karlton
Caching is a great tool. Lots of useful data fits easily in memory–so cache it! Improve your latencies, ease the load on your database, reduce your hardware costs! Can you spell free lunch?
Many of the costs of caching aren’t paid up-front. This makes caching seem very attractive–and to be clear, there are many situations where caching your data is a great option–but if you’re just looking to pick up some “quick wins,” caching is a bad place to start.
I believe we, as software developers, have a strong tendency to underestimate the complexity and issues caching brings with it, especially when we look at the oh-so-seductive early results of exploratory caching atop our data sources. As a result, we often cache before the benefits actually outweigh the costs.
Fine, why do you think caching is so hard?
I’m glad you asked.
Reasoning about cached data is harder
Caching fundamentally means that you no longer read from your source of truth. Whenever you see something unexpected in your data (say, while debugging during an incident), you now must ask “does this data match our source of truth?” Every read or write to a piece of cached data is subject to mismatch with the source of truth and this must often be taken into account when tracking down issues.
A new class of perspective bugs is possible with cached data
Not all data appears the same way to all users. For example, a list of “Best Articles” on a news site might depend on which user is logged in. A classic caching mistake is caching these perspective-dependent values and serving them to users who should see a different perspective. Avoiding this once is easy enough, but it is easy to reintroduce later as the code evolves, and it can lead to serious privacy or even security issues.
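To make the mistake concrete, here is a minimal sketch in Python (the function and key names are made up for illustration). The bug is simply a cache key that omits an input affecting the result:

```python
# Hypothetical sketch of a perspective bug: the cache key omits the user.
cache = {}

def compute_best_articles(user_id):
    # Stand-in for an expensive, per-user personalization query.
    return [f"best-article-for-{user_id}"]

def best_articles_broken(user_id):
    # BUG: the key ignores the user, so the first user's personalized
    # list is served to every user who follows.
    key = "best_articles"
    if key not in cache:
        cache[key] = compute_best_articles(user_id)
    return cache[key]

def best_articles_fixed(user_id):
    # Every input that affects the result belongs in the cache key.
    key = ("best_articles", user_id)
    if key not in cache:
        cache[key] = compute_best_articles(user_id)
    return cache[key]
```

Note that the fix is only as good as your discipline in keeping the key in sync with the function's inputs, which is exactly what tends to drift as code evolves.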
Reproducing behavior involving caching is harder
When you introduce caching, you also introduce a new layer in which behavior can differ from your expectations. Race conditions become possible where they weren’t before: items can expire from the cache when you don’t expect them to, and which objects are cached depends on access patterns that can vary by time of day or other factors. Issues can therefore appear in production with no obvious way to reproduce them when you try to fix them.
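One such race, sketched below as a single-threaded Python simulation of the interleaving (the data and names are illustrative): a request reads the source of truth, a write and invalidation land in between, and the request then populates the cache with its stale read.

```python
# Single-threaded simulation of a classic read/populate race.
db = {"count": 1}
cache = {}

# Request A misses the cache and reads the source of truth.
value_read_by_a = db["count"]      # reads 1

# Before A populates the cache, a writer updates the database
# and invalidates the (still empty) cache entry.
db["count"] = 2
cache.pop("count", None)           # no-op: nothing was cached yet

# Request A now writes its stale read into the cache. The cache and
# the source of truth disagree until the next invalidation or expiry.
cache["count"] = value_read_by_a
```

In a real system this window is a few microseconds wide and depends on scheduling, which is precisely why such bugs resist reproduction.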
Access pattern changes can subtly lower cache hit rates which damages performance
When access patterns change, so can your cache hit rate. As cache misses increase, latency rises and throughput can drop, yet overall traffic levels may stay the same, masking the cause and potentially overloading the underlying data source. Like any issue, this can be dealt with, but it makes certain incidents harder to handle.
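A back-of-the-envelope calculation shows how sensitive this is (the latency and traffic numbers here are assumptions for illustration, not measurements):

```python
def effective_latency_ms(hit_rate, cache_ms=1.0, source_ms=20.0):
    # Average latency across cache hits and misses.
    return hit_rate * cache_ms + (1.0 - hit_rate) * source_ms

def source_qps(total_qps, hit_rate):
    # Only misses reach the underlying data source.
    return total_qps * (1.0 - hit_rate)

# A drop from a 99% to a 90% hit rate more than doubles average latency...
print(effective_latency_ms(0.99))   # ≈ 1.19 ms
print(effective_latency_ms(0.90))   # ≈ 2.9 ms

# ...and sends 10x the traffic to the data source, at unchanged total load.
print(source_qps(10_000, 0.99))     # ≈ 100 qps
print(source_qps(10_000, 0.90))     # ≈ 1000 qps
```

The asymmetry is the point: a seemingly small hit-rate change multiplies load on the slowest component while the front-door traffic graph looks flat.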
In-process caching in garbage-collected languages increases GC pressure¹
This only applies to certain scenarios, like in-process caches on the JVM, but it serves as another example of how caching can introduce unexpected issues. In this case, large numbers of long-lived cached objects can get promoted into older generations of the garbage collector and increase both the run time of individual collections and the frequency at which they must happen.
Recovering from a failed cache is hard
Caches can let you scale up your serving capacity past what your underlying data source could serve alone, which is one reason to cache. Unfortunately, when your cache machines go down (or are unreachable on the network, or unresponsive, or…) you cannot simply bring them back online, as all data stored in memory will already be lost. You must warm your caches by reading from the underlying store while still trying to serve production traffic. Often your only choice will be to deny all but a fraction of traffic, slowly ramping up the amount you serve as your caches warm.
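That ramp-up is often an explicit policy: admit only a fraction of traffic after a cache restart and grow it as the cache warms. A minimal sketch (the starting fraction and ramp duration are made-up numbers; real systems would tie the ramp to observed hit rates):

```python
def admitted_fraction(seconds_since_restart, start=0.10, ramp_seconds=600):
    # Linearly ramp the admitted share of traffic from `start` to 100%
    # over `ramp_seconds`; the rest is shed or queued upstream while
    # the admitted requests warm the cache by faulting data in.
    if seconds_since_restart >= ramp_seconds:
        return 1.0
    return start + (1.0 - start) * (seconds_since_restart / ramp_seconds)

print(admitted_fraction(0))     # 0.1
print(admitted_fraction(300))   # ≈ 0.55
print(admitted_fraction(600))   # 1.0
```

Ramping on a wall-clock schedule is the crudest option; the design choice worth noting is that some deliberate admission control is needed either way, because serving full traffic against a cold cache simply forwards the full load to the data source the cache was protecting.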
Is it worth it?
It depends; the tradeoffs are yours to choose between.
But before you choose, consider that many of the downsides won’t manifest themselves right away. Don’t forget how easy it is to ignore these downsides and focus only on the benefits–the payoff is immediate, but the costs must be paid constantly throughout the cache’s lifetime.
¹ Garbage collector (GC) pressure happens when your application (running on the JVM, CLR, V8, or another garbage-collected runtime) allocates and then releases memory for many objects. The runtime must occasionally collect this garbage by walking your memory graph to determine which objects are still needed and which can be freed.
When your application produces and discards many objects in a short amount of time–say, from an in-memory cache–the garbage collector must interrupt your running code and collect dead objects more frequently, and each collection can take longer.