
134 points samuel246 | 1 comments
1. 0xbadcafebee No.44461229
Sometimes posts are so difficult to read that they're hard to respond to. I think I get what they're saying: that caching should be simple, or at least that it should be obvious how to cache in your particular situation, so that you don't need things like algorithms. But that argument is kind of nonsense, because really everything in software is an algorithm.

Caching is storing a copy of data in a place or way that it is faster to retrieve than it would be otherwise. Caching is not an abstraction; it is a computer science technique to achieve improved performance.
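As a trivial concrete illustration, Python's standard library ships this exact technique as `functools.lru_cache` — here memoizing a hypothetical slow lookup (`load_record` and its sleep are stand-ins, not anything from the post):

```python
import functools
import time

@functools.lru_cache(maxsize=128)
def load_record(key):
    # Stand-in for a slow retrieval (disk, network, database).
    time.sleep(0.01)
    return {"key": key, "value": key * 2}

load_record(7)   # miss: pays the slow-path cost
load_record(7)   # hit: served from the fast in-memory copy
print(load_record.cache_info().hits)  # → 1
```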

Caching does not make software simpler. In fact, it always, by necessity, makes software more complex. For example, there are:

  - Routines to look up data in a fast storage medium
  - Routines to retrieve data from a slow storage medium and store them in a fast storage medium
  - Routines to remove a cache entry when its expiration time is reached
  - Routines to remove cache entries if we run out of cache storage
  - Routines to remove the oldest unused cache entry
  - Routines to remove the newest cache entry
  - Routines to record when each cache entry was last accessed
  - Routines to remove cache entries which have been used the least
  - Routines to remove specific cache entries regardless of age
  - Routines to store data in the cache at the same time as slow storage
  - Routines to store data in cache and only write to slow storage occasionally
  - Routines to clear out the data and get it again on-demand/as necessary
  - Routines to inform other systems about the state of your cache
  - ...and many, many more
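To make that list concrete, here is a toy sketch (the names `ReadThroughCache` and `load` are invented for illustration) combining a handful of those routines — read-through lookup, TTL expiry, LRU eviction, and targeted invalidation:

```python
import time
from collections import OrderedDict

class ReadThroughCache:
    """Toy cache combining several routines from the list above."""

    def __init__(self, load, max_entries=4, ttl=60.0):
        self._load = load           # fetch from slow storage on a miss
        self._max = max_entries
        self._ttl = ttl
        self._data = OrderedDict()  # key -> (value, stored_at)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._data.get(key)
        if entry is not None:
            value, stored_at = entry
            if now - stored_at < self._ttl:   # still fresh: cache hit
                self._data.move_to_end(key)   # mark as most recently used
                return value
            del self._data[key]               # expired: drop and re-fetch
        value = self._load(key)               # miss: go to slow storage
        self._data[key] = (value, now)
        if len(self._data) > self._max:
            self._data.popitem(last=False)    # evict least recently used
        return value

    def invalidate(self, key):
        # Remove a specific entry regardless of age.
        self._data.pop(key, None)
```

Even this toy has to decide eviction order, freshness, and what happens on a miss — each bullet above is a design decision, not a free lunch.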
Each routine involves a calculation that determines whether the cache will be beneficial. A hit or miss leads to operations which may add or remove latency, may or may not run into consistency problems, and may or may not require remediation. The cache may need to be warmed up, or it may be fine starting cold. Clearing the cache (e.g. on restart) may cause such a drastic cascading failure that the system cannot be started again. And there is often a large amount of statistics and analysis needed to optimize a caching strategy.
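That "will it be beneficial" calculation can be sketched as a back-of-the-envelope expected-latency formula (the numbers below are invented for illustration; real systems need real measurements):

```python
def expected_latency(hit_rate, cache_ns, backing_ns):
    # On a miss you typically pay for the cache probe *and* the backing fetch.
    return hit_rate * cache_ns + (1 - hit_rate) * (cache_ns + backing_ns)

# Illustrative numbers: 100 ns cache probe, 10_000 ns backing store.
print(expected_latency(0.90, 100, 10_000))  # ≈ 1100 ns: roughly 9x faster
print(expected_latency(0.10, 100, 10_000))  # ≈ 9100 ns: barely helps at all
```

At a low hit rate the cache adds complexity (and probe latency) for almost no gain — which is exactly why the statistics matter.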

These are just a few of the considerations of caching. Cache invalidation is famously one of the two hard problems in computer science. How caching is implemented, and what it affects, can be very complex, and needs to be considered carefully. If you try to abstract it away, it usually leads to problems; if you don't try to abstract it away, that also leads to problems. Because of all that, abstracting caching away into a "general storage engine" is simply impossible in many cases.

Caching also isn't just having data in fast storage. Caching is cheating. You want to provide your data faster than your normal data storage (or transfer mechanism, etc.) can actually provide it. So you cheat, by copying it somewhere faster. And you cheat again, by trying to figure out how to look it up fast. And cheat again, by trying to figure out how to deal with its state being ultimately separate from the state of the "real" data in storage.

Basically, caching is us trying to be really clever and work around our inherent limitations. But often we're not as smart as we think we are, and our clever cheat can bite us. So my advice is to design your system to work well without caching. You will thank yourself later, when you watch a caching bug take down someone else's system and realize you dodged a bullet.