134 points samuel246 | 33 comments

ckdot2 ◴[] No.44458190[source]
"I think now caching is probably best understood as a tool for making software simpler" - that's cute. Caching might be beneficial for many cases, but if it doesn't do one thing then this is simplifying software. There's that famous quote "There are only two hard things in Computer Science: cache invalidation and naming things.", and, sure, it's a bit ironical, but there's some truth in there.
replies(11): >>44458265 #>>44458365 #>>44458502 #>>44459091 #>>44459123 #>>44459372 #>>44459490 #>>44459654 #>>44459905 #>>44460039 #>>44460321 #
1. EGreg ◴[] No.44458502[source]
I never understood the fuss about cache invalidation or naming things.

Both are not that difficult, honestly.

Aren’t there a lot of harder things out there?

replies(8): >>44458592 #>>44458650 #>>44458692 #>>44458868 #>>44458913 #>>44459031 #>>44459481 #>>44459828 #
2. gryfft ◴[] No.44458650[source]
It's a little bit tongue in cheek; no one is seriously suggesting it's harder than P=NP or the problem of consciousness. But there's something a bit "death and taxes" to the inevitability that any large enough project is going to have some corner cases involving these old chestnuts.

Heck, you can probably prove that any system for naming things is either inconsistent or incomplete.

replies(1): >>44459457 #
3. Valodim ◴[] No.44458692[source]
In my experience, the larger the software you write, the truer these become. At some point all obvious names will have collisions, and caching is crucial to get right but difficult to achieve, because it transcends the entire stack.

You could group these two things into "getting the data model right" as the single hard thing, perhaps that rings more true to you :)

4. quuxplusone ◴[] No.44458868[source]
For "only two hard problems," read "two candidates for among the hardest problems (but we feel strongly that these are indeed good candidates)," or something along those lines, more or less.

It's also possible that these used to be the only two hard problems at the time the aphorism was first recorded, but the underlying state of the world has changed since then and the aphorism, as recorded, is no longer current.

5. IshKebab ◴[] No.44458913[source]
Cache invalidation isn't hard in theory. It's just one of those things that is very easy to get subtly wrong and difficult to test.

Think about all those times your program isn't building and `make clean` fixes it.

replies(1): >>44459653 #
6. gpderetta ◴[] No.44459031[source]
Naming things is of course a bit tongue in cheek. But cache invalidation is hard. For example, MESI is allegedly one of the hardest things to validate in processor design.
7. TeMPOraL ◴[] No.44459457[source]
> no one is seriously suggesting it's harder than P=NP or the problem of consciousness.

Well, I for one feel that "naming things" ultimately boils down to the latter, which may or may not be harder than the former.

8. TOGoS ◴[] No.44459481[source]
There is a secret technique, called content-addressing[1], which elegantly solves both of them at once.

A lot of people haven't caught on, and try to cache things using ambiguous names, hence the struggle to invalidate their caches when the meaning changes.

[1] This can be applied even if you don't know the content yet; you just have to unambiguously name the inputs to the function that produces it. You might not know what all the inputs are, and then you have to start adding stuff like "unknown-unknown-2025-07-03T16", but it'll still basically work.
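
A minimal sketch of the idea in PHP, using APCu as the store (expensive_sum and the input names are illustrative, not from any particular library):

  <?php
  // Content-addressed cache: the key is a hash of the inputs, so a
  // change in any input automatically "invalidates" by producing a
  // different key. Old entries just age out as garbage.
  function cache_key(array $inputs): string {
      ksort($inputs);                  // canonical order for the hash
      return sha1(serialize($inputs)); // the name IS the content
  }

  function cached(array $inputs, callable $compute) {
      $key = cache_key($inputs);
      $hit = apcu_fetch($key, $ok);
      if ($ok) return $hit;
      $value = $compute();
      apcu_store($key, $value);
      return $value;
  }

  // Same inputs -> same key -> hit; changed inputs -> new key -> miss.
  $total = cached(['report' => 'q3', 'rev' => 42], fn() => expensive_sum());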

9. throwaway150 ◴[] No.44459653[source]
> Think about all those times your program isn't building and `make clean` fixes it.

I don't think that's a good example. I've worked with more than 20 different build tools by now, and I cannot recall a single instance where the problem actually came down to cache invalidation. Every time I dug into it, the real cause was something else: a mistake in the build script, an incorrectly declared dependency, or something similar.

So when you say "think about all those times", not a single such time comes to my mind!

replies(2): >>44460506 #>>44463400 #
10. ninalanyon ◴[] No.44459828[source]
Really? Have you tried building any substantial program that makes use of caching and succeeded in invalidating the cache both correctly and efficiently? It's not all about simple things like disk access; caching is also useful in software that models complex hardware, where properties depend on multitudes of interconnected calculated values that are time-consuming to compute and where you cannot predict which ones the client will ask for next.
replies(2): >>44460212 #>>44461595 #
11. EGreg ◴[] No.44460212[source]
Yes, I have. I've actually built a general-purpose cache system. Several, in fact:

Here's one for PHP: https://github.com/Qbix/Platform/blob/main/platform/classes/...

And here is for the Web: https://github.com/Qbix/Platform/blob/dc95bd85fa386e45546b0b...

I also discuss caching in the context of HTTP: https://community.qbix.com/t/caching-and-sessions/192

As for my writing about caching and invalidation, you should especially start here: https://community.qbix.com/t/qbix-websites-loading-quickly/2...

And then you can read about incremental updates: https://community.qbix.com/t/streams-plugin-messages-subscri...

replies(1): >>44460904 #
12. degamad ◴[] No.44460506{3}[source]
I think the idea is that if `make clean` allows the build to complete correctly, then the underlying issue is that the "mistake in the build script, incorrectly declared dependency, or similar" was causing the cached build results to not be invalidated when they should have been.
13. necovek ◴[] No.44460904{3}[source]
Your implementations are not cache systems: they are key-value store abstractions which can be used for caching. There is a simpler one built into most languages (associative arrays in PHP).

Caching becomes hard when such stores are used in distributed or concurrent contexts.

An example: imagine Qcache being used when fetching data from an SQL database. And data in the database changes with a direct SQL query from another process.

How will your key-value store know that the value in it is stale and needs refreshing?

replies(1): >>44460947 #
14. EGreg ◴[] No.44460947{4}[source]
Yes, they are cache systems. And as you say, they can be used for caching.

Caching systems know when something needs updating in several ways. One is pubsub: when the information is loaded from the cache, you want to set up a realtime websocket or webhook, for example, to reflect any changes since the cached info was shown. If you can manage to have a server running, you can simply receive new data (deltas) and update your cached data; in that case you can even have it ready and never stale by the time it is shown.

If you can’t run a server and don’t want to receive pushes, then another approach simply involves storing the latest ordinal or state for EACH cached item locally (ETags do this) and then, before rendering (or right after you do), bulk-sending the tags and seeing what has been updated, then pulling just that. The downside is that you may have a flash of old content if you show it optimistically.

If you combine the two methods, you can easily get an up-to-date syndication of a remote mutable data store.

My whole framework is built around such abstractions, to take care of them for people.
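
Roughly, the second method looks like this (a sketch with illustrative names such as fetch_changed, not the actual Q_Cache API):

  <?php
  // Bulk revalidation: each cached item remembers the etag it was
  // fetched with; after a request we ask the server, in one round
  // trip, which etags are stale, and re-pull only those.
  function revalidate(array $usedKeys): void {
      $tags = [];
      foreach ($usedKeys as $key) {
          $entry = apcu_fetch("item:$key", $ok);
          if ($ok) $tags[$key] = $entry['etag'];
      }
      // fetch_changed() stands in for one bulk call to the server;
      // it returns fresh ['value' => ..., 'etag' => ...] entries
      // only for the keys whose etags no longer match.
      foreach (fetch_changed($tags) as $key => $fresh) {
          apcu_store("item:$key", $fresh);
      }
  }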

replies(2): >>44461090 #>>44461131 #
15. necovek ◴[] No.44461090{5}[source]
We can agree to disagree on this being a "caching system" and it "knowing when something needs updating" (from the API, I am guessing it's a "set" method).

Phrases like "simply" and "never stale" are doing a lot of heavy lifting. Yet you haven't answered a very simple question I posted above: how would Q_Cache structure handle a direct SQL write query from another process on the database it is being used as a cache for?

SQL databases don't run websockets to external servers, nor do they fire webhooks. If you are running a server which you are hitting with every update instead of doing direct SQL on the DB, there's always a small amount of time where a cache user can see stale data before this server manages to trigger the update of the cache.

replies(2): >>44461123 #>>44462335 #
16. EGreg ◴[] No.44461123{6}[source]
Simple. Use the principles I said above, for distributed and asynchronous systems.

The SQL server should have a way to do pubsub. It can then notify your PHP webhook via HTTP, or run a script on the command line, when a row changes. It should have an interface to subscribe to these changes.

If your SQL data store lacks the ability to notify others that a row changed, then there’s your main problem. Add that functionality. Make a TRIGGER in SQL for instance and use an extension.

If MySQL really lacks this ability, then just put node.js middleware in front of it that will do this for you, and make sure that all MySQL transactions from all clients go through that middleware. Now your combined MySQL+Node server is adequate to help clients avoid stale data.

As I already said, if you for some reason refuse to handle push updates, then you have to store the latest row state (etags) in your PHP cache (APCu). And then, when you access a bunch of cached items during a request, at the end collect all their ids and etags and simply bulk-query node.js for any cached items that changed. You can use Bloom filters, Merkle trees, or Prolly trees to optimize the query.

Joel Gustafson had a great article on this and now uses it in gossiplog: https://joelgustafson.com/posts/2023-05-04/merklizing-the-ke...

replies(2): >>44461603 #>>44461641 #
17. ◴[] No.44461131{5}[source]
18. logical42 ◴[] No.44461595[source]
I agree with you, but you are omitting a key aspect of the problem which makes this hard. If you have a single server serving from a local sqlite database, caching and invalidating the cache is trivially easy.

It becomes way more difficult when you have N servers, all of which could potentially serve the same data. Local caches could then, yes, easily become stale, and even if you have a propagation mechanism, you couldn't guarantee against network failures or latency issues.

But suppose you have an HA redis instance that you store your cached data in. Even with a write-through cache, you basically need to implement a 2-phase commit to ensure consistency.
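
Even the single-key happy path needs optimistic locking, and it still isn't atomic across the two stores. A phpredis sketch (the table and key names are illustrative):

  <?php
  // Write-through with optimistic locking. WATCH makes EXEC fail if
  // another writer touched the key in between; without it, two
  // concurrent write-throughs can interleave and leave the cache
  // holding the older of the two values.
  function writeThrough(Redis $r, PDO $db, string $key, $value): void {
      do {
          $r->watch($key);
          $db->prepare('UPDATE items SET v = ? WHERE k = ?')
             ->execute([$value, $key]);      // source of truth first
          $ok = $r->multi()->set($key, $value)->exec();
      } while ($ok === false);               // lost the race: retry
      // A crash between the UPDATE and the EXEC still leaves DB and
      // cache inconsistent; hence the 2-phase-commit remark above.
  }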

replies(1): >>44462375 #
19. logical42 ◴[] No.44461603{7}[source]
Adding change data capture to a database isn't exactly trivial.
replies(1): >>44462249 #
20. tsimionescu ◴[] No.44461641{7}[source]
You're just pushing much of the complexity to that notification layer.

What if an SQL update changes a million rows, and you had cached the sum of those million rows? Should it send a million notifications? Should it re-run the sum for you? What if it's a complex operation and it takes a while to re-compute, and another update arrives before the re-compute finishes?

And of course, you will always have some partitions, so you will occasionally need to re-query the data anyway and re-establish any kind of pubsub system. And the complexity in efficiently reconciling two views of a large amount of data is a major issue, that you are just brushing off as "you can do various things to maybe optimize the query".

replies(1): >>44462223 #
21. EGreg ◴[] No.44462223{8}[source]
Yes, if you really want to show something as ambitious as a sum of a million rows, then you should absolutely move to an event-based, push-based architecture. Because once you’ve calculated the sum, the least work you would need to do is to periodically look at batches of UPDATEs to rows and update your sum based on that (rather than, say, running SUM again on the entire table). You can either pull a digest periodically, or you can handle incremental updates by listening for pushes.
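
Concretely, the incremental update is tiny (a PHP sketch; the event shape is illustrative):

  <?php
  // Keep a cached aggregate fresh from row deltas instead of
  // recomputing SUM() over the whole table on every change.
  function applyDelta(string $key, array $event): void {
      $sum = apcu_fetch($key, $ok);
      if (!$ok) return;   // nothing cached, nothing to maintain
      switch ($event['type']) {
          case 'insert': $sum += $event['new'];                 break;
          case 'delete': $sum -= $event['old'];                 break;
          case 'update': $sum += $event['new'] - $event['old']; break;
      }
      apcu_store($key, $sum);
  }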

All the difficulties you cite, including the “optimizations” I mentioned, stem from not using push, and insisting on periodically polling something.

If you step back, your whole problem is that you are using a system that only has a pull architecture and not a push one. You’re trying to put a bandaid on that core fact. You chose a relational database system that you refuse to build extensions for (as opposed to pretty much any other system) in order to insist “nyeh nyeh you haven’t solved cache invalidation”.

Caching is just one part of sync. The part that stores a non-authoritative, slightly out of date copy locally. And sync, or replication, is just one part of an eventually consistent system. I mean, even MySQL supports replication, so just hook your client up to that protocol (encrypted in transit) and boom, now you can update your cache.

Here’s the thing. If you use any network protocol that is open, e.g. HTTP, then yeah, it’s solved. Because you can have webhooks and just spin up a server to handle them, and update your cache. A row is removed? Decrement your sum. A row is added? Increment it.

You are just insisting that “no, our system has to be clunky and we refuse to implement webhooks / websockets / sockets / push / mysql replication client / updates of any kind, now solve it for me to be as good as a system which has push capabilities.”

22. EGreg ◴[] No.44462249{8}[source]
Well then, maybe consider making a SQL replication client that ingests the changes (like most databases, MySQL writes to a streaming append-only log before it is compacted). Just parse the log and act on the changes.

Not trivial, really? Here, enjoy: https://github.com/krowinski/php-mysql-replication

replies(2): >>44476165 #>>44476178 #
23. EGreg ◴[] No.44462335{6}[source]
I will answer your question directly, after having explained the core issue.

MySQL and PostgreSQL do in fact have push. It is called replication. The protocol is documented. So, much like a webhook with the HTTP protocol, you can ingest changes on a table with a long-running PHP client and then simply update your APCu cache entries if they are still stored and match that row. And all your PHP processes (managed by php-fpm or FrankenPHP or whatever) will benefit from that atomic APCu update.

There is nothing that says only another mysql server can be a mysql replication client.

Boom, solved. Even in your case. Note that the SQL database did in fact expect you to open a socket, and then it writes its log to that socket. That’s all the streaming events you need.
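
In PHP terms, the consumer is just a long-running loop. A sketch (read_row_event stands in for whatever the replication client library yields, and it assumes the consumer can reach the same APCu segment as the web workers; a separate CLI process would instead signal them, e.g. via an internal HTTP hook):

  <?php
  // Long-running consumer of the replication stream: every row event
  // drops the matching cache entry, so readers re-fetch fresh data.
  while ($event = read_row_event()) {        // blocks until next event
      if ($event['table'] === 'users') {
          apcu_delete('users:' . $event['row']['id']);
      }
  }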

But now you’re gonna say “but but no, you can’t have a long-running PHP client, we’re talking about a shared hosting environment where the host disables your ability to run a long PHP service that opens a socket”. Well, look. Most of the difficulty is that you’re intentionally trying to use the most clunky architectures just to “prove” that “cache invalidation is really hard.”

Shall we continue with Naming next? :)

replies(2): >>44463708 #>>44466341 #
24. EGreg ◴[] No.44462375{3}[source]
See the thread below. Caches only become stale if you refuse to implement any sort of push architecture. And even then, frequent batch-polling with etags will quickly resolve it (e.g. right after you use N entries, poll to see which of the N have changed).

Bloom filters, Prolly trees, etc. can speed the batch polling up. But push architecture obviates the need for it and prevents caches from being stale; it’s called eventual consistency.

(Well, of course, if there is a network partition and you can’t communicate with the machine holding the actual source of truth, then yeah, your local cache will be stale. And to catch up on missed updates you will need to request all updates since a certain marker you saved. If the upstream had a ton of updates since your marker and can’t send the entire history since your marker, then you might have to purge your cache for that specific table / collection, yeah.)

replies(1): >>44476174 #
25. IshKebab ◴[] No.44463400{3}[source]
> an incorrectly declared dependency

That literally IS cache invalidation. Incremental build systems are caching previous artefacts. They're supposed to invalidate the artefact if a dependency changes. But you forgot to declare the dependency... so it doesn't get invalidated when it should.
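
In miniature (a PHP sketch with illustrative names): the artefact is cached under a key derived only from the declared inputs, so an undeclared dependency never changes the key and the stale artefact keeps being served until you wipe the cache by hand:

  <?php
  // Toy incremental build: the artefact is cached under a hash of the
  // *declared* inputs' contents. If foo.c silently includes bar.h but
  // bar.h is not declared, editing bar.h leaves the key unchanged and
  // the stale artefact is reused; the cache is never invalidated.
  function build(array $declaredDeps, callable $compile): string {
      $key = sha1(implode('', array_map('md5_file', $declaredDeps)));
      $out = "cache/$key.o";
      if (!is_dir('cache')) mkdir('cache');
      if (!file_exists($out)) {
          file_put_contents($out, $compile()); // miss: actually rebuild
      }
      return $out;                             // hit: possibly stale!
  }

  // Usage (the compile step is a stand-in):
  // $obj = build(['foo.c'], fn() => my_compile('foo.c'));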

26. ◴[] No.44463708{7}[source]
27. necovek ◴[] No.44466341{7}[source]
I wasn't going to say any of that ;) Yes, Postgres does have a WAL you can subscribe to, but then you are really creating a replica. There are many nuances to doing that right, and you are downplaying how hard that is (it's not about the API, it's about ACID).

More importantly, you are pushing all the complex logic to this imaginary DB replication client: your "caching system" is still nothing more than a KV store.

replies(1): >>44467481 #
28. EGreg ◴[] No.44467481{8}[source]
Well, what are the “real” caching systems, if my system can be used for caching but isn’t “real”? This feels like a No True Scotsman argument.
replies(1): >>44467718 #
29. necovek ◴[] No.44467718{9}[source]
A caching system would be what you described in response to my example case: your key-value store combined with the entire mechanism to watch for DB writes (e.g. with triggers or the replication log) and tie them back to cached values whose lifetimes need to be managed.
30. logical42 ◴[] No.44476165{9}[source]
That means they did implement change data capture, and only exposed it for a very specific use case.
31. logical42 ◴[] No.44476174{4}[source]
I would love to be able to over-withdraw money from my bank account by having someone on the other side of the country take money out of my account at an ATM at the exact same time as me, because the bank's storage system was eventually consistent; but that's not the case.
32. logical42 ◴[] No.44476178{9}[source]
You didn't implement change data capture. You're using theirs.
replies(1): >>44477231 #
33. EGreg ◴[] No.44477231{10}[source]
I’m also not making my own programming language, I’m using PHP. And I’m using MySQL. What is your point?

Are you saying that “invalidating caches is hard… if you insist on using only systems with no way to push updates; and oh, if they do have a way to push updates, then you’re using someone’s library to consume the updates, so it doesn’t count”?