Caching is an abstraction, not an optimization

(buttondown.com)

135 points samuel246 | 1 comments | 01 Jul 25 10:42 UTC | HN request time: 0s | source

Show context

ckdot2 ◴[03 Jul 25 19:03 UTC] No.44458190[source]▶

"I think now caching is probably best understood as a tool for making software simpler" - that's cute. Caching might be beneficial for many cases, but if it doesn't do one thing then this is simplifying software. There's that famous quote "There are only two hard things in Computer Science: cache invalidation and naming things.", and, sure, it's a bit ironical, but there's some truth in there.

replies(11): >>44458265 #>>44458365 #>>44458502 #>>44459091 #>>44459123 #>>44459372 #>>44459490 #>>44459654 #>>44459905 #>>44460039 #>>44460321 #

EGreg ◴[03 Jul 25 19:41 UTC] No.44458502[source]▶

>>44458190 #

I never understood about cache invalidation or naming things

Both are not that difficult, honestly.

Aren’t there a lot harder things out there

replies(8): >>44458592 #>>44458650 #>>44458692 #>>44458868 #>>44458913 #>>44459031 #>>44459481 #>>44459828 #

ninalanyon ◴[03 Jul 25 23:00 UTC] No.44459828[source]▶

>>44458502 #

Really? Have you tried building any substantial program that makes use of caching and succeeded in invalidating the cache both correctly and efficiently? It's not all about simple things like disk access, caching is also useful in software that models complex hardware where properties depend on multitudes of interconnected calculated values that are time consuming to calculate and where you cannot predict which ones the client will ask for next.

replies(2): >>44460212 #>>44461595 #

EGreg ◴[04 Jul 25 00:23 UTC] No.44460212[source]▶

>>44459828 #

Yes, I have. I've actually built a general-purpose cache system. Several, in fact:

Here's one for PHP: https://github.com/Qbix/Platform/blob/main/platform/classes/...

And here is for the Web: https://github.com/Qbix/Platform/blob/dc95bd85fa386e45546b0b...

I also discuss caching in the context of HTTP: https://community.qbix.com/t/caching-and-sessions/192

  WRITING ABOUT CACHING AND INVALIDATION

You should especially start here: https://community.qbix.com/t/qbix-websites-loading-quickly/2...

And then you can read about incremental updates: https://community.qbix.com/t/streams-plugin-messages-subscri...

replies(1): >>44460904 #

necovek ◴[04 Jul 25 03:20 UTC] No.44460904[source]▶

>>44460212 #

Your implementations are not cache systems: they are key-value store abstractions which can be used for caching. There is a simpler one in most languages (IIRC, named "hashes" in PHP).

Caching becomes hard when such a stores are used in distributed or concurrent contexts.

An example: imagine Qcache being used when fetching data from an SQL database. And data in the database changes with a direct SQL query from another process.

How will your key-value store know that the value in it is stale and needs refreshing?

replies(1): >>44460947 #

EGreg ◴[04 Jul 25 03:33 UTC] No.44460947[source]▶

>>44460904 #

Yes, they are cache systems. And as you say, they can be used for caching.

Caching systems know when something needs updating several ways. One is pubsub: When the information is loaded from the cache, you want to set up a realtime websocket or webhook for example, to reflect any changes since the cached info was shown. If you can manage to have a server running, you can simply receive new data (deltas) and update your cached data — in that case you can even have it ready and never stale by the time it is shown.

If you can’t run a server and don’t want to reveive pushes then another approach simply involves storing the latest ordinal or state for EACH cached item locally (e-tags does this) and then before rendering (or right after you do) bulk-sending the tags and seeing what has been updated, then pulling just that. The downside is that you may have a flash of old content if you show it optimistically.

If you combine the two methods, you can easily get an up-to-date syndication of a remote mutable data store.

My whole framework is built around such abstractions, to take care of them for people.

replies(2): >>44461090 #>>44461131 #

necovek ◴[04 Jul 25 04:16 UTC] No.44461090[source]▶

>>44460947 #

We can agree to disagree on this being a "caching system" and it "knowing when something needs updating" (from the API, I am guessing it's a "set" method).

Phrases like "simply" and "never stale" are doing a lot of heavy lifting. Yet you haven't answered a very simple question I posted above: how would Q_Cache structure handle a direct SQL write query from another process on the database it is being used as a cache for?

SQL databases don't run websockets to external servers, nor do they fire webhooks. If you are running a server which you are hitting with every update instead of doing direct SQL on the DB, there's always a small amount of time where a cache user can see stale data before this server manages to trigger the update of the cache.

replies(2): >>44461123 #>>44462335 #

EGreg ◴[04 Jul 25 04:27 UTC] No.44461123[source]▶

>>44461090 #

Simple. Use the principles I said above, for distributed and asynchronous systems.

The SQL server should have a way to do pubsub. It can then notify your PHP webhook via HTTP, or run a script on the command line, when a row changes. It should have an interface to subscribe to these changes.

If your SQL data store lacks the ability to notify others that a row changed, then there’s your main problem. Add that functionality. Make a TRIGGER in SQL for instance and use an extension.

If MySQL really lacks this ability then just put node.js middleware in front of it that will do this for you. And make sure that all mysql transactions from all clients go through that middleware. Now your combined MySQL+Node server is adequate to help clients avoid stale data.

As I already said, if you for some reason refuse to handle push updates, then you have to store the latest row state (etags) in your PHP cache (apcu). And then when you access a bunch of cached items on a request, at the end collect all their ids and etags and simply bulk query node.js for any cached items that changed. You can use bloom filters or merkle trees or prolly trees to optimize the query.

Joel Gustafson had a great article on this and now uses it in gossiplog: https://joelgustafson.com/posts/2023-05-04/merklizing-the-ke...

replies(2): >>44461603 #>>44461641 #

logical42 ◴[04 Jul 25 06:12 UTC] No.44461603[source]▶

>>44461123 #

Adding change data capture to a database isn't exactly trivial.

replies(1): >>44462249 #

EGreg ◴[04 Jul 25 07:59 UTC] No.44462249[source]▶

>>44461603 #

Well then maybe consider making a sql replication client that would ingest the changes (like most databases, mysql writes to a straming append-only log, before it is compacted). Just parse the log and act on them.

Not trivial, really? Here, enjoy: https://github.com/krowinski/php-mysql-replication

replies(2): >>44476165 #>>44476178 #

logical42 ◴[05 Jul 25 22:44 UTC] No.44476178[source]▶

>>44462249 #

You didn't implement change data capture. You're using theirs.

replies(1): >>44477231 #

1. EGreg ◴[06 Jul 25 02:11 UTC] No.44477231[source]▶

>>44476178 #

I’m also not making my own programming language, I’m using PHP. And I’m using MySQL. What is your point?

Are you saying that “invalidating caches is hard… if you insist on using only systems with no way to push updates, and oh if they have a way to push updates then you’re using someone’s library to consume updates so it doesn’t count?”

↑