Both are not that difficult, honestly.
Aren’t there a lot harder things out there?
Here's one for PHP: https://github.com/Qbix/Platform/blob/main/platform/classes/...
And here is for the Web: https://github.com/Qbix/Platform/blob/dc95bd85fa386e45546b0b...
I also discuss caching in the context of HTTP: https://community.qbix.com/t/caching-and-sessions/192
WRITING ABOUT CACHING AND INVALIDATION
You should especially start here: https://community.qbix.com/t/qbix-websites-loading-quickly/2... And then you can read about incremental updates: https://community.qbix.com/t/streams-plugin-messages-subscri...
Caching becomes hard when such stores are used in distributed or concurrent contexts.
An example: imagine Qcache being used when fetching data from an SQL database, while the data in the database changes via a direct SQL query from another process.
How will your key-value store know that the value in it is stale and needs refreshing?
Caching systems know when something needs updating in several ways. One is pubsub: when the information is loaded from the cache, you set up a realtime channel (a websocket or a webhook, for example) to reflect any changes since the cached info was shown. If you can manage to have a server running, you can simply receive new data (deltas) and update your cached data; in that case it can even be ready, and never stale, by the time it is shown.
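To make the push approach concrete, here's a minimal sketch in Python. All the names here are illustrative (this is not an actual Qbix API): the point is that deltas arriving over the push channel update cached entries before anyone reads them.

```python
# Sketch of push-updated caching: the push channel (websocket/webhook
# handler) calls on_delta(), so cached entries are fresh by read time.

class PushUpdatedCache:
    def __init__(self):
        self._data = {}

    def load(self, key, fetch):
        # Populate from the authoritative store on first access.
        if key not in self._data:
            self._data[key] = fetch(key)
        return self._data[key]

    def on_delta(self, key, new_value):
        # Called by the push channel whenever the source changes.
        # Only update entries we are actually caching.
        if key in self._data:
            self._data[key] = new_value

db = {"user:1": {"name": "Alice"}}       # stand-in for the real store
cache = PushUpdatedCache()
cache.load("user:1", lambda k: db[k])

# A write happens at the source, and the push channel delivers it:
db["user:1"] = {"name": "Alicia"}
cache.on_delta("user:1", db["user:1"])

assert cache.load("user:1", lambda k: db[k]) == {"name": "Alicia"}
```

The key property: no reader ever triggers a refresh, because the write path itself pushed the update into the cache.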
If you can’t run a server and don’t want to receive pushes, then another approach simply involves storing the latest ordinal or state for EACH cached item locally (ETags do this), and then before rendering (or right after you do) bulk-sending the tags, seeing what has been updated, and pulling just that. The downside is that you may have a flash of old content if you show it optimistically.
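A toy sketch of that revalidation flow, with a plain dict standing in for the remote endpoint (all names here are made up for illustration):

```python
# Sketch of pull/revalidation: keep an ETag-like version per cached
# item, bulk-compare against the source, and refetch only what changed.

server = {"a": (3, "apple"), "b": (7, "banana")}   # key -> (version, value)
cache  = {"a": (3, "apple"), "b": (5, "ban")}      # local, possibly stale

def bulk_revalidate(cache, server):
    # Send all (key, version) pairs; only items whose version
    # differs come back.
    stale = [k for k, (ver, _) in cache.items() if server[k][0] != ver]
    for k in stale:
        cache[k] = server[k]   # pull just the changed items

bulk_revalidate(cache, server)
assert cache["b"] == (7, "banana")   # refreshed
assert cache["a"] == (3, "apple")    # untouched, no refetch needed
```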
If you combine the two methods, you can easily get an up-to-date syndication of a remote mutable data store.
My whole framework is built around such abstractions, to take care of them for people.
Phrases like "simply" and "never stale" are doing a lot of heavy lifting. Yet you haven't answered a very simple question I posted above: how would Q_Cache structure handle a direct SQL write query from another process on the database it is being used as a cache for?
SQL databases don't run websockets to external servers, nor do they fire webhooks. If you are running a server which you are hitting with every update instead of doing direct SQL on the DB, there's always a small amount of time where a cache user can see stale data before this server manages to trigger the update of the cache.
The SQL server should have a way to do pubsub. It can then notify your PHP webhook via HTTP, or run a script on the command line, when a row changes. It should have an interface to subscribe to these changes.
If your SQL data store lacks the ability to notify others that a row changed, then there’s your main problem. Add that functionality. Make a TRIGGER in SQL for instance and use an extension.
If MySQL really lacks this ability, then just put Node.js middleware in front of it that will do this for you, and make sure that all MySQL transactions from all clients go through that middleware. Now your combined MySQL+Node server is adequate to help clients avoid stale data.
As I already said, if you for some reason refuse to handle push updates, then you have to store the latest row state (ETags) in your PHP cache (APCu). Then, when you access a bunch of cached items on a request, at the end collect all their ids and ETags and simply bulk-query Node.js for any cached items that changed. You can use Bloom filters, Merkle trees or Prolly trees to optimize the query.
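As a rough illustration of the Merkle-style optimization (hypothetical two-level bucketing, not any particular library): compare per-bucket digests of the (key, tag) pairs first, and only diff keys in the buckets that disagree, instead of shipping every pair over the wire.

```python
import hashlib

# Sketch: two-level Merkle-style digest over (key, etag) pairs.
# If a bucket's hash matches, none of its keys changed; only
# disagreeing buckets need a key-by-key comparison.

def bucket_of(key, n=4):
    return hash(key) % n  # illustrative; a real system uses a stable hash

def digest(items, n=4):
    buckets = [[] for _ in range(n)]
    for k, tag in sorted(items.items()):
        buckets[bucket_of(k, n)].append(f"{k}:{tag}")
    return [hashlib.sha256("|".join(b).encode()).hexdigest() for b in buckets]

local  = {"a": 3, "b": 5, "c": 9}   # key -> etag as seen by the cache
remote = {"a": 3, "b": 7, "c": 9}   # key -> etag at the source

changed = [i for i, (l, r) in enumerate(zip(digest(local), digest(remote)))
           if l != r]
# Only keys in the disagreeing buckets need individual comparison:
to_check = [k for k in local if bucket_of(k) in changed]
assert "b" in to_check
```

With many keys per bucket (or a deeper tree), one round trip of a few hashes replaces sending every id+tag.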
Joel Gustafson had a great article on this and now uses it in gossiplog: https://joelgustafson.com/posts/2023-05-04/merklizing-the-ke...
What if an SQL update changes a million rows, and you had cached the sum of those million rows? Should it send a million notifications? Should it re-run the sum for you? What if it's a complex operation and it takes a while to re-compute, and another update arrives before the re-compute finishes?
And of course, you will always have some partitions, so you will occasionally need to re-query the data anyway and re-establish any kind of pubsub system. And the complexity of efficiently reconciling two views of a large amount of data is a major issue that you are just brushing off as "you can do various things to maybe optimize the query".
All the difficulties you cite, including the “optimizations” I mentioned, stem from not using push, and insisting on periodically polling something.
If you step back, your whole problem is that you are using a system that only has a pull architecture, not a push one. You’re trying to put a bandaid on that core fact. You chose a relational database system that you refuse to build extensions for (as opposed to pretty much any other system) in order to insist “nyeh nyeh, you haven’t solved cache invalidation”.
Caching is just one part of sync: the part that stores a non-authoritative, slightly out-of-date copy locally. And sync, or replication, is just one part of an eventually consistent system. I mean, even MySQL supports replication, so just hook your client up to that protocol (encrypted in transit) and boom, now you can update your cache.
Here’s the thing. If you use any network protocol that is open, e.g. HTTP, then yeah, it’s solved. Because you can have webhooks and just spin up a server to handle them, and update your cache. A row is removed? Decrement your sum. A row is added? Increment it.
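That increment/decrement idea, sketched (hypothetical names; a real handler would sit behind the webhook endpoint):

```python
# Sketch of incrementally maintaining a cached aggregate: instead of
# re-running SUM over all rows on every change, apply each row-change
# event to the cached total.

class CachedSum:
    def __init__(self, rows):
        self.total = sum(rows.values())   # one full scan at startup

    def on_insert(self, value):
        self.total += value

    def on_delete(self, value):
        self.total -= value

    def on_update(self, old, new):
        self.total += new - old

rows = {1: 10, 2: 20, 3: 30}
s = CachedSum(rows)            # total = 60

# Webhook fires: row 2 removed, row 4 added with value 5.
del rows[2]; s.on_delete(20)
rows[4] = 5;  s.on_insert(5)

assert s.total == sum(rows.values()) == 45
```

Each event is O(1) work, regardless of how many rows the aggregate covers.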
You are just insisting that “no, our system has to be clunky and we refuse to implement webhooks / websockets / sockets / push / mysql replication client / updates of any kind, now solve it for me to be as good as a system which has push capabilities.”
Not trivial, really? Here, enjoy: https://github.com/krowinski/php-mysql-replication
MySQL and PostgreSQL do in fact have push. It is called replication. The protocol is documented. So, much like a webhook over HTTP, you can ingest changes on a table with a long-running PHP client and then simply update your APCu cache entries, if they are still being stored and match that row. And all your PHP processes (managed by php-fpm or FrankenPHP or whatever) will benefit from that atomic APCu update.
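The pattern, sketched in Python with stand-in event objects (the real thing would consume binlog events via a replication client library and write to APCu rather than a dict):

```python
# Sketch of a replication-consumer loop: a long-running process reads
# row-change events from the database's replication stream and
# atomically updates matching cache entries. The event dicts and the
# cache dict are stand-ins for a real binlog client and APCu.

cache = {"users:1": {"name": "Alice"}, "users:2": {"name": "Bob"}}

def cache_key(event):
    return f"{event['table']}:{event['row']['id']}"

def consume(events, cache):
    for ev in events:
        key = cache_key(ev)
        if key not in cache:
            continue            # only rows we actually cache matter
        if ev["type"] == "delete":
            del cache[key]
        else:                   # insert / update
            cache[key] = {k: v for k, v in ev["row"].items() if k != "id"}

binlog = [
    {"type": "update", "table": "users", "row": {"id": 1, "name": "Alicia"}},
    {"type": "delete", "table": "users", "row": {"id": 2}},
]
consume(binlog, cache)
assert cache == {"users:1": {"name": "Alicia"}}
```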
There is nothing that says only another mysql server can be a mysql replication client.
Boom, solved. Even in your case. Note that the SQL database did in fact expect you to open a socket, and then it writes its log to that socket. That’s all the streaming events you need.
But now you’re gonna say “but but no, you can’t have a long-running PHP client, we’re talking about a shared hosting environment where the host disables your ability to run a long PHP service that opens a socket”. Well, look. Most of the difficulty is that you’re intentionally trying to use the clunkiest architectures just to “prove” that “cache invalidation is really hard.”
Shall we continue with Naming next? :)
More importantly, you are pushing all the complex logic to this imaginary DB replication client: your "caching system" is still nothing more than a KV store.
Are you saying that “invalidating caches is hard… if you insist on using only systems with no way to push updates, and oh if they have a way to push updates then you’re using someone’s library to consume updates so it doesn’t count?”