They don't have the full suite of GCS's capabilities (https://cloud.google.com/storage/docs/request-preconditions#...) but it's something.
I'm curious to hear if you have examples of any database using only object storage as a backend, because back when I started, I couldn't find any.
My approach on S3 would be to make sure the ETag of an object changes whenever other transactions looking at it must be blocked. That makes it easy to use conditional reads (https://docs.aws.amazon.com/AmazonS3/latest/userguide/condit...) on COPY or GET operations.
For writes, I would use PUT on a temporary staging area and then a conditional COPY + DELETE afterward. This is certainly slower than GCS, but I think it should work.
Locking without modifying the object is the part that needs some optimization though.
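For what it's worth, here's a rough boto3 sketch of those two primitives (bucket and key names are placeholders, and the retry/abort policy is left out):

    # Sketch of the ETag-based primitives described above, using boto3.
    # Bucket and key names are placeholders; retries/aborts are omitted.
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    BUCKET = "my-db-bucket"

    def conditional_read(key, expected_etag):
        """GET that fails with 412 if the object's ETag has changed."""
        try:
            resp = s3.get_object(Bucket=BUCKET, Key=key, IfMatch=expected_etag)
            return resp["Body"].read()
        except ClientError as e:
            if e.response["Error"]["Code"] == "PreconditionFailed":
                return None  # another transaction touched the object
            raise

    def staged_write(key, data):
        """PUT to a staging key, then COPY onto the target and clean up.

        Note: CopySourceIfMatch is a precondition on the *source*
        (staging) object, not on the target, which is why the target's
        ETag still has to be bumped explicitly for blocking to work.
        """
        staging_key = "staging/" + key
        put = s3.put_object(Bucket=BUCKET, Key=staging_key, Body=data)
        s3.copy_object(
            Bucket=BUCKET,
            Key=key,
            CopySource={"Bucket": BUCKET, "Key": staging_key},
            CopySourceIfMatch=put["ETag"],
        )
        s3.delete_object(Bucket=BUCKET, Key=staging_key)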
https://docs.datomic.com/operation/architecture.html
(However, they cheat with DynamoDB, lol)
There are also some listed here:
https://davidgomes.com/separation-of-storage-and-compute-and...
And as you mention, Datomic uses DynamoDB as well (so, not a pure S3 solution). What I'm proposing is to use only object storage for everything, pay the price in latency, but not give up on throughput, cost, or consistency. The differentiator is that this comes with strict serializability guarantees, so this is not an eventually consistent system (https://jepsen.io/consistency/models/strong-serializable).
No matter how sophisticated the caching is, if you want to retain strict serializability, writes must be confirmed by S3 and reads must be validated against S3 before returning, which puts a lower bound on latency.
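To make the read-validation part concrete, here is an illustrative sketch; the cache layout is my own assumption, not how GlassDB actually does it:

    # Illustrative only: even a cache hit costs one round trip to S3,
    # because the cached ETag has to be re-validated before returning.
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    BUCKET = "my-db-bucket"
    cache = {}  # key -> (etag, value); assumed layout, not GlassDB's

    def validated_read(key):
        if key in cache:
            etag, value = cache[key]
            try:
                # "Has this changed?" A 304 means the cache is still current.
                resp = s3.get_object(Bucket=BUCKET, Key=key, IfNoneMatch=etag)
            except ClientError as e:
                if e.response["ResponseMetadata"]["HTTPStatusCode"] == 304:
                    return value  # validated against S3, safe to return
                raise
        else:
            resp = s3.get_object(Bucket=BUCKET, Key=key)
        body = resp["Body"].read()
        cache[key] = (resp["ETag"], body)
        return body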
I focused a lot on throughput, which is the dimension we can actually optimize.
Hopefully that's clear from the blog, though.
Basically an in-memory database which uses S3 as cold storage. Definitely an interesting approach, but no transactions AFAICT.
Take a look at Delta Lake
https://notes.eatonphil.com/2024-09-29-build-a-serverless-ac...
* https://rockset.com/blog/separate-compute-storage-rocksdb/
* https://github.com/rockset/rocksdb-cloud
Keep in mind Rockset is definitely a bit biased towards vector search use cases.
BTW, the comparison was only there to give an idea about isolation levels; it wasn't meant to be a feature-to-feature comparison.
Perhaps I didn't make it prominent enough, but at some point I say that many SQL databases have key-value stores at their core, and implement a SQL layer on top (e.g. https://www.cockroachlabs.com/docs/v22.1/architecture/overvi...).
Basically, SQL can be a feature added later on top of a solid KV store as the base.
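As a toy illustration of that layering (a deliberate simplification, not CockroachDB's actual encoding):

    # Toy example: SQL rows flattened onto a key-value store.
    import json

    kv = {}  # stand-in for the underlying KV store

    def sql_insert(table, pk, row):
        # "INSERT INTO users VALUES (...)" becomes a single KV write.
        kv[table + "/" + str(pk)] = json.dumps(row)

    def sql_select_by_pk(table, pk):
        # "SELECT * FROM users WHERE id = ?" becomes a point lookup.
        raw = kv.get(table + "/" + str(pk))
        return json.loads(raw) if raw is not None else None

    sql_insert("users", 42, {"id": 42, "name": "ada"})
    print(sql_select_by_pk("users", 42))  # {'id': 42, 'name': 'ada'}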
Nicely detailed here: https://simonwillison.net/2024/Oct/13/zero-latency-sqlite-st... and https://developers.cloudflare.com/durable-objects/best-pract...
GlassDB is much more accessible for smaller-volume workloads, but gets very costly at high volume because of the per-transaction requests to S3. In turn, the consistency model is easier to reason about because the system is entirely stateless.
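To put rough numbers on it (approximate S3 us-east-1 list prices, and the requests-per-transaction counts are just assumptions):

    # Back-of-the-envelope only: prices are approximate us-east-1 list
    # prices, and the requests-per-transaction figures are assumptions.
    PUT_PRICE = 0.005 / 1000   # ~$ per PUT/COPY/POST request
    GET_PRICE = 0.0004 / 1000  # ~$ per GET request

    def monthly_request_cost(tx_per_second, puts_per_tx=3, gets_per_tx=3):
        tx_per_month = tx_per_second * 86400 * 30
        return tx_per_month * (puts_per_tx * PUT_PRICE + gets_per_tx * GET_PRICE)

    print(round(monthly_request_cost(1)))     # a few tens of dollars/month
    print(round(monthly_request_cost(1000)))  # tens of thousands/month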
I think DuckDB is very close to this. It's a bit different, because it's mostly for read-heavy workloads.
https://duckdb.org/docs/extensions/httpfs/s3api
(BTW great article, excellent read!)
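For example, with the DuckDB Python client you can query Parquet straight off S3 through httpfs (the bucket path and credentials below are placeholders):

    # Minimal sketch with the DuckDB Python client; bucket, prefix, and
    # credentials are placeholders.
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")
    con.execute("""
        CREATE SECRET (
            TYPE s3,
            KEY_ID 'AKIA_PLACEHOLDER',
            SECRET 'SECRET_PLACEHOLDER',
            REGION 'us-east-1'
        );
    """)

    # Read-heavy analytics straight off object storage, no download step.
    print(con.execute(
        "SELECT count(*) FROM read_parquet('s3://my-bucket/events/*.parquet')"
    ).fetchone())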
However, it will be much simpler with the new conditional writes.
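For example, the create-only flavor (PUT with If-None-Match: *) turns a claim into a single request; a minimal boto3 sketch, with placeholder names:

    # Minimal sketch of S3's newer conditional write (create-only PUT).
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def put_if_absent(bucket, key, data):
        """True if we created the object, False if it already existed."""
        try:
            s3.put_object(Bucket=bucket, Key=key, Body=data, IfNoneMatch="*")
            return True
        except ClientError as e:
            if e.response["Error"]["Code"] == "PreconditionFailed":
                return False  # another writer got there first
            raise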
But this is entirely possible. You can wrap GlassDB transactions and encode multiple keys into the same object at a higher level. Transactions across different objects will still preserve the same isolation.
The current version is meant to be a base from which to build higher-level APIs, somewhat like FoundationDB.
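Purely as an illustration of that wrapping, with a generic transactional get/set API standing in (not GlassDB's actual interface):

    # Hypothetical txn.get/txn.set API used only for illustration; it is
    # NOT GlassDB's real interface. Several logical keys live inside one
    # stored object, so they commit or abort together with that object.
    import json

    def update_fields(txn, object_key, updates):
        raw = txn.get(object_key)                 # transactional read
        doc = json.loads(raw) if raw else {}
        doc.update(updates)                       # touch multiple logical keys
        txn.set(object_key, json.dumps(doc))      # one isolated write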
Is it the cheapest possible storage in existence? No, not if you compare it to raw disks you put in a rack yourself, but I also feel that wouldn't be an entirely fair comparison.
The flipside is that Cloudflare DO will be a lot faster.
Interesting that all these similar solutions are popping up now.
I think it would be interesting to combine a SQLite-per-object approach with transactions across different objects on top.
When I moved from S3 to DO, my bill went from hundreds to $20/mo. The only thing that changed was the hosting provider.
> In Databricks service deployments, we use a separate lightweight coordination service to ensure that only one client can add a record with each log ID.
The key difference is that Delta Lake implements MVCC and relies on a total ordering of transaction IDs, which is something I didn't want to do, to avoid forced synchronization points (multiple clients need to fight for IDs). This is certainly a trade-off: in my case you are forced to read the latest version or retry (but then you get strict serializability), while in Delta Lake you can rely on snapshot isolation, which might give you slightly stale but consistent data and minimizes retries on reads.
It also seems that you can't get transactions across different tables? Another interesting tradeoff.
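To illustrate the synchronization point: with totally ordered IDs, every committer has to claim the next log entry, so concurrent writers race for it. A rough sketch, loosely modeled on Delta Lake's _delta_log naming and using S3's newer create-only PUT (in practice Delta Lake on S3 used the separate coordination service quoted above):

    # Illustrative sketch of contention on totally ordered log IDs.
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def try_claim(bucket, log_id, record):
        try:
            s3.put_object(
                Bucket=bucket,
                Key="_delta_log/%020d.json" % log_id,
                Body=record,
                IfNoneMatch="*",  # create-only: fails if this ID is taken
            )
            return True
        except ClientError as e:
            if e.response["Error"]["Code"] == "PreconditionFailed":
                return False  # another writer won this ID
            raise

    def commit(bucket, latest_id, record):
        log_id = latest_id + 1
        # Losers must retry with a higher ID (a real implementation would
        # also re-read the winning entries and re-check for conflicts).
        while not try_claim(bucket, log_id, record):
            log_id += 1
        return log_id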