Turbopuffer: Fast search on object storage

(turbopuffer.com)

379 points Sirupsen | 3 comments | 09 Jul 24 14:48 UTC


eknkc ◴[09 Jul 24 21:29 UTC] No.40921379[source]▶

Is there a good general-purpose solution where I can store a large read-only database in S3 or something similar and do lookups directly on it?

DuckDB can open Parquet files over HTTP and query them, but I found that it triggers a lot of small requests, reading from a bunch of places in the files. I mean a lot.

I mostly need key/value lookups and could potentially store each key in a separate object in S3, but with a couple hundred million objects it would be a lot more manageable to have a single file and maybe a cacheable index.
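A minimal sketch of the "single file plus cacheable index" idea, in plain Python with no real S3 calls. Everything here is an assumption made for illustration: the names `build_packed_file`, `range_read`, and `lookup` are hypothetical, and `range_read` only slices an in-memory blob where a real setup would issue an HTTP GET with a `Range` header against the object.

```python
import io
import json


def build_packed_file(records):
    """Pack all values into one blob.

    Returns (blob, index), where index maps key -> (offset, length).
    """
    buf = io.BytesIO()
    index = {}
    for key in sorted(records):
        value = records[key]
        index[key] = (buf.tell(), len(value))
        buf.write(value)
    return buf.getvalue(), index


def range_read(blob, offset, length):
    # Stand-in for an HTTP GET with "Range: bytes=<offset>-<offset+length-1>".
    return blob[offset:offset + length]


def lookup(blob, index, key):
    """One index probe, then one ranged read for the value."""
    entry = index.get(key)
    if entry is None:
        return None
    offset, length = entry
    return range_read(blob, offset, length)


# Hypothetical sample data.
records = {
    "user:1": b'{"name":"ada"}',
    "user:2": b'{"name":"bob"}',
}
blob, index = build_packed_file(records)

# The index is small relative to the data, so it can be stored as its own
# object and cached by clients.
index_bytes = json.dumps(index).encode()

print(lookup(blob, index, "user:2"))
```

With hundreds of millions of keys the index itself would probably need to be sharded or made sparse (for example, one entry per block of sorted keys), but the access pattern stays the same: one cached index probe, then one ranged read.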

replies(5): >>40922137 #>>40922166 #>>40922842 #>>40923712 #>>40927099 #

imiric ◴[09 Jul 24 22:58 UTC] No.40922166[source]▶

>>40921379 #

ClickHouse can also read from S3. I'm not sure how it compares to DuckDB re efficiency, but it worked fine for my simple use case.

replies(1): >>40922670 #

1. masterj ◴[10 Jul 24 00:23 UTC] No.40922670[source]▶

>>40922166 #

Neither of these support indexes afaik. They are designed to do fast scans / computation.

replies(2): >>40923656 #>>40930836 #

2. hodgesrm ◴[10 Jul 24 04:15 UTC] No.40923656[source]▶

>>40922670 (TP) #

It depends on what you mean by "support." ClickHouse, as I recall, can read min/max indexes from Parquet row groups. One of my colleagues is working on a PR to add support for bloom filter indexes, so that will be covered as well.
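To illustrate what min/max statistics buy you, here is a rough sketch of row-group skipping in plain Python. It is not ClickHouse's implementation; `RowGroupStats` and `candidate_row_groups` are hypothetical names, and the statistics are hard-coded where a real reader would take them from the Parquet footer.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    group_id: int
    min_key: int
    max_key: int


def candidate_row_groups(stats, key):
    """Return ids of row groups whose [min, max] range could contain key.

    Groups outside the range are skipped without being read. Groups inside
    the range still have to be scanned, since min/max only rules groups out.
    """
    return [s.group_id for s in stats if s.min_key <= key <= s.max_key]


# Hypothetical statistics for three row groups; ranges may overlap.
stats = [
    RowGroupStats(0, 1, 100),
    RowGroupStats(1, 101, 200),
    RowGroupStats(2, 150, 300),
]

print(candidate_row_groups(stats, 175))
```

This only prunes well when the data is sorted or clustered on the lookup key. With randomly distributed keys nearly every row group's range covers the key, which is where a bloom filter helps.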

Right now one of the main performance problems is that ClickHouse does not cache Parquet index metadata yet, so you still have to scan files rather than keeping the metadata in memory. ClickHouse already keeps this metadata in memory for native MergeTree tables. There are a couple of steps to get there, but I have no doubt that metadata caching will be properly handled soon.

Disclaimer: I work for Altinity, an enterprise provider for ClickHouse software.

3. orthecreedence ◴[10 Jul 24 19:54 UTC] No.40930836[source]▶

>>40922670 (TP) #

Depends on what you mean by "indexes." DuckDB can read path parameters (e.g. s3://my-bucket/category=beverages/month=2022-01-01/*/*.parquet) where `category` and `month` can be filtered at the query level, skipping any non-matching files. I think that qualifies as an index. Obviously, you'd have to create these up-front, or risk moving lots of data between paths.
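A small sketch of how this kind of path-based pruning works in principle, in plain Python. This is not DuckDB's code; `parse_partitions` and `prune` are hypothetical helpers, and the object names under each partition are made up for the example.

```python
def parse_partitions(path):
    """Extract key=value segments from a path into a dict."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def prune(paths, **filters):
    """Keep only paths whose partition values match every filter.

    Non-matching files are dropped based on their path alone, so they are
    never opened or downloaded.
    """
    return [
        p for p in paths
        if all(parse_partitions(p).get(k) == v for k, v in filters.items())
    ]


# Hypothetical object listing.
paths = [
    "s3://my-bucket/category=beverages/month=2022-01-01/a/part-0.parquet",
    "s3://my-bucket/category=beverages/month=2022-02-01/a/part-0.parquet",
    "s3://my-bucket/category=snacks/month=2022-01-01/a/part-0.parquet",
]

print(prune(paths, category="beverages", month="2022-01-01"))
```

The trade-off is the one mentioned above: the partition keys are baked into the layout, so a lookup on any other column gets no help from the paths.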
