
eknkc (No.40921379)
Is there a good general-purpose solution where I can store a large read-only database in S3 or something and do lookups directly on it?

DuckDB can open Parquet files over HTTP and query them, but I found that it triggers a lot of small requests, reading from many different places in the files. I mean a lot.

I mostly need key/value lookups and could potentially store each key as a separate object in S3, but with a couple hundred million objects that gets unwieldy. It would be a lot more manageable to have a single file and maybe a cacheable index.
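
For reference, this is roughly the shape of lookup in question, as a DuckDB sketch (the bucket, file, and column names are placeholders; it assumes S3 credentials are already configured and that the file is sorted by key so row-group min/max statistics can prune most of the reads):

    INSTALL httpfs;
    LOAD httpfs;

    -- hypothetical single large Parquet file, sorted by key
    -- DuckDB fetches the footer and row-group metadata first, then issues
    -- range requests only for row groups whose statistics can contain the key
    SELECT value
    FROM read_parquet('s3://my-bucket/kv.parquet')
    WHERE key = 'some-key';

Even with that pruning, each lookup still needs the footer plus at least one row group, which is where the pile of small range requests comes from.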

imiric (No.40922166)
ClickHouse can also read from S3. I'm not sure how it compares to DuckDB re efficiency, but it worked fine for my simple use case.
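
As a rough sketch of that (placeholder URL and column names; for a private bucket, credentials go in as extra arguments to s3(), and public buckets can use NOSIGN):

    -- ClickHouse reading Parquet straight from a bucket via the s3() table function
    SELECT key, value
    FROM s3('https://my-bucket.s3.amazonaws.com/data/kv.parquet', 'Parquet')
    WHERE key = 'some-key';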
masterj (No.40922670)
Neither of these supports indexes, AFAIK. They are designed for fast scans / computation.
orthecreedence (No.40930836)
Depends on what you mean by "indexes." DuckDB can read path parameters (e.g. s3://my-bucket/category=beverages/month=2022-01-01/*/*.parquet) where `category` and `month` can be filtered at the query level, skipping any non-matching files. I think that qualifies as an index. Obviously, you'd have to create these partitions up-front, or risk moving lots of data between paths.
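
A rough sketch of that, reusing the hypothetical layout from the path above (with hive_partitioning, the category=... / month=... path segments show up as ordinary columns, and the WHERE clause prunes whole prefixes before any Parquet data is fetched):

    SELECT count(*)
    FROM read_parquet('s3://my-bucket/**/*.parquet', hive_partitioning = true)
    WHERE category = 'beverages'
      AND month = '2022-01-01';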