
eknkc (No.40921379)
Is there a good general-purpose solution where I can store a large read-only database in S3 or something and do lookups directly on it?

DuckDB can open Parquet files over HTTP and query them, but I found that it triggers a lot of small requests, reading from many different places in the files. I mean a lot.

I mostly need key/value lookups and could potentially store each key as a separate object in S3, but with a couple hundred million objects that gets unwieldy. It would be a lot more manageable to have a single file and maybe a cacheable index.
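
For reference, this is roughly the shape of lookup in question, as a DuckDB sketch (the bucket, file, and column names are placeholders; it assumes S3 credentials are already configured and that the file is sorted by key so row-group min/max statistics can prune most of the reads):

    INSTALL httpfs;
    LOAD httpfs;

    -- hypothetical single large Parquet file, sorted by key
    -- DuckDB fetches the footer and row-group metadata first, then issues
    -- range requests only for row groups whose statistics can contain the key
    SELECT value
    FROM read_parquet('s3://my-bucket/kv.parquet')
    WHERE key = 'some-key';

Even with that pruning, each lookup still needs the footer plus at least one row group, which is where the pile of small range requests comes from.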

imiric (No.40922166)
ClickHouse can also read from S3. I'm not sure how it compares to DuckDB re efficiency, but it worked fine for my simple use case.
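
As a rough sketch of that (placeholder URL and column names; for a private bucket, credentials go in as extra arguments to s3(), and public buckets can use NOSIGN):

    -- ClickHouse reading Parquet straight from a bucket via the s3() table function
    SELECT key, value
    FROM s3('https://my-bucket.s3.amazonaws.com/data/kv.parquet', 'Parquet')
    WHERE key = 'some-key';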
masterj (No.40922670)
Neither of these supports indexes, AFAIK. They are designed for fast scans / computation.
orthecreedence (No.40930836)
Depends on what you mean by "indexes." DuckDB can read path parameters (e.g. s3://my-bucket/category=beverages/month=2022-01-01/*/*.parquet) where `category` and `month` can be filtered at the query level, skipping any non-matching files. I think that qualifies as an index. Obviously, you'd have to create these partitions up-front, or risk moving lots of data between paths.
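
A rough sketch of that, reusing the hypothetical layout from the path above (with hive_partitioning, the category=... / month=... path segments show up as ordinary columns, and the WHERE clause prunes whole prefixes before any Parquet data is fetched):

    SELECT count(*)
    FROM read_parquet('s3://my-bucket/**/*.parquet', hive_partitioning = true)
    WHERE category = 'beverages'
      AND month = '2022-01-01';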