
379 points | Sirupsen | 1 comment
eknkc No.40921379
Is there a good general-purpose solution where I can store a large read-only database in S3 or something and do lookups directly on it?

DuckDB can open Parquet files over HTTP and query them, but I found it triggers a lot of small requests, reading from a bunch of places in the files. I mean a lot.

I mostly need key/value lookups and could potentially store each key as a separate object in S3, but with a couple hundred million objects that gets unwieldy. It would be a lot more manageable to have a single file and maybe a cacheable index.
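
To make it concrete, this is the kind of lookup I mean: a rough sketch with DuckDB's Python API and its httpfs extension (bucket, path, region, and column names are placeholders; credentials go in via SET as needed):

    # Rough sketch: point DuckDB at a Parquet file on S3 and do a key lookup.
    # Bucket/path, region, and column names are placeholders; set credentials
    # with SET s3_access_key_id / s3_secret_access_key if the bucket is private.
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")
    con.execute("SET s3_region = 'us-east-1';")

    row = con.execute(
        "SELECT value FROM read_parquet('s3://my-bucket/kv/data.parquet') "
        "WHERE key = ?",
        ["some-key"],
    ).fetchone()
    print(row)

Every lookup like this turns into several separate range reads (footer, then metadata, then the matching row groups), which is where all the small requests come from.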

imiric No.40922166
ClickHouse can also read from S3. I'm not sure how it compares to DuckDB re efficiency, but it worked fine for my simple use case.
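
In case it helps, roughly what that looks like: an untested sketch using the clickhouse-connect Python driver and ClickHouse's s3() table function (bucket, path, and column names are placeholders, and a private bucket also needs credentials):

    # Untested sketch: query a Parquet file on S3 through ClickHouse's s3()
    # table function via the clickhouse-connect driver. Bucket/path and
    # column names are placeholders.
    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")
    result = client.query(
        "SELECT value "
        "FROM s3('https://my-bucket.s3.amazonaws.com/kv/data.parquet', 'Parquet') "
        "WHERE key = 'some-key'"
    )
    print(result.result_rows)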
masterj No.40922670
Neither of these supports indexes, afaik. They are designed to do fast scans/computation.
hodgesrm No.40923656
It depends on what you mean by "support." ClickHouse, as I recall, can read min/max indexes from Parquet row groups, and one of my colleagues is working on a PR to add support for bloom filter indexes, so that will be covered as well.

Right now one of the main performance problems is that ClickHouse does not cache index metadata yet, so you still have to scan files rather than keeping the metadata in memory. ClickHouse already does this for native MergeTree tables. There are a couple of steps to get there, but I have no doubt that metadata caching will be properly handled soon.
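
To illustrate the difference: if you copy the data into a native MergeTree table ordered by the lookup key, the primary index stays in memory and a point lookup only reads the relevant granules, instead of re-reading file metadata on every query. A rough sketch (untested; table name, bucket/path, and column names are placeholders):

    # Untested sketch: load S3 Parquet data into a native MergeTree table
    # ordered by the lookup key so the primary-key index is kept in memory.
    # Table name, bucket/path, and column names are placeholders.
    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    client.command(
        "CREATE TABLE IF NOT EXISTS kv (key String, value String) "
        "ENGINE = MergeTree ORDER BY key"
    )
    client.command(
        "INSERT INTO kv SELECT key, value "
        "FROM s3('https://my-bucket.s3.amazonaws.com/kv/data.parquet', 'Parquet')"
    )

    result = client.query("SELECT value FROM kv WHERE key = 'some-key'")
    print(result.result_rows)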

Disclaimer: I work for Altinity, an enterprise provider for ClickHouse software.