
29 points ignaciovdk | 2 comments

Hi HN, I’m Ignacio, founder at Basekick Labs.

Over the past few months I’ve been building Arc, a time-series data platform designed to combine very fast ingestion with strong analytical query performance.

What Arc does:

- Ingests via a binary MessagePack API (the fast path; sketched below)
- Accepts Line Protocol for compatibility with existing tools (like InfluxDB; I'm an ex-Influxer)
- Stores data as Parquet with hourly partitions
- Queries via the DuckDB engine using SQL
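
To make the fast path concrete, here's a minimal sketch of what a MessagePack write could look like from Python. The endpoint path, port, and payload shape below are placeholders for illustration, not Arc's actual wire format; see the repo for the real API.

    import time

    import msgpack
    import requests

    # One batch: measurement name, nanosecond timestamp, tags, fields.
    # The payload shape is hypothetical.
    batch = [{
        "m": "cpu",
        "t": int(time.time() * 1e9),
        "tags": {"host": "web-01", "region": "us-east"},
        "fields": {"usage": 87.5},
    }]

    resp = requests.post(
        "http://localhost:8080/write",  # hypothetical endpoint and port
        data=msgpack.packb(batch),      # binary MessagePack body
        headers={"Content-Type": "application/msgpack"},
    )
    resp.raise_for_status()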

Why I built it:

Many systems force you to trade off retention, throughput, and operational complexity against one another. I wanted something where ingestion performance doesn’t kill your analytics.

Performance & benchmarks so far:

- Write throughput: ~1.88M records/sec (MessagePack, untuned) on my M3 Pro Max (14 cores, 36 GB RAM)
- ClickBench on AWS c6a.4xlarge: 35.18 s cold, ~0.81 s hot (43/43 queries succeeded)

In those runs caching was disabled to match benchmark rules; enabling the cache in production gives ~20% faster repeated queries.
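
For a feel of the query side: hourly Parquet partitions can be scanned straight from DuckDB with plain SQL. A minimal sketch, where the directory layout and column names are illustrative:

    import duckdb

    con = duckdb.connect()
    # Path layout and column names here are illustrative only.
    rows = con.execute("""
        SELECT host, avg(usage) AS avg_usage
        FROM read_parquet('data/cpu/2025/10/07/*.parquet')
        GROUP BY host
        ORDER BY avg_usage DESC
    """).fetchall()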

I’ve open-sourced the Arc repo so you can dive into implementation, benchmarks, and code. Would love your thoughts, critiques, and use-case ideas.

Thanks!

bormaj No.45510476
Exciting project and definitely something I'd like to explore using. I particularly like the look of the API ergonomics. A few questions:

- Is the schema inferred from the data?
- Can/does the schema evolve?
- Are custom partitions supported?
- Is there a roadmap for future features?

replies(1): >>45510935 #
1. ignaciovdk No.45510935
Thanks! Let’s take it part by part, as Jason would say.

Schema inference: yes, Arc infers the schema automatically from incoming data (both for MessagePack and Line Protocol). Each measurement becomes a table, and fields/tags map to columns.
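
As a toy illustration of that mapping (this is not Arc's parser; real Line Protocol also handles escaping, quoted strings, and typed field suffixes):

    def parse_line(line: str) -> dict:
        """Map one Line Protocol point to a table name and columns."""
        head, fields_part, ts = line.rsplit(" ", 2)
        measurement, *tag_pairs = head.split(",")
        tags = dict(p.split("=", 1) for p in tag_pairs)
        fields = {k: float(v)
                  for k, v in (p.split("=", 1) for p in fields_part.split(","))}
        return {"table": measurement,
                "columns": {**tags, **fields, "time": int(ts)}}

    print(parse_line("cpu,host=web-01 usage=87.5 1700000000000000000"))
    # {'table': 'cpu', 'columns': {'host': 'web-01', 'usage': 87.5,
    #  'time': 1700000000000000000}}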

Schema evolution: supported. New fields can appear at any time; they’re added to the Parquet schema automatically, with no migration or downtime.
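
A sketch of the general idea (illustrative, not a dump of Arc's internals): DuckDB can union Parquet files with mismatched schemas by column name, so files written before a field existed simply return NULL for it.

    import duckdb

    # Files written before and after a new field appeared scan together;
    # columns missing from older files come back as NULL.
    rel = duckdb.sql("""
        SELECT *
        FROM read_parquet('data/cpu/**/*.parquet', union_by_name = true)
    """)
    print(rel.limit(5))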

Custom partitions: currently partitioning is time-based (hour-level by default), but custom partitioning by arbitrary tags is planned. The idea is to let you group by any tag (e.g. device, region, host) in the storage path for large-scale IoT data.
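
To illustrate what that could look like (the layout below is a placeholder, not a committed design):

    from datetime import datetime, timezone

    def partition_path(measurement, tags, ts_ns, partition_tag="device"):
        """Hypothetical tag-then-hour storage path for one partition."""
        hour = datetime.fromtimestamp(ts_ns / 1e9, tz=timezone.utc)
        return (f"{measurement}/{partition_tag}={tags[partition_tag]}/"
                f"{hour:%Y/%m/%d/%H}.parquet")

    print(partition_path("cpu", {"device": "sensor-42"}, 1_700_000_000 * 10**9))
    # cpu/device=sensor-42/2023/11/14/22.parquet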

Roadmap: absolutely. Grafana data source, Prometheus remote write, retention policies, gRPC streaming, and distributed query execution are all in the works.

We’re going to start blogging about it soon, so stay tuned.

Would love any feedback on what you’d prioritize or what would make adoption easier for your use case.

replies(1): >>45518477 #
2. bormaj No.45518477
My use case isn't IoT, but about once a month I get a massive data dump from a vendor: think tens of millions of rows and 100+ columns. Cleaning, ingesting, and querying this data via a standard RDBMS is a slow and brittle process. There is a time-series aspect, but partitioning across other keys/groups is critical.