
245 points gatesn | 1 comments
ericyd ◴[] No.41840555[source]
Thank God this file format is written in Rust, otherwise I'd be extremely skeptical.
replies(3): >>41840649 #>>41841345 #>>41842509 #
neeh0 ◴[] No.41840649[source]
It gave me pause as to why Rust is part of the equation, but I concluded I'm too dumb
replies(2): >>41842842 #>>41843262 #
1. aduffy ◴[] No.41843262[source]
Buried under the memes/vibes there is an actual reason this is important for data tools.

The previous generation of analytics/"Big Data" projects (think Hadoop, Spark, Kafka, Elastic) were all built on the JVM. They were monolithic distributed-system clusters hosted on VMs or on-premise. They were servers with clients implemented in Java. It is effectively impossible to embed a Java library into anything non-Java; the best you can do is fork a JVM with a carefully maintained classpath and hit it over the network (cf. PySpark). Kafka has externally maintained bindings that lag the official JVM client.

Parquet was built during this era, so naturally its reference implementation was written in Java. For many years, the only implementation of Parquet was in Java. Even when parquet-cpp and subsequent implementations began to pop up, the Parquet Java implementation was still the best maintained. Over time, as the spec was updated and new features made their way into Parquet, different implementations ended up supporting different subsets of it. As a result, files written by parquet-cpp or parquet-rs could not be opened via Spark or Presto.

The newer generation of data analytics tooling is meant to be easily embedded, which generally means a native language that can export shared objects with a C ABI that the FFI layer of other languages can consume. That leaves you a few options, and of those Rust is arguably the best for reasons of tooling and ecosystem, though different projects make different choices. DuckDB, for example, is an extremely popular library with bindings in several languages, and it was built in C++ long after Rust came into vogue.
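
To make the "shared object with a C ABI" point concrete, here is a minimal sketch of the consuming side, using Python's built-in ctypes FFI. It is not Vortex or DuckDB code: it calls `strlen` from the system C library purely as a stand-in for any native library that exports C symbols, it assumes a POSIX system (Linux or macOS), and the `c_strlen` wrapper name is made up for illustration.

```python
import ctypes
import ctypes.util

# Assumption: POSIX system. find_library("c") locates the C standard library;
# if it returns None, CDLL(None) falls back to symbols already loaded into
# the current process, which include libc on Linux and macOS.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
# Without this, ctypes would assume an int return type.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Hypothetical wrapper: length of a byte string as computed by native code."""
    return libc.strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"vortex"))
```

No second runtime is started and nothing goes over the network; the native function runs in the same process as the interpreter. A library written in Rust or C++ that exports `extern "C"` symbols can be loaded the same way, which is what makes this style of tooling easy to embed.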

While Vortex doesn't (yet) have a C API, we do have Python bindings that we expect to be the main way people use it.