66 points chaokunyang | 4 comments

Serialization framework with some interesting numbers: 10-20x faster on nested objects than json/protobuf.

  Technical approach: compile-time codegen (no reflection), compact binary protocol with meta-packing, little-endian layout optimized for modern CPUs.

  Unique features that other fast serializers don't have:
  - Cross-language without IDL files (Rust ↔ Python/Java/Go)
  - Trait object serialization (Box<dyn Trait>)
  - Automatic circular reference handling
  - Schema evolution without coordination

  Happy to discuss design trade-offs.

  Benchmarks: https://fory.apache.org/docs/benchmarks/rust
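
  A minimal usage sketch of the derive-based flow (the derive name, registration call, and serialize/deserialize methods below are assumptions for illustration, not quoted from the docs):

    // Sketch only: names here are assumed, not the verified crate API.
    use fory::Fory;                    // assumed import path

    #[derive(Fory, Debug, PartialEq)]  // assumed derive macro
    struct Order {
        id: u64,
        note: String,
    }

    fn main() {
        let mut fory = Fory::default();
        fory.register::<Order>(100);            // assumed: register type with an id
        let order = Order { id: 1, note: "hi".into() };
        let bytes = fory.serialize(&order);     // assumed serialize API
        let back: Order = fory.deserialize(&bytes).unwrap();
        assert_eq!(order, back);
    }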
no_circuit ◴[] No.45738851[source]
Are the benchmarks actually fair? See:

https://github.com/apache/fory/blob/fd1d53bd0fbbc5e0ce6d53ef...

It seems that if the object being serialized is not a "Fory" struct, it is forced to go through a to/from conversion as part of the measured serialization work:

https://github.com/apache/fory/blob/fd1d53bd0fbbc5e0ce6d53ef...

That to/from conversion includes cloning Strings:

https://github.com/apache/fory/blob/fd1d53bd0fbbc5e0ce6d53ef...

and reallocating growing Vecs with collect:

https://github.com/apache/fory/blob/fd1d53bd0fbbc5e0ce6d53ef...

I'd think that the to/from conversion into Fory types shouldn't be part of the tests.
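
Purely for illustration (these structs and field names are invented, not the actual bench types), this is the shape of conversion work that ends up inside the timed region:

    // Hypothetical domain type and its Fory-side mirror.
    struct DomainOrder {
        id: String,
        items: Vec<String>,
    }

    struct ForyOrder {
        id: String,
        items: Vec<String>,
    }

    impl From<&DomainOrder> for ForyOrder {
        fn from(o: &DomainOrder) -> Self {
            ForyOrder {
                id: o.id.clone(),                         // String clone
                items: o.items.iter().cloned().collect(), // fresh Vec allocation
            }
        }
    }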

Also, when used in an actual system, tonic would provide an 8 KB buffer to write into, not just a Vec::default() that may need to be resized multiple times:

https://github.com/hyperium/tonic/blob/147c94cd661c0015af2e5...
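
A rough sketch of the difference, using prost's Message::encode with the message type left generic:

    use prost::Message;

    // Pre-sized buffer, as tonic provides: no mid-encode reallocation
    // for payloads under 8 KB.
    fn encode_prealloc<M: Message>(msg: &M) -> Vec<u8> {
        let mut buf = Vec::with_capacity(8 * 1024);
        msg.encode(&mut buf).expect("encoding into a Vec cannot fail");
        buf
    }

    // Empty Vec: it may grow (and copy its contents) several times
    // while the message is being written.
    fn encode_default<M: Message>(msg: &M) -> Vec<u8> {
        let mut buf = Vec::default();
        msg.encode(&mut buf).expect("encoding into a Vec cannot fail");
        buf
    }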

replies(1): >>45741094 #
1. no_circuit ◴[] No.45741094[source]
IMO, not a fair benchmark.

I can reproduce a 10x improvement on an Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz, but it drops to about 3x when I remove the to/from conversion that clones Strings and collects Vecs, and always allocate an 8K Vec instead of a Vec::default() for the writable buffer.

If anything, the benches should be rewritten in a tower-service / codec-generics style, where other formats like protobuf do not use any Fory-related code at all.
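
Roughly this shape (the BenchCodec trait is hypothetical, just to show the separation):

    use prost::Message;
    use std::marker::PhantomData;

    // Hypothetical harness: each codec serializes its own native input
    // type, so the protobuf path never imports anything Fory-related.
    trait BenchCodec {
        type Input;
        fn serialize(&self, input: &Self::Input, buf: &mut Vec<u8>);
    }

    struct ProstCodec<M>(PhantomData<M>);

    impl<M: Message> BenchCodec for ProstCodec<M> {
        type Input = M; // a prost-generated struct, used directly
        fn serialize(&self, input: &M, buf: &mut Vec<u8>) {
            input.encode(buf).expect("encoding into a Vec cannot fail");
        }
    }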

Note also that Fory has a writer pool that is utilized during the tests:

https://github.com/apache/fory/blob/fd1d53bd0fbbc5e0ce6d53ef...

Original bench selection for Fory:

    Benchmarking ecommerce_data/fory_serialize/medium: Collecting 100 samples in estimated 5.0494 s (197k it
    ecommerce_data/fory_serialize/medium
                            time:   [25.373 µs 25.605 µs 25.916 µs]
                            change: [-2.0973% -0.9263% +0.2852%] (p = 0.15 > 0.05)
                            No change in performance detected.
    Found 4 outliers among 100 measurements (4.00%)
      2 (2.00%) high mild
      2 (2.00%) high severe
Compared to the original bench for Protobuf/Prost:

    Benchmarking ecommerce_data/protobuf_serialize/medium: Collecting 100 samples in estimated 5.0419 s (20k
    ecommerce_data/protobuf_serialize/medium
                            time:   [248.85 µs 251.04 µs 253.86 µs]
    Found 18 outliers among 100 measurements (18.00%)
      8 (8.00%) high mild
      10 (10.00%) high severe
However, after allocating 8K instead of Vec::default() and removing the to/from conversion, the updated protobuf bench shows:

    fair_ecommerce_data/protobuf_serialize/medium
                            time:   [73.114 µs 73.885 µs 74.911 µs]
                            change: [-1.8410% -0.6702% +0.5190%] (p = 0.30 > 0.05)
                            No change in performance detected.
    Found 14 outliers among 100 measurements (14.00%)
      2 (2.00%) high mild
      12 (12.00%) high severe
replies(2): >>45743303 #>>45743308 #
2. chaokunyang ◴[] No.45743303[source]
The Rust benchmarks in Fory are intended more as end‑to‑end benchmarks for typical OOP‑style application scenarios, not just raw buffer write speed.

Protobuf is very much a DOP (data‑oriented programming) approach — which is great for some systems. But in many complex applications, especially those using polymorphism, teams don’t want to couple Protobuf‑generated message structs directly into their domain models. Generated types are harder to extend, and if you embed them everywhere (fields, parameters, return types), switching to another serialization framework later becomes almost impossible without touching huge parts of the codebase.

In large systems, it’s common to define independent domain model structs used throughout the codebase, and only convert to/from the Protobuf messages at the serialization boundary. That conversion step is exactly what’s represented in our benchmarks — because it’s what happens in many real deployments.

There’s also the type‑system gap: for example, if your Rust struct has a Box<dyn Trait> field, representing that cleanly in Protobuf is tricky. You might fall back to a oneof, but that essentially generates a closed enum of variants, which often isn’t what users actually want for polymorphic behavior.
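
Illustrative only (these types are invented for the example): the kind of field that has no direct Protobuf counterpart:

    // Runtime polymorphism in a domain struct. Protobuf has no direct
    // equivalent; the usual workaround is a oneof over a closed set of
    // message variants, fixed at schema-definition time.
    trait PaymentMethod {
        fn charge(&self, cents: u64) -> bool;
    }

    struct Order {
        id: u64,
        payment: Box<dyn PaymentMethod>, // any implementor can go here
    }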

So yes: we include the conversion in our measurements intentionally, to reflect real‑world practice in large systems.

replies(1): >>45749424 #
3. chaokunyang ◴[] No.45743308[source]
DOP is great for certain scenarios, but there’s always a gap between DOP and OOP. That gap is where an extra domain model and the conversion step come in — especially in systems that rely heavily on polymorphism or want to keep serialization types decoupled from core business models.
4. no_circuit ◴[] No.45749424[source]
Yes, I agree that protos usually should only be used at the serialization boundary, as well as the slightly off-topic idea that the generated code should be private to the package and/or binary.

So, to reflect real‑world practice, the benchmark code should then give the protobuf serializer a pre-allocated 8K Vec like tonic does, and not an empty one that may require multiple re-allocations?