
174 points andy99 | 14 comments
1. g-mork ◴[] No.43603642[source]
When did vulnerability reports get so vague? Looks like a classic serialization bug

https://github.com/apache/parquet-java/compare/apache-parque...

replies(3): >>43603809 #>>43604045 #>>43604276 #
2. amluto ◴[] No.43603809[source]
Better link: https://github.com/apache/parquet-java/pull/3169

If by “classic” you mean “using a language-dependent deserialization mechanism that is wildly unsafe”, I suppose. The surprising part is that Parquet is a fairly modern format with a real schema that is nominally language-independent. How on Earth did Java class names end up in the file format? Why is the parser willing to parse them at all? At most (at least by default), the parser should treat them as predefined strings that have semantics completely independent of any actual Java class.
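The hazard being described is the generic Java pattern of resolving a class name read from untrusted input. A minimal stdlib-only sketch (the metadata map and the key name are invented for illustration; they are not Parquet's actual field names):

```java
import java.util.Map;

public class UnsafeResolve {
    public static void main(String[] args) throws Exception {
        // Hypothetical key/value metadata standing in for pairs read from a file.
        // An attacker who controls the file controls this string.
        Map<String, String> metadata = Map.of("writer.model.name", "java.util.ArrayList");

        // Resolving the string loads that class (running its static initializers);
        // instantiating it runs a constructor. Any class on the classpath
        // becomes reachable from file content.
        Class<?> cls = Class.forName(metadata.get("writer.model.name"));
        Object instance = cls.getDeclaredConstructor().newInstance();
        System.out.println(instance.getClass().getName());
    }
}
```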

replies(1): >>43603943 #
3. bri3d ◴[] No.43603943[source]
This seems to come from parquet-avro, which appears to embed Avro in Parquet files and, in doing so, performs silly Java reflection gymnastics. I don’t think “normal” Parquet is affected.
replies(2): >>43604120 #>>43604161 #
4. amluto ◴[] No.43604120{3}[source]
The documentation for all of this is atrocious.

But if avro-in-parquet is a weird optional feature, it should be off by default! Parquet’s metadata is primarily in Thrift, not Avro, and it seems to me that no Avro should be involved in decoding Parquet files unless explicitly requested.

replies(1): >>43605060 #
5. tikhonj ◴[] No.43604161{3}[source]
Last time I tried to use the official Apache Parquet Java library, parsing "normal" Parquet files depended on parquet-avro because the library used Avro's GenericRecord class to represent rows from Parquet files with arbitrary schemas. So this problem would presumably affect any kind of Parquet parsing, even if there is absolutely no Avro actually involved.

(Yes, this doesn't make sense; the official Parquet Java library had some of the worst code design I've had the misfortune to depend on.)

replies(2): >>43604367 #>>43605332 #
6. hypeatei ◴[] No.43604276[source]
Tangential, but there was a recent sandbox escape vulnerability in both Chrome and Firefox.

The bug threads are still private, almost two weeks since it was disclosed and fixed. Very strange.

https://bugzilla.mozilla.org/show_bug.cgi?id=1956398

https://issues.chromium.org/issues/405143032

https://www.cve.org/CVERecord?id=CVE-2025-2783

replies(2): >>43604716 #>>43604761 #
7. twoodfin ◴[] No.43604367{4}[source]
Indeed, given the massive interest Parquet has generated over the past 5 years, and its critical role in modern data infrastructure, I’ve been disappointed every time I’ve dug into the open source ecosystem around it for one reason or another.

I think it’s revealing and unfortunate that everyone serious about Parquet, from DuckDB to Databricks, has written their own “codec”.

Some recent frustrations on this front from the DuckDB folks:

https://duckdb.org/2025/01/22/parquet-encodings.html

replies(1): >>43609018 #
8. hovav ◴[] No.43604716[source]
Standard operating procedure for both the Chrome [https://chromium.googlesource.com/chromium/src/+/HEAD/docs/s...] and Firefox [https://www.mozilla.org/en-US/about/governance/policies/secu...] bug tracking systems.

But the fix itself is public in both the Chrome [https://chromium.googlesource.com/chromium/src.git/+/36dbbf3...] and Firefox [https://github.com/mozilla/gecko-dev/commit/ac605820636c3b96...] source repos, and it makes pretty clear what the bug is.

replies(1): >>43604798 #
9. ◴[] No.43604761[source]
10. benatkin ◴[] No.43604798{3}[source]
Looks like this one only applied to Windows. Here’s a link to the diff: https://chromium.googlesource.com/chromium/src.git/+/36dbbf3...
11. bri3d ◴[] No.43605060{4}[source]
To the sibling comment’s point, I suppose it’s not weird in the Java ecosystem. The parquet-java project has a design where it deserializes Parquet fields into Java representations grabbed from _other_ projects, rather than either having some kind of canonical self-representation in memory or acting as just an abstract codec. So, one of the most common things to do is apparently to use the “Avro”-flavored serdes to get generic records in memory. (Note that the actual Avro serialization format is not involved; parquet-java just uses the classes from Avro as the in-memory representations and deserializes Parquet into them.)

The whole approach seems a bit goofy. I’d expect the library to work as some kind of abstracted codec interface (requiring the in-memory representations to host Parquet, rather than the other way around, like how pandas hosts fastparquet in Python land) or to provide a canonical object representation. Instead, it’s this in-between where it has a grab bag of converters that transform Parquet to and from random object types pulled from elsewhere in the Java ecosystem.
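The “abstract codec interface” shape described above might look like the following (an entirely hypothetical interface, not parquet-java’s actual API; the String-backed implementation exists only to show the shape):

```java
import java.nio.charset.StandardCharsets;

public class CodecSketch {
    // Hypothetical codec-style interface: the library transcodes bytes to and
    // from a caller-chosen representation T, instead of reaching into other
    // projects for its record classes.
    interface RecordCodec<T> {
        T decode(byte[] row);
        byte[] encode(T record);
    }

    public static void main(String[] args) {
        // Trivial String-backed implementation, purely illustrative.
        RecordCodec<String> codec = new RecordCodec<>() {
            public String decode(byte[] row) {
                return new String(row, StandardCharsets.UTF_8);
            }
            public byte[] encode(String record) {
                return record.getBytes(StandardCharsets.UTF_8);
            }
        };
        // The caller owns the in-memory type; the codec only round-trips bytes.
        String roundTripped = codec.decode(codec.encode("hello"));
        System.out.println(roundTripped);
    }
}
```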
replies(1): >>43605376 #
12. jeeeb ◴[] No.43605332{4}[source]
The Apache Arrow libraries are a good alternative for reading Parquet files in Java. They provide a column-oriented interface rather than the ugly Avro stuff in the Apache Parquet library.
13. amluto ◴[] No.43605376{5}[source]
I’d still like to see a clear explanation of where one can stick a Java class name in a Parquet file such that it ends up interpreted by the Avro codec. And I’m curious why it was fixed by making a list of allowed class names instead of disabling the entire mechanism.
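A class-name allowlist of the kind the fix apparently adds can be sketched in a few lines (the trusted prefixes below are assumptions for illustration; the project’s real list and property names may differ):

```java
import java.util.Set;

public class TrustedClassCheck {
    // Hypothetical trusted package prefixes; the actual fix's list may differ.
    static final Set<String> TRUSTED_PREFIXES =
            Set.of("java.lang.", "org.apache.avro.", "org.apache.parquet.avro.");

    // Reject any class name from file content that falls outside the allowlist,
    // instead of handing it straight to Class.forName.
    static boolean isTrusted(String className) {
        return TRUSTED_PREFIXES.stream().anyMatch(className::startsWith);
    }

    public static void main(String[] args) {
        System.out.println(isTrusted("org.apache.avro.generic.GenericData")); // true
        System.out.println(isTrusted("com.evil.Gadget"));                     // false
    }
}
```

An allowlist preserves existing files that legitimately reference the blessed classes, which may be why the maintainers chose it over disabling the mechanism outright; that is speculation, though, and the question of where exactly the class name lives in the file stands.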
14. dev_l1x_be ◴[] No.43609018{5}[source]
Unfortunately, many of the big-data libraries are like that, and there is no motivation to fix these things. One example is the ORC Java libraries, which had hundreds of unnecessary dependencies while at the same time importing the filesystem into the format itself.