> Any application or service using Apache Parquet Java library versions 1.15.0 or earlier is believed to be vulnerable (our own data indicates that this was introduced in version 1.8.0; however, current guidance is to review all historical versions). This includes systems that read or import Parquet files using popular big-data frameworks (e.g. Hadoop, Spark, Flink) or custom applications that incorporate the Parquet Java code. If you are unsure whether your software stack uses Parquet, check with your vendors or developers – many data analytics and storage solutions include this library.
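For anyone trying to work out whether they sit on the affected path: the report points at the Java read path, specifically the parquet-avro module, i.e. the ordinary "open a file and read records" flow. A minimal sketch of that flow, using the standard parquet-avro API (the file path is a placeholder for whatever untrusted input a service ingests):

    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    // Ordinary parquet-avro read path: opening the file parses the Avro
    // schema stored in the footer, which is where the reported issue lives.
    public class ReadUntrustedParquet {
        public static void main(String[] args) throws Exception {
            Path path = new Path("/tmp/untrusted.parquet");  // placeholder
            Configuration conf = new Configuration();
            try (ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(
                         HadoopInputFile.fromPath(path, conf)).build()) {
                GenericRecord record;
                while ((record = reader.read()) != null) {
                    System.out.println(record);
                }
            }
        }
    }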
Seems safe to assume yes: pandas is probably affected, since it uses this library.
The "fix" in question also screams "delete this crap immediately": https://github.com/wgtmac/parquet-mr/commit/d185f867c1eb968a...
> Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.
Those should be unaffected.
https://github.com/apache/parquet-java/compare/apache-parque...
Python docs > library > pickle: https://docs.python.org/3/library/pickle.html
Re: a hypothetical pickle parser protocol that doesn't eval code at parse time ("skipcode pickle protocol 6"): "AI Supply Chain Attack: How Malicious Pickle Files Backdoor Models" .. "Insecurity and Python Pickles": https://news.ycombinator.com/item?id=43426963
If by “classic” you mean “using a language-dependent deserialization mechanism that is wildly unsafe”, I suppose. The surprising part is that Parquet is a fairly modern format with a real schema that is nominally language-independent. How on Earth did Java class names end up in the file format? Why is the parser willing to parse them at all? At most (at least by default), the parser should treat them as predefined strings that have semantics completely independent of any actual Java class.
On a second read, I realized a format problem was unlikely, but the headline just said "Apache Parquet". My mind might reach the same conclusion if it said "safetensors" or "PNG".
But if avro-in-parquet is a weird optional feature, it should be off by default! Parquet’s metadata is primarily in Thrift, not Avro, and it seems to me that no Avro should be involved in decoding Parquet files unless explicitly requested.
(Yes, this doesn't make sense; the official Parquet Java library had some of the worst code design I've had the misfortune to depend on.)
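To make the avro-in-parquet complaint concrete: Avro schemas can carry arbitrary extra properties, and Avro's Java reflect support uses a "java-class" property to record the original Java type of a field. My (hedged) reading of the reports is that the vulnerable converter code looked that property up in the schema stored in the Parquet footer and loaded the named class reflectively. A tiny illustration of how a class name rides along inside otherwise language-neutral schema metadata (the record and class names below are made up):

    import org.apache.avro.Schema;

    // Shows only how a Java class name can be embedded in an Avro schema as
    // a "java-class" property; it does not exploit anything itself.
    public class JavaClassPropertyDemo {
        public static void main(String[] args) {
            String schemaJson =
                "{\"type\":\"record\",\"name\":\"Example\",\"fields\":[" +
                "{\"name\":\"payload\",\"type\":" +
                "{\"type\":\"string\",\"java-class\":\"com.example.SomeArbitraryClass\"}}" +
                "]}";
            Schema schema = new Schema.Parser().parse(schemaJson);
            // The part a file author controls: an arbitrary class name inside
            // what looks like pure schema metadata.
            System.out.println(schema.getField("payload").schema().getProp("java-class"));
        }
    }

Which is exactly the objection above: a nominally language-independent schema has a slot whose whole purpose is to name a concrete Java class.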
The bug threads are still private, almost two weeks after it was disclosed and fixed. Very strange.
https://bugzilla.mozilla.org/show_bug.cgi?id=1956398
If a browser had a vulnerability parsing HTML, that would of course be a major concern, because browsers very often parse HTML from untrusted parties.
I think it’s revealing and unfortunate that everyone serious about Parquet, from DuckDB to Databricks, has written their own “codec”.
Some recent frustrations on this front from the DuckDB folks:
But the fix itself is public in both the Chrome [https://chromium.googlesource.com/chromium/src.git/+/36dbbf3...] and Firefox [https://github.com/mozilla/gecko-dev/commit/ac605820636c3b96...] source repos, and it makes pretty clear what the bug is.
Unless you are blindly accepting Parquet-formatted files, this really doesn't seem that bad.
A vulnerability in parsing images, XML, JSON, HTML, or CSS would be way more detrimental.
I can't think of many services that accept Parquet files directly, and those that do are usually called through a backend service.
Users need to do their own assessments.
At the library level, this is a huge problem. If you're a user of the library, you'll have to decide whether your usage of it is problematic or not.
Either way, the safe solution is to just update the library. Or, based on the link shared elsewhere (https://github.com/apache/parquet-java/compare/apache-parque...) maybe avoid this library if you can, because the Java-specific code paths seem sketchy as hell to me.
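If you want to confirm which parquet-avro build actually ends up on your classpath before deciding how worried to be, a rough sketch (this relies on the jar manifest carrying an Implementation-Version, which Apache releases generally do; the advisory names 1.15.1 as the first fixed release):

    import org.apache.parquet.avro.AvroParquetReader;

    // Prints the parquet-avro version found on the classpath so you can
    // compare it against the fixed release before feeding it untrusted files.
    public class ParquetVersionCheck {
        public static void main(String[] args) {
            Package pkg = AvroParquetReader.class.getPackage();
            String version = (pkg != null) ? pkg.getImplementationVersion() : null;
            System.out.println("parquet-avro on classpath: " + version);
        }
    }

Newer releases also reportedly expose a trusted-packages allowlist (a system property along the lines of org.apache.parquet.avro.SERIALIZABLE_PACKAGES) that limits which classes the Avro path may load reflectively; treat that name as something to verify against your version's release notes rather than as gospel from a comment thread.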
Most systems do log user input though, and "proper validation" is an infamously squishy phrase that mostly acts as an excuse. The bottom line is that the natural/correct/idiomatic use of Log4j exposed the library directly to user-generated data. The similar use of Apache parquet (an obscure tool many of us are learning about for the first time) does not. That doesn't make it secure, but it makes the impact inarguably lower.
I mean, come on: the Log4j exploit was a global zero-day!
It is different from the CVSS rating.
I do agree that in most cases the deployment-specific configuration determines whether it can actually be exploited, and users or developers should analyse their own configuration.
That's my point: if you start adding constraints to a vulnerability to reduce its scope, high CVE scores don't exist.
Any vulnerability that can be characterised as "pass contents through parser, full RCE" is a 10/10 vulnerability for me. I'd rather find out my application isn't vulnerable after my vulnerability scanner reports a critical issue than let it lurk with all the other 3/10 vulnerabilities about potential NULL pointers or complexity attacks in specific method calls.
And I think that's just wildly wrong sorry. I view something exploited in the wild to compromise real systems as a higher impact than something that isn't, and want to see a "score" value that reflects that (IMHO, critical) distinction. Agree to disagree, as it were.
I don't want to make overly harsh remarks about the project, as it may simply not have been the right tool for my use case, though it sure gave me a lot of issues.
Writeup about some of the ideas that went into it:
Not in the file format.