This is not the first RCE involving YAML and it won't be the last.
YAML became popular as a response to XML, which isn't user-friendly to write. It's unfortunate that the spec got so convoluted and relies on so much implicit behavior, but I'd still rather write YAML than XML, JSON, or TOML for things like configuration files. Nowadays there might be better alternatives, but YAML is the de facto standard.
It's also unfortunate that YAML got abused by people who wanted to turn it into a DSL, so we ended up with thousands of lines of Ansible playbooks, CI workflows, and Helm charts, but here we are.
So that leaves scientific notation.
"\ud83d\udca9"
Python's "PyYAML" package will not decode this to the same result as a JSON decoder would. Rust's `serde_yaml` will fail on it.
I don't know about other parsers, but I'd be curious to find out.
The standard itself isn't well written here, IMO.
> The content of a scalar node is an opaque datum that can be presented as a series of zero or more Unicode characters.
The example here is a "quoted scalar", which can contain the escapes you see. Those escapes represent "Unicode characters", specifically,
> Escaped 16-bit Unicode character.
But "Unicode characters" is never defined by YAML.
Most implementations seem to treat them as Unicode code points, so the resulting string type is, in almost all cases, something like [UnicodeCodePoint]. In Rust, that means unpaired surrogates can't be allowed, because otherwise the value can't be converted to a Rust `String`, which is roughly speaking `[USV]`. In Python, that's workable, since that's Python's `str` datatype, but it means no surrogate decoding occurs.
The grammar further implies that it's [UnicodeCodePoint] and not [USV], and the prose never restricts unpaired surrogates. (The JSON standard strongly implies that UTF-16 decoding should happen on escaped values, though it too waffles on unpaired surrogates. Whether unpaired surrogates are accepted varies between JSON implementations.)
But compare with a JSON string: a JSON string decodes to something like a [USV], so surrogate pairs are decoded to their corresponding USV.
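To make the difference concrete, here is a rough sketch in Python using only the standard library. It doesn't call PyYAML at all; `naive_value` is my own stand-in for the "each escape is one code point" reading described above, not the output of any particular YAML parser.

```python
import json

text = r'"\ud83d\udca9"'

# JSON reading: the two escapes form a UTF-16 surrogate pair,
# which Python's json module combines into one scalar value.
json_value = json.loads(text)

# "One escape, one code point" reading (a hypothetical stand-in for
# what a code-point-based YAML parser might produce).
naive_value = chr(0xD83D) + chr(0xDCA9)

print(len(json_value), hex(ord(json_value)))  # 1 0x1f4a9
print(len(naive_value))                       # 2
print(json_value == naive_value)              # False

# A str holding lone surrogates is legal in Python but is not valid
# Unicode text, so it can't be encoded as UTF-8.
try:
    naive_value.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                              # False
```

The last check is roughly why a Rust parser has to reject the second reading: there is no `String` that could hold it.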