Breaking Git with a carriage return and cloning RCE

1. 10000truths ◴[08 Jul 25 18:59 UTC] No.44502931[source]▶

This is a big problem with using ad-hoc DSLs for config - there's often no formal specification for the grammar, and so the source of truth for parsing is spread between the home-grown serialization implementation and the home-grown deserialization implementation. If they get out of sync (e.g. someone adds new grammar to the parser but forgets to update the writer), you end up with a parser differential, and tick goes the time bomb. The lesson: have one source of truth, and generate everything that relies on it from that.

replies(3): >>44503902 #>>44504346 #>>44507893 #

2. ajross ◴[08 Jul 25 20:42 UTC] No.44503902[source]▶

>>44502931 (TP) #

Nitpick: the DSL here ("ini file format") is arguably ad-hoc, but it's extremely common and well-understood, and simple enough to make a common law specification work well enough in practice. The bug here wasn't due to the format. What you're actually complaining about is the hand-coded parser[1] sitting in the middle of a C program like a bomb waiting to go off. And, yes, that nonsense should have died decades ago.

There are places for clever hand code, even in C, even in the modern world. Data interchange is very not much not one of them. Just don't do this. If you want .ini, use toml. Use JSON if you don't. Even YAML is OK. Those with a penchant for pain like XML. And if you have convinced yourself your format must be binary (you're wrong, it doesn't), protobufs are there for you.

But absolutely, positively, never write a parser unless your job title is "programming language author". Use a library for this, even if you don't use libraries for anything else.

[1] Fine fine, lexer. We are nitpicking, after all.

replies(4): >>44504124 #>>44505455 #>>44506841 #>>44508225 #

3. heisenbit ◴[08 Jul 25 21:13 UTC] No.44504124[source]▶

>>44503902 #

How many hand crafted lexers dealing with lf vs. cr-lf encodings do exist? My guess is n > ( number of people who coded > 10 KSLOC ).

replies(1): >>44506869 #

4. fpoling ◴[08 Jul 25 21:46 UTC] No.44504346[source]▶

>>44502931 (TP) #

This bug is orthogonal to one source of truth. It is a pure logic bug that could have existed in a standard system library for config files if such existed on Unix.

And consider that consequences of such bug would be much worse if it was in a standard system library. At least here it is limited mostly to developers where machines are updated.

5. anitil ◴[09 Jul 25 01:14 UTC] No.44505455[source]▶

>>44503902 #

If you're using protobuf you can still use text as it turns out [0]

[0] https://protobuf.dev/reference/protobuf/textformat-spec/

6. hnlmorg ◴[09 Jul 25 06:22 UTC] No.44506841[source]▶

>>44503902 #

Since we are nitpicking, git is considerably older than TOML, and nearly as old as YAML and JSON. In fact JSON wasn’t even standardised until after git’s first stable release.

Back then there wasn’t a whole lot of options besides rolling their own.

And when you also consider that git had to be somewhat rushed (due to Linux kernel developers having their licenses revoked for BitKeeper) and the fact that git was originally written and maintained by Linus himself, it’s a little more understandable that he wrote his own config file parser.

Under the same circumstances, I definitely would have done that too. In fact back then, I literally did write my own config file parsers.

7. hnlmorg ◴[09 Jul 25 06:26 UTC] No.44506869{3}[source]▶

>>44504124 #

I’ve written a fair few lexers in my time. My general approach for CR is to simply ignore the character entirely.

If CR is used correctly in windows, then its behaviour is already covered by the LF case (as required for POSIX systems) and if CR is used incorrectly then you end up with all kinds of weird edge cases. So you’re much better off just jumping over that character entirely.

replies(4): >>44507067 #>>44507932 #>>44508402 #>>44512358 #

8. grodriguez100 ◴[09 Jul 25 07:03 UTC] No.44507067{4}[source]▶

>>44506869 #

Old Macs (pre-OS X I think) used CR only as line terminators.

replies(1): >>44508643 #

9. xorcist ◴[09 Jul 25 09:24 UTC] No.44507893[source]▶

>>44502931 (TP) #

The problem here isn't that the parser was updated. The parser and writer did what they did for a reason, that made sense historically but wasn't what the submodule system expected. The submodule system is a bit "tacked on" to the git design and it's not the first time that particular abstraction cracks.

Every file format is underspecified in some way when you use it on enough platforms and protocols, unless the format is actually a reference implementation, and we've had enough problems with those. There's a reason IETF usually demands two independent implementations.

Similar problems can affect case insensitive filesystems, or when moving data between different locales which affect UTF-8 normalization. It's not surprising that an almost identical CVE was just one year ago.

Be careful what you wish for. They could have used yaml instead of ini for the config format, and we would have had more security issues over the years, not less.

replies(1): >>44508216 #

10. ninjaoxygen ◴[09 Jul 25 09:34 UTC] No.44507932{4}[source]▶

>>44506869 #

Ignoring CR is often how two systems end up parsing the same file differently, one as two lines the other as a single line.

If the format is not sensitive to additional empty lines then converting them all CR to LF in-place is likely a safer approach, or a tokenizer that coalesces all sequential CR/LF characters into a single EOL token.

I write a lot of software that parses control protocols, the differences between the firmware from a single manufacturer on different devices is astonishing! I find it shocking the number that actually have no delimiters or packet length.

replies(1): >>44508608 #

11. account42 ◴[09 Jul 25 10:23 UTC] No.44508216[source]▶

>>44507893 #

No, the writer encodes values in a way that they will be read back as different values. This underlying issue is absolutely an encoder/decoder mismatch and has nothing to do with submodules.

12. account42 ◴[09 Jul 25 10:25 UTC] No.44508225[source]▶

>>44503902 #

The parser is fine, the bug is in the encoder which writes a value ending in \r without quotes even though \r is a special character at the end of a line.

13. layer8 ◴[09 Jul 25 10:51 UTC] No.44508402{4}[source]▶

>>44506869 #

More generally, any textual file format where whitespace is significant at the end of a line is calling for trouble.

replies(1): >>44508631 #

14. hnlmorg ◴[09 Jul 25 11:18 UTC] No.44508608{5}[source]▶

>>44507932 #

Why would ignoring CR lead to problems? It has nothing to do with line feeding on any system released in the last quarter of a century.

If you’re targeting iMacs or the Commodore 64, then sure, it’s something to be mindful of. But I’d wager you’d have bigger compatibility problems before you even get to line endings.

Is there some other edge cases regarding CR that I’ve missed? Or are you thinking ultra defensively (from a security standpoint)?

That said, I do like your suggestion of treating CR like LF where the schema isn’t sensitive to line numbering. Unfortunately for my use case, line numbering does matter somewhat. So would be good to understand if I have a ticking time bomb

15. hnlmorg ◴[09 Jul 25 11:20 UTC] No.44508631{5}[source]▶

>>44508402 #

Maybe. But expecting people to remember a ; (or similar) at the end of lines is going to cause more frequent problems from a UX performance.

So you’re better off accepting the edge cases problems that white space introduces considering the benefits outweighs the pain.

replies(1): >>44508764 #

16. hnlmorg ◴[09 Jul 25 11:22 UTC] No.44508643{5}[source]▶

>>44507067 #

Yes, you’re right. I completely forgot about them.

I can’t imagine anyone targeting macOS 9 (or earlier) systems these days but you’re right that it’s an edge case people should be aware of.

17. layer8 ◴[09 Jul 25 11:39 UTC] No.44508764{6}[source]▶

>>44508631 #

That’s not what I meant. It’s okay for the line break itself to be significant. But whitespace immediately preceding the line break shouldn’t be significant, due to its general invisibility.

replies(1): >>44508819 #

18. hnlmorg ◴[09 Jul 25 11:46 UTC] No.44508819{7}[source]▶

>>44508764 #

Is CR considered whitespace? I always thought that was classed as a non-printable control character. But maybe I’m wrong?

Or are you talking about SP preceding CR and/or LF?

replies(1): >>44509582 #

19. layer8 ◴[09 Jul 25 13:06 UTC] No.44509582{8}[source]▶

>>44508819 #

Line breaks are considered whitespace, hence CR is considered whitespace. It is also a control character. This is similar to TAB, or indeed LF.

See here for example: https://en.cppreference.com/w/c/string/byte/isspace

Or here for Unicode: https://en.wikipedia.org/wiki/Whitespace_character#Unicode

20. ummonk ◴[09 Jul 25 16:59 UTC] No.44512358{4}[source]▶

>>44506869 #

I guess at this point it's okay to deprecate Mac OS 9 files?