A surprising enum size optimization in the Rust compiler

(jpfennell.com)

mmastrac ◴[10 Apr 25 16:24 UTC] No.43645485[source]▶

>>43616649 (OP) #

This is a great way to see why invalid UTF-8 strings and unicode chars cause undefined behaviour in Rust. `char` is a special integer type, known to have a valid range which is a sub-range of its storage type. Outside of dataless enums, this is the only datatype with this behaviour (EDIT: I neglected NonZero<...>/NonZeroXXX and some other zero-niche types).

If you manage to construct an invalid char from an invalid string or any other way, you can defeat the niche optimization code and accidentally create yourself an unsound transmute, which is game over for soundness.
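A minimal sketch of the point above, using only safe code: `char::from_u32` is the checked way to build a `char`, and it rejects exactly the bit patterns the compiler is free to reuse as a niche. The helper names `checked_char` and `option_char_uses_niche` are made up for illustration, and the size comparison reflects what current rustc does rather than a documented layout guarantee.

```rust
use std::mem::size_of;

/// Checked conversion from a raw `u32`.
/// Returns `None` for surrogates (0xD800..=0xDFFF) and for values above 0x10FFFF.
fn checked_char(raw: u32) -> Option<char> {
    char::from_u32(raw)
}

/// Reports whether `Option<char>` occupies the same space as `char`,
/// i.e. whether `None` is encoded in one of the invalid `char` bit patterns.
fn option_char_uses_niche() -> bool {
    size_of::<Option<char>>() == size_of::<char>()
}
```

An invalid `char` can only come from `unsafe` code (for example an unchecked conversion or a transmute), which is where the soundness obligation moves to the programmer.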

replies(5): >>43645776 #>>43645961 #>>43646463 #>>43646643 #>>43651356 #

1. NoTeslaThrow ◴[10 Apr 25 16:50 UTC] No.43645776[source]▶

>>43645485 #

> This is a great way to see why invalid UTF-8 strings and unicode chars cause undefined behaviour in Rust.

What does "undefined behavior" mean without a spec? Wouldn't the behavior rustc produces today be de-facto defined behavior? It seems like the contention is violating some transmute constraint, but does this not result in reproducible runtime behavior? In what context are you framing "soundness"?

EDIT: I'm honestly befuddled why anyone would downvote this. I certainly don't think this is detracting from the conversation at all—how can you understand the semantics of the above comment without understanding what the intended meaning of "undefined behavior" or "soundness" is?

replies(5): >>43645920 #>>43645923 #>>43646838 #>>43647769 #>>43648876 #

2. mmastrac ◴[10 Apr 25 17:03 UTC] No.43645920[source]▶

>>43645776 (TP) #

> What does "undefined behavior" mean without a spec?

While not as formalized as C/C++, Rust's "spec" exists in the reference, nomicon, RFCs and documentation. I believe that there is a desire for a spec, but enough resources exist that the community can continue without one with no major negative side-effects (unless you want to re-implement the compiler from scratch, I suppose).

The compiler may exploit "lack of UB" for optimizations, e.g., using a known-invalid value as a niche, optimizing away safety checks, etc.

> Wouldn't the behavior rustc produces today be de-facto defined behavior?

Absolutely not. Bugs are fixed and the behaviour changes. Not often, but it happens.

This post probably answers a lot of your reply as well: https://jacko.io/safety_and_soundness.html

replies(1): >>43645935 #

3. newpavlov ◴[10 Apr 25 17:03 UTC] No.43645923[source]▶

>>43645776 (TP) #

"Undefined behavior" means that the compiler can apply optimizations which assume that it does not happen, resulting in incorrect code. A simpler example is `Option<NonZeroU8>`, the compiler assumes that `NonZeroU8` can never contain 0, thus it can use 0 as value for `None`. Now, if you take a reference to the inner `NonZeroU8` stored in `Some` and write 0 to it, you changed `Some` to `None`, while other optimizations may rely on the assumption that references to the content of `Some` can not flip the enum variant to `None`.
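As a rough illustration of the `Option<NonZeroU8>` case, the sketch below stays entirely in safe code: `NonZeroU8::new` refuses to build a zero value, and the `Option` wrapper costs no extra byte because 0 is free to stand for `None`. The function names `parse_nonzero` and `option_nonzero_size` are hypothetical helpers, not part of the standard library.

```rust
use std::mem::size_of;
use std::num::NonZeroU8;

/// Safe constructor: yields `None` for 0 and `Some` for every other byte.
fn parse_nonzero(raw: u8) -> Option<NonZeroU8> {
    NonZeroU8::new(raw)
}

/// Size in bytes of `Option<NonZeroU8>`; the standard library documents
/// that it has the same size as `u8`.
fn option_nonzero_size() -> usize {
    size_of::<Option<NonZeroU8>>()
}
```

Writing 0 into an existing `NonZeroU8` is only possible through `unsafe` operations, which is the step the comment describes as flipping `Some` to `None` behind the optimizer's back.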

You don't need a full language spec to declare something UB. And, arguably, from the compiler correctness perspective, there is no fundamental difference between walls of prose in the C/C++ "spec" and the "informal spec" currently used by Rust. (Well, there is the CompCert exception, but it's quite far from the mainstream compilers in many regards)

replies(1): >>43645946 #

4. NoTeslaThrow ◴[10 Apr 25 17:04 UTC] No.43645935[source]▶

>>43645920 #

EDIT:

> While not as formalized as C/C++, Rust's "spec" exists in the reference, nomicon, RFCs and documentation. I believe that there is a desire for a spec, but enough resources exist that the community can continue without one with no major negative side-effects (unless you want to re-implement the compiler from scratch, I suppose).

Thank you, I was unaware that this is a thing.

> This post probably answers a lot of your reply as well: https://jacko.io/safety_and_soundness.html

This appears to also rely on "undefined behavior" as a meaningful term.

replies(1): >>43646003 #

5. NoTeslaThrow ◴[10 Apr 25 17:05 UTC] No.43645946[source]▶

>>43645923 #

> resulting in incorrect code.

Incorrect with respect to an assumption furnished where? Your sibling comment mentions RFCs—is this behavior tied to some kind of documented expectation?

> A simpler example is `Option<NonZeroU8>`, the compiler assumes that `NonZeroU8` can never contain 0, thus it can use 0 as value for `None`. Now, if you take a reference to the inner `NonZeroU8` stored in `Some` and write 0 to it, you changed `Some` to `None`, while other optimizations may rely on the assumption that references to the content of `Some` can not flip the enum variant to `None`.

That seems to be the intended behavior, unless I'm reading incorrectly. Why else would you write a 0 to it? Also, does this not require using the `unsafe` keyword? So is tricking the compiler into producing the behavior you described not the expected and intended behavior?

replies(3): >>43646036 #>>43646048 #>>43649155 #

6. mmastrac ◴[10 Apr 25 17:11 UTC] No.43646003{3}[source]▶

>>43645935 #

> This appears to also rely on "undefined behavior" as a meaningful term.

I assure you it is a meaningful term:

https://llvm.org/docs/UndefinedBehavior.html

replies(1): >>43646034 #

7. NoTeslaThrow ◴[10 Apr 25 17:13 UTC] No.43646034{4}[source]▶

>>43646003 #

Ok, but in the context of the language at hand? Presumably the IR has distinct semantics from the language that generates the IR. Does UB just strictly resolve to LLVM UB? That's very reasonable!

replies(2): >>43646173 #>>43646377 #

8. newpavlov ◴[10 Apr 25 17:14 UTC] No.43646036{3}[source]▶

>>43645946 #

>Incorrect with respect to an assumption furnished where?

In the definition of the `NonZeroU8` type. Or, in more practical terms, in LLVM: when we generate LLVM IR we communicate this property to LLVM, and it in turn uses it to apply optimizations to our code.

>Also, does this not require using the `unsafe` keyword?

Yes, it requires `unsafe` and the point is that writing 0 to `NonZeroU8` is UB since it breaks the locality principle critical for correctness of optimizations. Applying just one incorrect (because of the broken assumption) optimization together with numerous other (correct) optimizations can easily lead to very surprising results, which are practically impossible to predict and debug. This is why it's considered such anathema to have UB in code, since having UB in one place may completely break code somewhere far away.

9. fc417fc802 ◴[10 Apr 25 17:16 UTC] No.43646048{3}[source]▶

>>43645946 #

It's not intended in that the compiler may have optimized based on the assumption that you have now gone and violated via an unsafe block. Just as in C that would produce undefined behavior in that there's no telling what consequences the optimization might have. The lack of a formal specification isn't relevant.

10. fc417fc802 ◴[10 Apr 25 17:28 UTC] No.43646173{5}[source]▶

>>43646034 #

No. UB is a term of art here.

Consider a hypothetical non-LLVM full reimplementation of the compiler. If it optimizes and there are invalid assumptions then there is likely UB. LLVM isn't involved in that case though.

replies(1): >>43646570 #

11. vollbrecht ◴[10 Apr 25 17:49 UTC] No.43646377{5}[source]▶

>>43646034 #

You can find a general overview for the language at hand in "The Rust Reference"[1]. For a more formal document, have a look at the undefined behaviour section of the Ferrocene Language Specification[2]. From there you can jump to the other sections and see the legality rules and undefined behavior subsections for each.

The Ferrocene Language Specification was recently donated to the Rust Foundation.

[1] https://doc.rust-lang.org/reference/behavior-considered-unde...
[2] https://spec.ferrocene.dev/undefined-behavior.html

12. NoTeslaThrow ◴[10 Apr 25 18:08 UTC] No.43646570{6}[source]▶

>>43646173 #

> If it optimizes and there are invalid assumptions then there is likely UB.

It's the distinguishing from bugs that concerns me.

replies(3): >>43646620 #>>43646899 #>>43649091 #

13. fc417fc802 ◴[10 Apr 25 18:13 UTC] No.43646620{7}[source]▶

>>43646570 #

I don't follow. Isn't UB a subset of bugs or, alternatively, a follow-on consequence that causes observable behavior to deviate further?

replies(1): >>43647058 #

14. ◴[10 Apr 25 18:35 UTC] No.43646838[source]▶

>>43645776 (TP) #

15. vlovich123 ◴[10 Apr 25 18:40 UTC] No.43646899{7}[source]▶

>>43646570 #

It is a bug - you’ve violated the contract between the language and the compiler.

Just like a segfault or a logic bug, it’s a class of bugs. What is special, though, is that with most bugs you just hit an invalid state. With UB you can end up executing code that never existed or not executing code that does exist. Or any number of other things can happen, because the compiler applied an optimization assuming that a runtime state you promised would never occur, and it did occur.

It’s slightly different from being a strict subset because UB is actually exploited to perform optimizations: UB is not allowed, so the compiler is taught to exploit that to emit more efficient code, and the language allows for it (e.g. the niche optimization the blog describes).
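For readers who want to poke at the niche optimization mentioned here, the following is a small sketch with hypothetical types `Inner` and `Outer`, loosely modelled on the kind of nested enum the linked post discusses. Default enum layout is not specified, so the code only reports sizes and makes no promise about the exact numbers on any given compiler version.

```rust
use std::mem::size_of;

// Hypothetical inner enum: a discriminant plus a 4-byte payload.
#[allow(dead_code)]
enum Inner {
    A(u32),
    B(u32),
}

// Hypothetical outer enum wrapping `Inner`. The compiler may store this
// discriminant in the unused discriminant values of `Inner`.
#[allow(dead_code)]
enum Outer {
    C(u32),
    D(Inner),
}

/// Returns `(size_of::<Inner>(), size_of::<Outer>())` for the current compiler.
fn enum_sizes() -> (usize, usize) {
    (size_of::<Inner>(), size_of::<Outer>())
}
```

If the two numbers come out equal, the outer discriminant is being packed into values that `Inner` can never legally hold, which is exactly the kind of assumption that invalid values would break.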

16. NoTeslaThrow ◴[10 Apr 25 19:01 UTC] No.43647058{8}[source]▶

>>43646620 #

> Isn't UB a subset of bugs

No, not at all. UB can still produce correct and expected results for the entire input domain.

replies(3): >>43647136 #>>43648979 #>>43649194 #

17. fc417fc802 ◴[10 Apr 25 19:11 UTC] No.43647136{9}[source]▶

>>43647058 #

If I have a bug that only triggers between 9 and 10 am EST on Mondays that is still a bug, no? Now extend that to "rand(1.0) < 0.01". Now extend that to a check using __TIME__ that goes off at compile time instead of runtime (some binaries are buggy, some aren't). Now extend that to UB.

18. duckerude ◴[10 Apr 25 20:31 UTC] No.43647769[source]▶

>>43645776 (TP) #

It means that anything strange that happens next isn't a language bug.

Whether something is a bug or not is sometimes hard to pin down because there's no formal spec. Most of the time it's pretty clear though. Most software doesn't have a formal spec and manages to categorize bugs anyway.

replies(1): >>43649494 #

19. ben0x539 ◴[10 Apr 25 23:21 UTC] No.43648876[source]▶

>>43645776 (TP) #

> I'm honestly befuddled why anyone would downvote this.

I think there are two parts to this. First, there's a bit of a history of people making disingenuous jabs at Rust for not having an "ISO C++" style spec. Typically people would try to suggest that Rust can't be ready for production or shouldn't receive support in other ecosystems without being certified by some manner of international committee. Second, Rust by now has an extensive tradition of people discussing memory safety invariants, what soundness means, formal models of what is a valid memory access, desirable optimizations, etc, etc, so your question of what undefined behavior means could be taken to be, like, polemically reductive or dismissive.

In context I don't think it's what you're doing, but I would also not be surprised if a lot of people reading Rust-related HN discussions are just super tired of anything that even slightly looks like an effort to re-litigate undefined behavior from first principles, because it tends to derail more specific discussions.

replies(2): >>43649500 #>>43652296 #

20. ◴[10 Apr 25 23:41 UTC] No.43648979{9}[source]▶

>>43647058 #

21. lmm ◴[10 Apr 25 23:59 UTC] No.43649091{7}[source]▶

>>43646570 #

Anything a compiler does with code which is UB is not a bug in the compiler. That's pretty much the definition of UB.

22. lmm ◴[11 Apr 25 00:08 UTC] No.43649155{3}[source]▶

>>43645946 #

> is tricking the compiler into producing the behavior you described not the expected and intended behavior?

It might be what that programmer intended and expected, but they should not expect it. E.g. the current compiler might check for 0, and a future more optimized compiler might optimize out that check (because it knows the Option is not None) and then e.g. perform an out-of-bounds array access (if you were using that NonZeroU8 as an index into some kind of 1-based array).
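The 1-based array scenario can be sketched in safe code as follows. `OneBased` is a hypothetical wrapper invented for this example; the point is that the `- 1` relies on the `NonZeroU8` invariant, while the slice lookup stays bounds-checked.

```rust
use std::num::NonZeroU8;

/// Hypothetical 1-based view over a slice.
struct OneBased<'a> {
    items: &'a [u32],
}

impl<'a> OneBased<'a> {
    /// Looks up the element at a 1-based position.
    fn get(&self, index: NonZeroU8) -> Option<u32> {
        // The type guarantees `index >= 1`, so this subtraction cannot underflow.
        let zero_based = usize::from(index.get()) - 1;
        // `slice::get` still checks the upper bound and returns `None` past the end.
        self.items.get(zero_based).copied()
    }
}
```

A version that skipped the bounds check on the strength of the non-zero guarantee would be the case described above: correct for every valid `NonZeroU8`, and an out-of-bounds access for a forged zero.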

23. pharrington ◴[11 Apr 25 00:17 UTC] No.43649194{9}[source]▶

>>43647058 #

"can" is extremely different than "will"!

24. NoTeslaThrow ◴[11 Apr 25 01:15 UTC] No.43649494[source]▶

>>43647769 #

> It means that anything strange that happens next isn't a language bug.

This is even more vague. The language is getting blamed regardless. This makes no sense.

replies(1): >>43651143 #

25. NoTeslaThrow ◴[11 Apr 25 01:16 UTC] No.43649500[source]▶

>>43648876 #

Tbh, I just really hate the term "undefined behavior". It really feels like laziness in terms of what the possible damage might entail.

replies(3): >>43650357 #>>43651484 #>>43661990 #

26. arlort ◴[11 Apr 25 03:52 UTC] No.43650357{3}[source]▶

>>43649500 #

It is a term of art in compilers/language design though, isn't it?

If you break an invariant the compiler is relying on for optimization then you can't say for sure what the effect after all optimisation passes or in future versions of the compiler will be. It's just "undefined"

27. dwattttt ◴[11 Apr 25 06:47 UTC] No.43651143{3}[source]▶

>>43649494 #

No: the language defined that e.g. a NonZeroU8 can't contain 0, and the only way it could is via illegal means. You don't need a formal proof to describe that.

To try to characterise what any compiler, hypothetical or not, does if you nonetheless produce one (again, via means that aren't valid) isn't meaningful.

28. imtringued ◴[11 Apr 25 08:03 UTC] No.43651484{3}[source]▶

>>43649500 #

Yeah I personally think the problem isn't undefined behavior itself, but the C development culture where undefined behavior is sprinkled all over the language to the point where it has become unavoidable plus the inevitable assignment of blame onto C developers, because everyone knows there is enough time in the day for fuzzing your entire code base.

29. zozbot234 ◴[11 Apr 25 10:15 UTC] No.43652296[source]▶

>>43648876 #

> Second, Rust by now has an extensive tradition of people discussing memory safety invariants, what soundness means, formal models of what is a valid memory access

Rust is still lacking a definitive formal model of "soundness" in unsafe code. I'm not sure why you're suggesting that this is not a valid criticism or remark; it's just a fact.

replies(1): >>43659145 #

30. ben0x539 ◴[11 Apr 25 22:02 UTC] No.43659145{3}[source]▶

>>43652296 #

Showing up out of nowhere pretending like they haven't even thought about what it means isn't helpful though.

31. Dylan16807 ◴[12 Apr 25 06:53 UTC] No.43661990{3}[source]▶

>>43649500 #

In a situation like this, causing UB is basically saying you deliberately corrupted your memory.

How are you supposed to be specific about what the possible damage might entail for corrupted memory? If you have a function with an "if" or a "while" or a "switch" in it, and you break the variable being evaluated, you might cause the program to skip over the choices and run whatever happens to be next in memory. What's the non-lazy listing of possible outcomes at that point?
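One way to picture the "switch" case: a `match` over an enum has no fallback arm because the compiler assumes only declared variants exist. The sketch below uses a hypothetical `Command` enum and validates a raw byte with `TryFrom` before it ever becomes a `Command`, which is the safe alternative to transmuting an unchecked value.

```rust
use std::convert::TryFrom;

/// Hypothetical command enum with an explicit one-byte representation.
#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(u8)]
enum Command {
    Start = 0,
    Stop = 1,
    Reset = 2,
}

impl TryFrom<u8> for Command {
    type Error = u8;

    /// Accepts only the declared discriminants; any other byte is handed back as the error.
    fn try_from(raw: u8) -> Result<Self, Self::Error> {
        match raw {
            0 => Ok(Command::Start),
            1 => Ok(Command::Stop),
            2 => Ok(Command::Reset),
            other => Err(other),
        }
    }
}

/// Exhaustive match with no catch-all arm: every valid `Command` is covered.
fn describe(cmd: Command) -> &'static str {
    match cmd {
        Command::Start => "start",
        Command::Stop => "stop",
        Command::Reset => "reset",
    }
}
```

If a `Command` holding, say, the byte 3 were conjured through `unsafe` code, nothing in `describe` says what happens next, and that gap is what the term "undefined" is naming.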
