
146 points returningfory2 | 8 comments
mmastrac ◴[] No.43645485[source]
This is a great way to see why invalid UTF-8 strings and unicode chars cause undefined behaviour in Rust. `char` is a special integer type, known to have a valid range which is a sub-range of its storage type. Outside of dataless enums, this is the only datatype with this behaviour (EDIT: I neglected NonZero<...>/NonZeroXXX and some other zero-niche types).

If you manage to construct an invalid char from an invalid string or any other way, you can defeat the niche optimization code and accidentally create yourself an unsound transmute, which is game over for soundness.
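
A minimal sketch of the point above, not taken from the linked post: the helper names (`char_sizes`, `checked_char`, `checked_str`) are made up for illustration, and the `Option<char>` size is what current rustc produces rather than something I'd call a documented guarantee. The checked constructors are what keep invalid values from ever existing in safe code.

```rust
use std::mem::size_of;

/// Sizes of `char` and `Option<char>` in bytes (hypothetical helper).
/// On current rustc both are 4: `None` is stored in a bit pattern
/// that is not a valid `char`, so no separate tag is needed.
fn char_sizes() -> (usize, usize) {
    (size_of::<char>(), size_of::<Option<char>>())
}

/// Checked construction: returns `None` for surrogates (0xD800..=0xDFFF)
/// and for anything above 0x10FFFF.
fn checked_char(code: u32) -> Option<char> {
    char::from_u32(code)
}

/// Checked UTF-8 validation: returns `None` for byte sequences that are
/// not valid UTF-8 instead of producing an invalid `&str`.
fn checked_str(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    let (plain, optional) = char_sizes();
    println!("char: {} bytes, Option<char>: {} bytes", plain, optional);

    println!("{:?}", checked_char(0x41)); // a valid scalar value
    println!("{:?}", checked_char(0xD800)); // a surrogate, rejected
    println!("{:?}", checked_char(0x11_0000)); // past the Unicode range, rejected

    println!("{:?}", checked_str(b"ok"));
    println!("{:?}", checked_str(&[0xFF])); // 0xFF never appears in valid UTF-8
}
```

The unsafe counterparts (`char::from_u32_unchecked`, `std::str::from_utf8_unchecked`) skip these checks, which is why handing them an invalid value is UB rather than an error.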

replies(5): >>43645776 #>>43645961 #>>43646463 #>>43646643 #>>43651356 #
NoTeslaThrow ◴[] No.43645776[source]
> This is a great way to see why invalid UTF-8 strings and unicode chars cause undefined behaviour in Rust.

What does "undefined behavior" mean without a spec? Wouldn't the behavior rustc produces today be de-facto defined behavior? It seems like the contention is violating some transmute constraint, but does this not result in reproducible runtime behavior? In what context are you framing "soundness"?

EDIT: I'm honestly befuddled why anyone would downvote this. I certainly don't think this is detracting from the conversation at all—how can you understand the semantics of the above comment without understanding what the intended meaning of "undefined behavior" or "soundness" is?

replies(5): >>43645920 #>>43645923 #>>43646838 #>>43647769 #>>43648876 #
mmastrac ◴[] No.43645920[source]
> What does "undefined behavior" mean without a spec?

While not as formalized as C/C++, Rust's "spec" exists in the reference, nomicon, RFCs and documentation. I believe that there is a desire for a spec, but enough resources exist that the community can continue without one with no major negative side-effects (unless you want to re-implement the compiler from scratch, I suppose).

The compiler may exploit "lack of UB" for optimizations, e.g., using a known-invalid value as a niche, optimizing away safety checks, etc.

> Wouldn't the behavior rustc produces today be de-facto defined behavior?

Absolutely not. Bugs are fixed and the behaviour changes. Not often, but it happens.

This post probably answers a lot of your reply as well: https://jacko.io/safety_and_soundness.html
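
To make the "known-invalid value as a niche" part concrete, here is a rough sketch (the `niche_sizes` helper is hypothetical, and the `Option<u32>` size is simply what common targets produce today). Types with a forbidden bit pattern let `Option` reuse that pattern for `None`; a plain `u32` has no spare pattern, so it needs extra space for a tag.

```rust
use std::mem::size_of;
use std::num::NonZeroU32;

/// Pairs of (size of T, size of Option<T>) in bytes for three types
/// (hypothetical helper for illustration).
fn niche_sizes() -> [(usize, usize); 3] {
    [
        // Zero is forbidden, so `None` is encoded as 0.
        (size_of::<NonZeroU32>(), size_of::<Option<NonZeroU32>>()),
        // References are never null, so `None` is encoded as null.
        (size_of::<&u8>(), size_of::<Option<&u8>>()),
        // Every bit pattern of `u32` is valid, so a separate tag is needed.
        (size_of::<u32>(), size_of::<Option<u32>>()),
    ]
}

fn main() {
    for (plain, optional) in niche_sizes() {
        println!("T: {} bytes, Option<T>: {} bytes", plain, optional);
    }
}
```

The layout only works if the forbidden pattern really never occurs, which is why producing one through unsafe code is treated as UB.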

replies(1): >>43645935 #
NoTeslaThrow ◴[] No.43645935[source]
EDIT:

> While not as formalized as C/C++, Rust's "spec" exists in the reference, nomicon, RFCs and documentation. I believe that there is a desire for a spec, but enough resources exist that the community can continue without one with no major negative side-effects (unless you want to re-implement the compiler from scratch, I suppose).

Thank you, I was unaware that this is a thing.

> This post probably answers a lot of your reply as well: https://jacko.io/safety_and_soundness.html

This appears to also rely on "undefined behavior" as a meaningful term.

replies(1): >>43646003 #
mmastrac ◴[] No.43646003[source]
> This appears to also rely on "undefined behavior" as a meaningful term.

I assure you it is a meaningful term:

https://llvm.org/docs/UndefinedBehavior.html

replies(1): >>43646034 #
NoTeslaThrow ◴[] No.43646034[source]
Ok, but in the context of the language at hand? Presumably the IR has distinct semantics from the language that generates the IR. Does UB just strictly resolve to LLVM UB? That's very reasonable!
replies(2): >>43646173 #>>43646377 #
fc417fc802 ◴[] No.43646173[source]
No. UB is a term of art here.

Consider a hypothetical non-LLVM full reimplementation of the compiler. If it optimizes and there are invalid assumptions then there is likely UB. LLVM isn't involved in that case though.

replies(1): >>43646570 #
1. NoTeslaThrow ◴[] No.43646570{3}[source]
> If it optimizes and there are invalid assumptions then there is likely UB.

It's distinguishing UB from bugs that concerns me.

replies(3): >>43646620 #>>43646899 #>>43649091 #
2. fc417fc802 ◴[] No.43646620[source]
I don't follow. Isn't UB a subset of bugs, or alternatively a follow-on consequence that causes observable behavior to deviate further?
replies(1): >>43647058 #
3. vlovich123 ◴[] No.43646899[source]
It is a bug - you’ve violated the contract between the language and the compiler.

Just like a segfault or a logic bug, it’s a class of bugs. What makes it special, though, is that with most bugs you just hit an invalid state. With UB you can end up executing code that never existed or not executing code that does exist. Or any number of other things can happen, because the compiler applied an optimization assuming a runtime state that you promised would never occur, but it did.

It’s slightly different from being a strict subset because UB is actually exploited to perform optimizations: UB is not allowed, so a compiler that is taught to exploit that can emit more efficient code, and the language allows for it (e.g. the niche optimization the blog describes).

4. NoTeslaThrow ◴[] No.43647058[source]
> Isn't UB a subset of bugs

No, not at all. UB can still produce correct and expected results for the entire input domain.

replies(3): >>43647136 #>>43648979 #>>43649194 #
5. fc417fc802 ◴[] No.43647136{3}[source]
If I have a bug that only triggers between 9 and 10 am EST on Mondays that is still a bug, no? Now extend that to "rand(1.0) < 0.01". Now extend that to a check using __TIME__ that goes off at compile time instead of runtime (some binaries are buggy, some aren't). Now extend that to UB.
6. ◴[] No.43648979{3}[source]
7. lmm ◴[] No.43649091[source]
Anything a compiler does with code which is UB is not a bug in the compiler. That's pretty much the definition of UB.
8. pharrington ◴[] No.43649194{3}[source]
"can" is extremely different than "will"!