
146 points returningfory2 | 1 comments
mmastrac ◴[] No.43645485[source]
This is a great way to see why invalid UTF-8 strings and unicode chars cause undefined behaviour in Rust. `char` is a special integer type, known to have a valid range which is a sub-range of its storage type. Outside of dataless enums, this is the only datatype with this behaviour (EDIT: I neglected NonZero<...>/NonZeroXXX and some other zero-niche types).

If you manage to construct an invalid char from an invalid string or any other way, you can defeat the niche optimization code and accidentally create yourself an unsound transmute, which is game over for soundness.
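A minimal sketch of the niche described above, using only safe standard-library calls. The helper name `checked_char` is made up for illustration, and the `Option<char>` size check reflects what current rustc does rather than something quoted from this thread.

```rust
use std::mem::size_of;

/// Hypothetical helper: accept a raw u32 only if it is a valid Unicode
/// scalar value (0..=0xD7FF or 0xE000..=0x10FFFF).
fn checked_char(raw: u32) -> Option<char> {
    char::from_u32(raw)
}

fn main() {
    // `char` is stored in 4 bytes, but most 32-bit patterns are not valid chars.
    assert_eq!(size_of::<char>(), 4);

    // The compiler can use one of the invalid patterns to encode `None`,
    // so wrapping a char in Option costs no extra space.
    assert_eq!(size_of::<Option<char>>(), size_of::<char>());

    // The checked constructor refuses surrogates and out-of-range values.
    assert_eq!(checked_char(0x41), Some('A'));
    assert_eq!(checked_char(0xD800), None);
    assert_eq!(checked_char(0x11_0000), None);

    // Likewise, the checked string constructor refuses invalid UTF-8.
    assert!(std::str::from_utf8(&[0xFF, 0xFE]).is_err());

    println!("all checks passed");
}
```

The unchecked counterparts (`char::from_u32_unchecked`, `str::from_utf8_unchecked`) skip these checks, which is why they are `unsafe`.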

NoTeslaThrow ◴[] No.43645776[source]
> This is a great way to see why invalid UTF-8 strings and unicode chars cause undefined behaviour in Rust.

What does "undefined behavior" mean without a spec? Wouldn't the behavior rustc produces today be de-facto defined behavior? It seems like the contention is violating some transmute constraint, but does this not result in reproducible runtime behavior? In what context are you framing "soundness"?

EDIT: I'm honestly befuddled why anyone would downvote this. I certainly don't think this is detracting from the conversation at all—how can you understand the semantics of the above comment without understanding what the intended meaning of "undefined behavior" or "soundness" is?

newpavlov ◴[] No.43645923[source]
"Undefined behavior" means that the compiler can apply optimizations which assume that it does not happen, resulting in incorrect code. A simpler example is `Option<NonZeroU8>`, the compiler assumes that `NonZeroU8` can never contain 0, thus it can use 0 as value for `None`. Now, if you take a reference to the inner `NonZeroU8` stored in `Some` and write 0 to it, you changed `Some` to `None`, while other optimizations may rely on the assumption that references to the content of `Some` can not flip the enum variant to `None`.

You don't need a full language spec to declare something UB. And, arguably, from the compiler correctness perspective, there is no fundamental difference between walls of prose in the C/C++ "spec" and the "informal spec" currently used by Rust. (Well, there is the CompCert exception, but it's quite far from the mainstream compilers in many regards)
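The `Option<NonZeroU8>` example above can be observed without triggering any UB. This is a rough sketch, not code from the thread: `raw_byte` is a hypothetical helper, and it only reads the representation (going from `Option<NonZeroU8>` to `u8`), which is fine because every bit pattern is a valid `u8`.

```rust
use std::mem::{size_of, transmute};
use std::num::NonZeroU8;

/// Hypothetical helper: expose the single byte that backs an Option<NonZeroU8>.
fn raw_byte(v: Option<NonZeroU8>) -> u8 {
    // SAFETY: both types are one byte wide and any bit pattern is a valid u8.
    unsafe { transmute::<Option<NonZeroU8>, u8>(v) }
}

fn main() {
    // No separate discriminant is stored: the Option fits in one byte.
    assert_eq!(size_of::<Option<NonZeroU8>>(), 1);

    // `None` occupies the one value that NonZeroU8 can never hold.
    assert_eq!(raw_byte(None), 0);
    assert_eq!(raw_byte(NonZeroU8::new(7)), 7);

    // The checked constructor enforces the invariant up front.
    assert!(NonZeroU8::new(0).is_none());

    println!("all checks passed");
}
```

Since 0 is the encoding of `None`, forcing a 0 into the inner `NonZeroU8` through a pointer would silently turn `Some` into `None`, which is the variant flip the comment describes.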

NoTeslaThrow ◴[] No.43645946{3}[source]
> resulting in incorrect code.

Incorrect with respect to an assumption furnished where? Your sibling comment mentions RFCs—is this behavior tied to some kind of documented expectation?

> A simpler example is `Option<NonZeroU8>`, the compiler assumes that `NonZeroU8` can never contain 0, thus it can use 0 as value for `None`. Now, if you take a reference to the inner `NonZeroU8` stored in `Some` and write 0 to it, you changed `Some` to `None`, while other optimizations may rely on the assumption that references to the content of `Some` can not flip the enum variant to `None`.

That seems to be the intended behavior, unless I'm reading incorrectly. Why else would you write a 0 to it? Also, does this not require using the `unsafe` keyword? So is tricking the compiler into producing the behavior you described not the expected and intended behavior?

1. newpavlov ◴[] No.43646036{4}[source]
> Incorrect with respect to an assumption furnished where?

In the definition of the `NonZeroU8` type. Or, in more practical terms, in LLVM: when we generate LLVM IR, we communicate this property to LLVM, which in turn uses it to apply optimizations to our code.

> Also, does this not require using the `unsafe` keyword?

Yes, it requires `unsafe`, and the point is that writing 0 to a `NonZeroU8` is UB because it breaks the locality principle that optimizations depend on for correctness. Applying just one optimization that is incorrect (because of the broken assumption) together with numerous other (correct) optimizations can easily lead to very surprising results, which are practically impossible to predict and debug. This is why having UB in code is considered such anathema: UB in one place may completely break code somewhere far away.
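For contrast, here is a sketch of how the same write looks when it goes through the checked API instead. `try_store` is a hypothetical helper, not a standard-library function, and the UB-triggering write is left as a comment so that it is never executed.

```rust
use std::num::NonZeroU8;

/// Hypothetical helper: store `raw` into `slot` only if doing so keeps the
/// non-zero invariant intact. Returns whether the store happened.
fn try_store(slot: &mut NonZeroU8, raw: u8) -> bool {
    match NonZeroU8::new(raw) {
        Some(v) => {
            *slot = v;
            true
        }
        None => false,
    }
}

fn main() {
    let mut opt = NonZeroU8::new(5);

    if let Some(inner) = opt.as_mut() {
        // A valid value is stored normally.
        assert!(try_store(inner, 9));
        // Zero is rejected, so `opt` cannot be flipped to `None` from here.
        assert!(!try_store(inner, 0));

        // The write discussed above would look roughly like this. It is UB,
        // so it stays commented out:
        // unsafe { *(inner as *mut NonZeroU8 as *mut u8) = 0; }
    }

    assert_eq!(opt.map(NonZeroU8::get), Some(9));
    println!("all checks passed");
}
```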