←back to thread

148 points returningfory2 | 1 comments | | HN request time: 0.575s | source
Show context
mmastrac ◴[] No.43645485[source]
This is a great way to see why invalid UTF-8 strings and unicode chars cause undefined behaviour in Rust. `char` is a special integer type, known to have a valid range which is a sub-range of its storage type. Outside of dataless enums, this is the only datatype with this behaviour (EDIT: I neglected NonZero<...>/NonZeroXXX and some other zero-niche types).

If you manage to construct an invalid char from an invalid string or any other way, you can defeat the niche optimization code and accidentally create yourself an unsound transmute, which is game over for soundness.

replies(5): >>43645776 #>>43645961 #>>43646463 #>>43646643 #>>43651356 #
NoTeslaThrow ◴[] No.43645776[source]
> This is a great way to see why invalid UTF-8 strings and unicode chars cause undefined behaviour in Rust.

What does "undefined behavior" mean without a spec? Wouldn't the behavior rustc produces today be de-facto defined behavior? It seems like the contention is violating some transmute constraint, but does this not result in reproducible runtime behavior? In what context are you framing "soundness"?

EDIT: I'm honestly befuddled why anyone would downvote this. I certainly don't think this is detracting from the conversation at all—how can you understand the semantics of the above comment without understanding what the intended meaning of "undefined behavior" or "soundness" is?

replies(5): >>43645920 #>>43645923 #>>43646838 #>>43647769 #>>43648876 #
newpavlov ◴[] No.43645923[source]
"Undefined behavior" means that the compiler can apply optimizations which assume that it does not happen, resulting in an incorrect code. A simpler example is `Option<NonZeroU8>`, the compiler assumes that `NonZeroU8` can never contain 0, thus it can use 0 as value for `None`. Now, if you take a reference to the inner `NonZeroU8` stored in `Some` and write 0 to it, you changed `Some` to `None`, while other optimizations may rely on the assumption that references to the content of `Some` can not flip the enum variant to `None`.

You don't need a full language spec to declare something UB. And, arguably, from the compiler correctness perspective, there is no fundamental difference between walls of prose in the C/C++ "spec" and the "informal spec" currently used by Rust. (Well, there is the CompCert exception, but it's quite far from the mainstream compilers in many regards)

replies(1): >>43645946 #
NoTeslaThrow ◴[] No.43645946[source]
> resulting in an incorrect code.

Incorrect with respect to an assumption furnished where? Your sibling comment mentions RFCs—is this behavior tied to some kind of documented expectation?

> A simpler example is `Option<NonZeroU8>`, the compiler assumes that `NonZeroU8` can never contain 0, thus it can use 0 as value for `None`. Now, if you take a reference to the inner `NonZeroU8` stored in `Some` and write 0 to it, you changed `Some` to `None`, while other optimizations may rely on the assumption that references to the content of `Some` can not flip the enum variant to `None`.

That seems to be the intended behavior, unless I'm reading incorrectly. Why else would you write a 0 to it? Also, does this not require using the `unsafe` keyword? So is tricking the compiler into producing the behavior you described not the expected and intended behavior?

replies(3): >>43646036 #>>43646048 #>>43649155 #
1. fc417fc802 ◴[] No.43646048[source]
It's not intended in that the compiler may have optimized based on the assumption that you have now gone and violated via an unsafe block. Just as in C that would produce undefined behavior in that there's no telling what consequences the optimization might have. The lack of a formal specification isn't relevant.