146 points returningfory2 | 1 comment
mmastrac ◴[] No.43645485[source]
This is a great way to see why invalid UTF-8 strings and invalid Unicode chars cause undefined behaviour in Rust. `char` is a special integer type, known to have a valid range which is a sub-range of its storage type. Outside of dataless enums, this is the only datatype with this behaviour (EDIT: I neglected NonZero<...>/NonZeroXXX and some other zero-niche types).

If you manage to construct an invalid char from an invalid string or any other way, you can defeat the niche optimization code and accidentally create yourself an unsound transmute, which is game over for soundness.
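
To make the niche concrete, here is a small sketch (the function names are made up for illustration). It shows that `Option<char>` needs no extra space, because the compiler can store `None` in a bit pattern that is never a valid `char`, and that the safe constructor `char::from_u32` refuses exactly those out-of-range values.

```rust
use std::mem::size_of;

/// Returns (size of char, size of Option<char>) in bytes.
/// Both are 4: `None` is encoded in a value outside char's valid range.
fn char_sizes() -> (usize, usize) {
    (size_of::<char>(), size_of::<Option<char>>())
}

/// Safe conversion from a raw code point.
/// Rejects surrogates (0xD800..=0xDFFF) and anything above 0x10FFFF,
/// so safe code can never produce a char that collides with the niche.
fn checked(code: u32) -> Option<char> {
    char::from_u32(code)
}
```

Only `unsafe` routes such as `char::from_u32_unchecked` or a transmute can put an out-of-range value into a `char`, which is where the soundness problem described above comes from.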

timerol ◴[] No.43645961[source]
> Outside of dataless enums, this is the only datatype with this behaviour.

Note that there are non-zero integer types that can also be used in this way, like NonZeroU8 https://doc.rust-lang.org/std/num/type.NonZeroU8.html. The NULL pointer is also used as a niche, and you can create your own as well, as documented in https://www.0xatticus.com/posts/understanding_rust_niche/
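
A quick way to check this yourself is to compare sizes with and without `Option`. This is only a sketch, and `niche_sizes` is a name invented for the example.

```rust
use std::mem::size_of;
use std::num::NonZeroU8;

/// Returns (size of T, size of Option<T>) for a few niche-carrying types.
/// In each pair the two numbers match, because `None` reuses the
/// forbidden zero / null bit pattern instead of needing a separate tag.
fn niche_sizes() -> [(usize, usize); 3] {
    [
        (size_of::<NonZeroU8>(), size_of::<Option<NonZeroU8>>()),
        (size_of::<&u8>(), size_of::<Option<&u8>>()),
        (size_of::<Box<u8>>(), size_of::<Option<Box<u8>>>()),
    ]
}
```

By contrast, a plain `u8` has no spare bit pattern, so `Option<u8>` has to grow to hold a discriminant.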

1. tialaramex ◴[] No.43646076[source]
Well, in practice you can't make your own non-enum types with this property in stable Rust. To unblock this, Rust needs pattern types (I wanted a path to do this a different way, but I was persuaded it's not a good idea). As that link explains, the stdlib does this via a perma-unstable, compiler-use-only attribute.

But yes, there are the NonZero integers, and you can make your own NonBlah integer using the "XOR trick" for a relatively tiny performance overhead. You can also make enums, which is how the current CompactString works.
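
For anyone who hasn't seen the XOR trick, here is a minimal sketch. `NonMaxU8` is a hypothetical type written for this example, not something from the standard library. The idea is to store the value XORed with the forbidden value, so the forbidden value maps to zero and `NonZeroU8` supplies the niche.

```rust
use std::num::NonZeroU8;

/// Hypothetical example type: a u8 that can never be u8::MAX (0xFF).
/// Internally it stores `value ^ 0xFF`, which is zero only for 0xFF.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NonMaxU8(NonZeroU8);

impl NonMaxU8 {
    /// The forbidden value; XORing with it maps it onto zero.
    const FORBIDDEN: u8 = u8::MAX;

    /// Returns None if `value` is the forbidden value.
    pub fn new(value: u8) -> Option<Self> {
        NonZeroU8::new(value ^ Self::FORBIDDEN).map(Self)
    }

    /// Undoes the XOR to recover the original value.
    pub fn get(self) -> u8 {
        self.0.get() ^ Self::FORBIDDEN
    }
}
```

The overhead is one XOR on each construction and each read, and in exchange `Option<NonMaxU8>` fits in a single byte.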

The link you gave mentions that Rust does this for other types; OwnedFd in particular is often useful on Unix systems. Option<OwnedFd> has the same representation as a C file descriptor but the same ergonomics as a fancy high-level data structure. That's the sort of optimisation we're here for.
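
A sketch of how to see this on a Unix target (the function name `fd_sizes` is invented for the example, and it assumes a toolchain recent enough to have `std::os::fd`):

```rust
/// Returns the sizes of RawFd, OwnedFd and Option<OwnedFd> in bytes.
/// OwnedFd is documented never to hold -1, so that value is free
/// to represent `None` and all three sizes come out the same.
#[cfg(unix)]
fn fd_sizes() -> (usize, usize, usize) {
    use std::mem::size_of;
    use std::os::fd::{OwnedFd, RawFd};

    (
        size_of::<RawFd>(),
        size_of::<OwnedFd>(),
        size_of::<Option<OwnedFd>>(),
    )
}
```

So a function returning `Option<OwnedFd>` hands back a plain C `int` at the machine level, with -1 standing in for `None`.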

Alas, the Windows equivalent can't do this because different parts of Microsoft use all zeroes and -1 to mean different things, so both are potentially valid.