
148 points returningfory2 | 1 comments
mmastrac No.43645485
This is a great way to see why invalid UTF-8 strings and invalid Unicode chars cause undefined behaviour in Rust. `char` is a special integer type whose valid range is known to be a sub-range of its storage type. Outside of dataless enums, this is the only datatype with this behaviour (EDIT: I neglected NonZero<...>/NonZeroXXX and some other zero-niche types).

If you manage to construct an invalid char, from an invalid string or any other way, you can defeat the niche optimization and accidentally end up with the equivalent of an unsound transmute, which is game over for soundness.
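
To make the niche point concrete, here is a minimal sketch using only the standard library. The helper name `checked_char` is made up for illustration, and the size assertions describe what current rustc does in practice rather than a documented layout guarantee.

```rust
use std::mem::size_of;

/// Hypothetical helper: the checked conversion from a raw code point.
/// Returns None for surrogates (0xD800..=0xDFFF) and anything above 0x10FFFF.
fn checked_char(code: u32) -> Option<char> {
    char::from_u32(code)
}

fn main() {
    // `char` occupies 4 bytes, but only a sub-range of u32 values is valid.
    assert_eq!(size_of::<char>(), 4);

    // The invalid bit patterns serve as a niche, so `Option<char>` needs no
    // separate tag (observed behaviour on current rustc).
    assert_eq!(size_of::<Option<char>>(), size_of::<char>());

    // `u32` has no invalid bit patterns, so `Option<u32>` has to grow.
    assert!(size_of::<Option<u32>>() > size_of::<u32>());

    assert_eq!(checked_char(0x41), Some('A'));
    assert_eq!(checked_char(0x10_FFFF), Some('\u{10FFFF}'));
    assert_eq!(checked_char(0xD800), None); // surrogate
    assert_eq!(checked_char(0x11_0000), None); // past the last code point

    // By contrast, `unsafe { char::from_u32_unchecked(0xD800) }` would be
    // undefined behaviour: the compiler is allowed to assume that value never
    // occurs, for example when it picks a niche to represent `None`.
}
```

The checked constructor is the safe boundary. Only the `unsafe` unchecked variants (or a transmute) can produce a `char` outside the valid range.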

replies(5): >>43645776 #>>43645961 #>>43646463 #>>43646643 #>>43651356 #
hinkley No.43646463
I seem to recall someone posting a ridiculously fast utf-8 validator here based on SIMD instructions. Nothing is free but some things can be dirt cheap.
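
For reference, a minimal sketch of validation at the boundary using only Rust's standard library (this is the ordinary `std::str::from_utf8` check, not the SIMD validator mentioned above). The helper name `validate` is an assumption for illustration.

```rust
/// Hypothetical helper: returns the validated &str, or the byte offset
/// up to which the input was valid UTF-8.
fn validate(bytes: &[u8]) -> Result<&str, usize> {
    std::str::from_utf8(bytes).map_err(|e| e.valid_up_to())
}

fn main() {
    // Well-formed input passes through unchanged.
    assert_eq!(validate("héllo".as_bytes()), Ok("héllo"));

    // 0xFF never appears in well-formed UTF-8.
    assert_eq!(validate(&[b'a', b'b', 0xFF, b'c']), Err(2));

    // An encoded surrogate (U+D800 as ED A0 80) is rejected, which is what
    // keeps invalid chars from ever coming out of a &str.
    assert_eq!(validate(&[0xED, 0xA0, 0x80]), Err(0));

    // A truncated two-byte sequence is rejected as well.
    assert_eq!(validate(&[b'a', 0xC3]), Err(1));
}
```

Skipping this check with `std::str::from_utf8_unchecked` is only sound if the bytes are already known to be valid UTF-8.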
replies(1): >>43649353 #
wolf550e No.43649353
simdutf [1] from the same people who did simdjson [2]

1 - https://simdutf.github.io/simdutf/

2 - https://simdjson.org/