
Regex Isn't Hard (2023)

(timkellogg.me)
75 points asicsp | 4 comments
nickez ◴[] No.43750462[source]
Found an error immediately: "Any lowercase character" doesn't match all Swedish lowercase characters.
replies(2): >>43750511 #>>43750527 #
comrade1234 ◴[] No.43750527[source]
lol really? Why not? Is that true for all encodings? Is it a bug or a feature? What about a simple character set like gsm-7 Swedish?
replies(2): >>43750578 #>>43750584 #
1. lalaithion ◴[] No.43750584[source]
The author says “any lowercase character” but they mean “any character between the character ‘a’ and the character ‘z’”, which happens to correspond to the lower case letters in English but doesn’t include ü, õ, ø, etc.
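
A quick way to see this for yourself, as a minimal sketch using Python's standard re module (the sample letters are just an illustrative pick, written as escapes so the code points are unambiguous):

```python
import re

# [a-z] is a range over code points from 'a' (U+0061) to 'z' (U+007A).
ascii_lower = re.compile(r"[a-z]")

# Illustrative sample: a z å ä ö ü õ ø
samples = ["a", "z", "\u00e5", "\u00e4", "\u00f6", "\u00fc", "\u00f5", "\u00f8"]

# Map each character to whether [a-z] matches it.
results = {ch: bool(ascii_lower.fullmatch(ch)) for ch in samples}

for ch, matched in results.items():
    print(f"{ch!r}: islower()={ch.islower()}, matches [a-z]={matched}")
```

Every sample is lowercase according to str.islower(), but only 'a' and 'z' match the range.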
replies(2): >>43750991 #>>43751211 #
2. comrade1234 ◴[] No.43750991[source]
I would expect [a-z] to mean any lowercase letter in any language, not just the lowercase letters a to z. So I’d get bitten by that one.
replies(1): >>43751137 #
3. deciduously ◴[] No.43751137[source]
The letters with diacritics have code points above 'z' in ASCII-compatible encodings such as Latin-1 and Unicode, so it does stand to reason they wouldn't appear in that range.
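
For what it's worth, here is a small Python sketch of that ordering. It compares Unicode code points only (an assumption about what "sorts after" means here; locale-aware collation is a separate matter), and the letter list is an illustrative pick:

```python
# Code point bounds of the [a-z] range.
lower_bound, upper_bound = ord("a"), ord("z")  # 97 and 122

# Illustrative sample: å ä ö ü õ ø
letters = ["\u00e5", "\u00e4", "\u00f6", "\u00fc", "\u00f5", "\u00f8"]
code_points = {ch: ord(ch) for ch in letters}

# True when every sampled letter lies above 'z' in code point order.
all_above_z = all(cp > upper_bound for cp in code_points.values())

for ch, cp in code_points.items():
    print(f"{ch!r} is U+{cp:04X}")
print("all above 'z':", all_above_z)
```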
4. Someone ◴[] No.43751211[source]
> but they mean “any character between the character ‘a’ and the character ‘z’”, which happens to correspond to the lower case letters in English

‘Only’ in the most commonly used character encodings. In EBCDIC (https://en.wikipedia.org/wiki/EBCDIC), the [a-z] range includes more than 26 characters.
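
A rough way to check the EBCDIC claim, assuming Python's built-in cp037 codec (one common EBCDIC code page; other EBCDIC variants may fill the gaps differently):

```python
# Byte values of 'a' and 'z' in the cp037 EBCDIC code page.
start = "a".encode("cp037")[0]
end = "z".encode("cp037")[0]

# Every character whose byte value falls in the a..z range, inclusive.
in_range = [bytes([b]).decode("cp037") for b in range(start, end + 1)]

# Characters in that range that are not the ASCII letters a to z.
extras = [ch for ch in in_range if not ("a" <= ch <= "z")]

print(f"range 0x{start:02X}..0x{end:02X} holds {len(in_range)} characters")
print("non-letters in the range:", extras)
```

The letters sit in three separate runs (a to i, j to r, s to z), so a byte-wise range from 'a' to 'z' also picks up whatever lives in the gaps.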

That’s one of the reasons POSIX has character classes (https://en.wikipedia.org/wiki/Regular_expression#Character_c...). Inside a bracket expression, as in [[:lower:]], [:lower:] gets you the lowercase characters as defined by the current locale, in whatever encoding the program uses.
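
Python's standard re module doesn't support POSIX bracket classes, so the following is only a rough analogue, not the POSIX mechanism itself: it uses the Unicode general category "Ll" (lowercase letter) rather than the locale. The helper names are made up for illustration.

```python
import unicodedata


def is_lowercase_letter(ch: str) -> bool:
    """Hypothetical helper: True for one character in Unicode category Ll."""
    return len(ch) == 1 and unicodedata.category(ch) == "Ll"


def all_lowercase(text: str) -> bool:
    """Hypothetical helper: True when text is non-empty and all lowercase letters."""
    return bool(text) and all(is_lowercase_letter(ch) for ch in text)


print(all_lowercase("sm\u00f6rg\u00e5sbord"))  # Swedish word with ö and å
print(all_lowercase("Sm\u00f6rg\u00e5sbord"))  # leading capital
print(all_lowercase("abc1"))                   # contains a digit
```

Unlike [a-z], this accepts å, ä and ö, but it answers "lowercase according to Unicode", which may differ from what a given locale's [:lower:] reports.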