The provenance memory model for C

(gustedt.wordpress.com)

225 points HexDecOctBin | 1 comments | 30 Jun 25 09:25 UTC

zombot ◴[30 Jun 25 12:17 UTC] No.44422335[source]▶

Does C allow Unicode identifiers now, or is that pseudo code? The code snippets also contain `&amp;`, so something definitely went wrong with the transcoding to HTML.

replies(4): >>44422382 #>>44422416 #>>44422634 #>>44424896 #

qsort ◴[30 Jun 25 12:21 UTC] No.44422382[source]▶

>>44422335 #

Quoting cppreference:

An identifier is an arbitrarily long sequence of digits, underscores, lowercase and uppercase Latin letters, and Unicode characters specified using \u and \U escape notation (since C99), of class XID_Continue (since C23). A valid identifier must begin with a non-digit character (Latin letter, underscore, or Unicode non-digit character (since C99, until C23), or Unicode character of class XID_Start (since C23)). Identifiers are case-sensitive (lowercase and uppercase letters are distinct). Every identifier must conform to Normalization Form C (since C23).

In practice it depends on the compiler.
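As a minimal sketch of the escape notation described in the quote, the identifiers below are spelled with universal character names (`\u00ef` is ï and `\u00e9` is é, both precomposed and therefore in Normalization Form C). The function and variable names are hypothetical, chosen only for illustration. Recent GCC and Clang accept this spelling; other compilers may not.

```c
/* Hypothetical names, for illustration only.
   \u00ef is U+00EF (i with diaeresis), \u00e9 is U+00E9 (e with acute),
   so the identifiers read "naïve_sum" and "résultat". */
static int na\u00efve_sum(int a, int b) {
    int r\u00e9sultat = a + b;  /* an ordinary local variable with an escaped name */
    return r\u00e9sultat;
}
```

A caller writes the name with the same escape, for example `na\u00efve_sum(2, 3)`. Whether the directly typed form `naïve_sum` is also accepted depends on how the compiler treats the source character set.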

replies(1): >>44422453 #

dgrunwald ◴[30 Jun 25 12:30 UTC] No.44422453[source]▶

>>44422382 #

But the source character set remains implementation-defined, so compilers do not have to directly support Unicode names, only the escape notation.

Definitely a questionable choice to throw off readers with Unicode weirdness in the very first code example.

replies(1): >>44422534 #

qsort ◴[30 Jun 25 12:39 UTC] No.44422534[source]▶

>>44422453 #

If it were up to me, anything outside the basic character set in a source file would be a syntax error; I'm simply reporting what the spec says.

replies(2): >>44422647 #>>44423260 #

guipsp ◴[30 Jun 25 13:50 UTC] No.44423260[source]▶

>>44422534 #

What a "basic character set" is depends on the locale.

replies(2): >>44423859 #>>44424381 #

account42 ◴[30 Jun 25 15:16 UTC] No.44424381[source]▶

>>44423260 #

Anything except US-ASCII in source code outside comments and string constants should be a syntax error.
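A rough sketch of how such a rule could be checked outside the compiler: scan the source bytes and report the first one that is not 7-bit US-ASCII. The function name `first_non_ascii` is hypothetical, and unlike the rule stated above this simplified version does not exempt comments or string constants.

```c
#include <stddef.h>

/* Returns the index of the first byte outside 7-bit US-ASCII,
   or -1 if every byte in src[0..len) is ASCII.
   Simplification: comments and string literals are NOT skipped. */
static long first_non_ascii(const char *src, size_t len) {
    for (size_t i = 0; i < len; i++) {
        /* Cast to unsigned char so the comparison works even where char is signed. */
        if ((unsigned char)src[i] > 0x7F) {
            return (long)i;
        }
    }
    return -1;
}
```

Exempting comments and strings would need at least a small tokenizer on top of this, which is left out here.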

replies(1): >>44424740 #

guipsp ◴[30 Jun 25 15:51 UTC] No.44424740[source]▶

>>44424381 #

You are aware other languages exist? Some of which don't even use the Latin script?

replies(3): >>44424911 #>>44426840 #>>44431566 #

account42 ◴[01 Jul 25 07:43 UTC] No.44431566[source]▶

>>44424740 #

And those are not programming languages, or at least not the C programming language, which only needs a very limited character set.

replies(1): >>44436431 #

steveklabnik ◴[01 Jul 25 17:54 UTC] No.44436431[source]▶

>>44431566 #

C does allow for limited Unicode in identifiers, though you need to use the \u prefix and write out the code point. Compilers like clang let it work like C++ and follow TR31, though this is nonstandard.

replies(1): >>44441249 #

account42 ◴[02 Jul 25 08:14 UTC] No.44441249[source]▶

>>44436431 #

Yes, these are the relatively recent additions being discussed here. C and C++ managed just fine for ages without them before the committees decided that scoring brownie points with performative changes was more important than security and readability of source files.
