The provenance memory model for C

1. zombot ◴[30 Jun 25 12:17 UTC] No.44422335[source]▶

Does C allow Unicode identifiers now, or is that pseudocode? The code snippets also contain `&amp;`, so something definitely went wrong with the transcoding to HTML.

replies(4): >>44422382 #>>44422416 #>>44422634 #>>44424896 #

2. qsort ◴[30 Jun 25 12:21 UTC] No.44422382[source]▶

>>44422335 (TP) #

Quoting cppreference:

An identifier is an arbitrarily long sequence of digits, underscores, lowercase and uppercase Latin letters, and Unicode characters specified using \u and \U escape notation (since C99), of class XID_Continue (since C23). A valid identifier must begin with a non-digit character (Latin letter, underscore, or Unicode non-digit character (since C99) (until C23), or Unicode character of class XID_Start (since C23)). Identifiers are case-sensitive (lowercase and uppercase letters are distinct). Every identifier must conform to Normalization Form C (since C23).

In practice it depends on the compiler.
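As a rough illustration of the escape notation described in the quote, here is a minimal sketch. It assumes a compiler that accepts universal character names in identifiers (recent GCC and Clang in their default modes should), and the function and variable names are made up for the example:

```c
/* Identifiers spelled with universal character names (UCNs).
   The source file itself stays pure ASCII. */
int sum_with_ucn_names(int a, int b) {
    int \u00E9t\u00E9 = a;  /* U+00E9 is e-acute, so this name reads as "ete" with accents */
    int \u03B4 = b;         /* U+03B4 is Greek small letter delta */
    return \u00E9t\u00E9 + \u03B4;
}
```

Calling `sum_with_ucn_names(2, 3)` just adds the two arguments. Whether the same names can be typed directly as UTF-8 characters is the part that varies between compilers.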

replies(1): >>44422453 #

3. unwind ◴[30 Jun 25 12:26 UTC] No.44422416[source]▶

>>44422335 (TP) #

I can't even view the post; I just get some kind of content-management-system-like view with the page as JSON or something, in pink-on-white. I'm super confused. :|

The answer to your question seems to (still) be "no".

4. dgrunwald ◴[30 Jun 25 12:30 UTC] No.44422453[source]▶

>>44422382 #

But the source character set remains implementation-defined, so compilers do not have to directly support unicode names, only the escape notation.

Definitely a questionable choice to throw off readers with unicode weirdness in the very first code example.

replies(1): >>44422534 #

5. qsort ◴[30 Jun 25 12:39 UTC] No.44422534{3}[source]▶

>>44422453 #

If it were up to me, anything outside the basic character set in a source file would be a syntax error, I'm simply reporting what the spec says.

replies(2): >>44422647 #>>44423260 #

6. pjmlp ◴[30 Jun 25 12:50 UTC] No.44422634[source]▶

>>44422335 (TP) #

Besides the sibling comment on C23, it does work fine on GCC.

https://godbolt.org/z/qKejzc1Kb

Whereas clang loudly complains,

https://godbolt.org/z/qWrccWzYW

7. ncruces ◴[30 Jun 25 12:51 UTC] No.44422647{4}[source]▶

>>44422534 #

I use unicode for math in comments, and I think it makes certain complicated formulas far more readable.

replies(1): >>44424358 #

8. guipsp ◴[30 Jun 25 13:50 UTC] No.44423260{4}[source]▶

>>44422534 #

What a "basic character set" is depends on locale

replies(2): >>44423859 #>>44424381 #

9. qsort ◴[30 Jun 25 14:34 UTC] No.44423859{5}[source]▶

>>44423260 #

https://en.cppreference.com/w/c/language/charset.html

10. kzrdude ◴[30 Jun 25 15:14 UTC] No.44424358{5}[source]▶

>>44422647 #

I've just been learning pinyin notation, so now I think the variable řₚ should have a value that first goes down a bit and then up.

replies(1): >>44424558 #

11. account42 ◴[30 Jun 25 15:16 UTC] No.44424381{5}[source]▶

>>44423260 #

Anything except US-ASCII in source code outside comments and string constants should be a syntax error.

replies(1): >>44424740 #

12. zelphirkalt ◴[30 Jun 25 15:33 UTC] No.44424558{6}[source]▶

>>44424358 #

I am not sure it is a good idea to mix such specific phonetic script ideas about diacritic marks with the behavior of the program over time. Even considering the shape, it does not align with the idea of first down a little, then up a lot.

replies(1): >>44432063 #

13. guipsp ◴[30 Jun 25 15:51 UTC] No.44424740{6}[source]▶

>>44424381 #

You are aware other languages exist? Some of which don't even use the Latin script?

replies(3): >>44424911 #>>44426840 #>>44431566 #

14. Y_Y ◴[30 Jun 25 16:06 UTC] No.44424896[source]▶

>>44422335 (TP) #

Implementation-defined until C99, explicitly possible via UCNs since C99, possible with explicit encoding since C23, but literals are still implementation-defined.
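For literals, one way to avoid depending on the implementation-defined execution character set is the `u8` prefix (C11 and later), which fixes the encoding to UTF-8. A minimal sketch, with hypothetical function names:

```c
#include <stddef.h>

/* Number of bytes (excluding the terminating NUL) in the UTF-8
   encoding of U+00E9. The u8 prefix pins the encoding to UTF-8
   regardless of the execution character set. */
size_t e_acute_utf8_length(void) {
    return sizeof(u8"\u00E9") - 1;
}

/* Returns byte i of that encoding as an unsigned value.
   The cast keeps this valid both where u8 literals are arrays of
   char (C11/C17) and where they are arrays of char8_t (C23). */
unsigned e_acute_utf8_byte(size_t i) {
    const unsigned char *bytes = (const unsigned char *)u8"\u00E9";
    return bytes[i];
}
```

Here the length is 2 and the bytes are 0xC3 and 0xA9. With a plain `"\u00E9"` literal, the bytes depend on the execution character set the compiler is configured for.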

15. Y_Y ◴[30 Jun 25 16:07 UTC] No.44424911{7}[source]▶

>>44424740 #

What; like APL‽

16. nottorp ◴[30 Jun 25 19:16 UTC] No.44426840{7}[source]▶

>>44424740 #

Dunno about the OP but I'm very aware as I'm not an English speaker.

I still don't want anything as unpredictable as Unicode in my code. How many different encodings will display as the same variable name and how is the compiler supposed to decide?

If you're thinking of comments and user facing strings, the OP already excluded those.

replies(1): >>44435280 #

17. account42 ◴[01 Jul 25 07:43 UTC] No.44431566{7}[source]▶

>>44424740 #

And those are not programming languages, or at least not the C programming language which only needs a very limited character set.

replies(1): >>44436431 #

18. kzrdude ◴[01 Jul 25 09:19 UTC] No.44432063{7}[source]▶

>>44424558 #

To be sure, it's a joke. Mostly trying to joke at the expense of these excessively complicated variable names (that are only there because it's pseudocode) :)

And yeah, the Chinese tone in practice does not align with the idea of "down a little up a lot" either. It depends on context...

19. cryptonector ◴[01 Jul 25 16:03 UTC] No.44435280{8}[source]▶

>>44426840 #

The language and compiler & linker should reject Zalgo in identifiers, and they should reject confusable script mixes in identifiers, but otherwise they should treat all equivalent strings as equivalent. To make it easier on the linker, compilers should normalize all symbols to one common form (e.g., NFC).
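To see why normalization matters, here is a small sketch. The byte-wise comparison is only an assumption about how a tool might match symbol names, and all names are hypothetical. The two strings render identically but differ as UTF-8 bytes:

```c
#include <string.h>

/* "cafe" ending in precomposed U+00E9 (NFC form), as UTF-8 bytes. */
static const char nfc_name[] = "caf" "\xC3\xA9";
/* "cafe" followed by combining acute accent U+0301 (NFD form). */
static const char nfd_name[] = "cafe" "\xCC\x81";

/* Byte-wise comparison, with no Unicode normalization applied.
   Returns 1 if the names are identical byte for byte, 0 otherwise. */
int names_match_bytewise(const char *a, const char *b) {
    return strcmp(a, b) == 0;
}

/* Compares the NFC and NFD spellings of the same visible name. */
int nfc_and_nfd_match(void) {
    return names_match_bytewise(nfc_name, nfd_name);
}
```

`nfc_and_nfd_match()` returns 0 because the byte sequences differ. If everything is normalized to NFC first, both spellings collapse to the same bytes and a plain comparison is enough.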

20. steveklabnik ◴[01 Jul 25 17:54 UTC] No.44436431{8}[source]▶

>>44431566 #

C does allow for limited unicode in identifiers, though you need to use the \u prefix and write the code out. Compilers like clang let it work like C++ and follow TR31, though this is nonstandard.

replies(1): >>44441249 #

21. account42 ◴[02 Jul 25 08:14 UTC] No.44441249{9}[source]▶

>>44436431 #

Yes, these are the relatively recent additions being discussed here. C and C++ managed just fine for ages without them before the committees decided that scoring brownie points with performative changes was more important than security and readability of source files.