288 points by Twirrim | 95 comments
1. WalterBright ◴[] No.41875254[source]
D made a great leap forward with the following:

1. bytes are 8 bits

2. shorts are 16 bits

3. ints are 32 bits

4. longs are 64 bits

5. arithmetic is 2's complement

6. IEEE floating point

and a big chunk of wasted time trying to abstract these away and getting it wrong anyway was saved. Millions of people cried out in relief!

Oh, and Unicode was the character set. Not EBCDIC, RADIX-50, etc.

replies(6): >>41875486 #>>41875539 #>>41875878 #>>41876632 #>>41878715 #>>41881672 #
2. gerdesj ◴[] No.41875486[source]
"1. bytes are 8 bits"

How big is a bit?

replies(8): >>41875621 #>>41875701 #>>41875768 #>>41876060 #>>41876149 #>>41876238 #>>41877432 #>>41877720 #
3. cogman10 ◴[] No.41875539[source]
Yeah, this is something Java got right as well. It got "unsigned" wrong, but it got standardizing primitive sizes correct:

byte = 8 bits

short = 16

int = 32

long = 64

float = 32 bit IEEE

double = 64 bit IEEE

replies(3): >>41875597 #>>41875634 #>>41877440 #
4. jltsiren ◴[] No.41875597[source]
I like the Rust approach more: usize/isize are the native integer types, and with every other numeric type, you have to mention the size explicitly.

On the C++ side, I sometimes use an alias that contains the word "short" for 32-bit integers. When I use them, I'm explicitly assuming that the numbers are small enough to fit in a smaller than usual integer type, and that it's critical enough to performance that the assumption is worth making.

replies(3): >>41875695 #>>41875827 #>>41875847 #
5. poincaredisk ◴[] No.41875621[source]
A bit is either a 0 or 1. A byte is the smallest addressable piece of memory in your architecture.
replies(2): >>41875706 #>>41875737 #
6. josephg ◴[] No.41875634[source]
Yep. Pity about getting chars / string encoding wrong though. (Java chars are 16 bits).

But it’s not alone in that mistake. All the languages invented in that era made the same mistake. (C#, JavaScript, etc).

replies(3): >>41875696 #>>41876204 #>>41876445 #
7. Jerrrrrrry ◴[] No.41875695{3}[source]
hindsight has its advantages
8. paragraft ◴[] No.41875696{3}[source]
What's the right way?
replies(3): >>41875771 #>>41875782 #>>41878247 #
9. CoastalCoder ◴[] No.41875701[source]
> How big is a bit?

A quarter nybble.

10. elromulous ◴[] No.41875706{3}[source]
Technically the smallest addressable piece of memory is a word.
replies(5): >>41876026 #>>41876056 #>>41876953 #>>41877868 #>>41884816 #
11. Nevermark ◴[] No.41875737{3}[source]
Which … if your heap always returns N bit aligned values, for some N … is there a name for that? The smallest heap addressable segment?
12. thamer ◴[] No.41875768[source]
This doesn't feel like a serious question, but in case this is still a mystery to you… the name bit is a portmanteau of binary digit, and as indicated by the word "binary", there are only two possible digits that can be used as values for a bit: 0 and 1.
replies(2): >>41877360 #>>41896800 #
13. WalterBright ◴[] No.41875771{4}[source]
UTF-8

When D was first implemented, circa 2000, it wasn't clear whether UTF-8, UTF-16, or UTF-32 was going to be the winner. So D supported all three.

14. Remnant44 ◴[] No.41875782{4}[source]
utf8, for essentially the reasons mentioned in this manifesto: https://utf8everywhere.org/
replies(1): >>41875952 #
15. jonstewart ◴[] No.41875827{3}[source]
<cstdint> has int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, and uint64_t. I still go back and forth between uint64_t, size_t, and unsigned int, but am defaulting to uint64_t more and more, even if it doesn't matter.
replies(1): >>41885165 #
16. kazinator ◴[] No.41875847{3}[source]
> you have to mention the size explicitly

It's unbelievably ugly. Every piece of code working with any kind of integer screams "I am hardware dependent in some way".

E.g. in a structure representing an automobile, the number of wheels has to be some i8 or i16, which looks ridiculous.

Why would you take a language in which you can write functional pipelines over collections of objects, and make it look like assembler.

replies(3): >>41875953 #>>41876035 #>>41879486 #
17. Laremere ◴[] No.41875878[source]
Zig is even better:

1. u8 and i8 are 8 bits.

2. u16 and i16 are 16 bits.

3. u32 and i32 are 32 bits.

4. u64 and i64 are 64 bits.

5. Arithmetic is an explicit choice. '+' overflowing is illegal behavior (it will crash in Debug and ReleaseSafe), '+%' is 2's complement wrapping, and '+|' is saturating arithmetic. Edit: forgot to mention @addWithOverflow(), which provides a tuple of the original type and a u1; there's also std.math.add(), which returns an error on overflow.

6. f16, f32, f64, f80, and f128 are the IEEE floating-point types of the respective bit lengths.

The question of the length of a byte doesn't even matter. If someone wants to compile to a machine whose bytes are 12 bits, just use u12 and i12.

replies(6): >>41876011 #>>41876015 #>>41876480 #>>41876942 #>>41877281 #>>41878278 #
18. josephg ◴[] No.41875952{5}[source]
Yep. Notably supported by go, python3, rust and swift. And probably all new programming languages created from here on.
19. pezezin ◴[] No.41875953{4}[source]
If you don't care about the size of your number, just use isize or usize.

If you do care, then isn't it better to specify it explicitly than trying to guess it and having different compilers disagreeing on the size?

replies(2): >>41875968 #>>41877751 #
20. kazinator ◴[] No.41875968{5}[source]
A type called isize is some kind of size. It looks wrong for something that isn't a size.
replies(1): >>41876423 #
21. __turbobrew__ ◴[] No.41876011[source]
This is the way.
22. Spivak ◴[] No.41876015[source]
How does 5 work in practice? Surely no one is actually checking if their arithmetic overflows, especially from user-supplied or otherwise external values. Is there any use for the normal +?
replies(1): >>41876229 #
23. asveikau ◴[] No.41876026{4}[source]
Depends on your definition of addressable.

Lots of CISC architectures allow memory accesses in various units even if they call general-purpose-register-sized quantities "word".

Iirc the C standard specifies that all memory can be accessed via char*.

24. Spivak ◴[] No.41876035{4}[source]
Is it any better calling it an int where it's assumed to be an i32 and 30 of the bits are wasted?
replies(1): >>41882135 #
25. Maxatar ◴[] No.41876056{4}[source]
I don't think the term word has any consistent meaning. Certainly x86 doesn't use the term word to mean smallest addressable unit of memory. The x86 documentation defines a word as 16 bits, but x86 is byte addressable.

ARM is similar, ARM processors define a word as 32-bits, even on 64-bit ARM processors, but they are also byte addressable.

As best as I can tell, it seems like a word is whatever the size of the arithmetic or general purpose register is at the time that the processor was introduced, and even if later a new processor is introduced with larger registers, for backwards compatibility the size of a word remains the same.

26. nonameiguess ◴[] No.41876060[source]
How philosophical do you want to get? Technically, voltage is a continuous signal, but we sample only at clock cycle intervals, and if the sample at some cycle is below a threshold, we call that 0. Above, we call it 1. Our ability to measure whether a signal is above or below a threshold is uncertain, though, so for values where the actual difference is less than our ability to measure, we have to conclude that a bit can actually take three values: 0, 1, and we can't tell but we have no choice but to pick one.

The latter value is clearly less common than 0 and 1, but how much less? I don't know, but we have to conclude that the true size of a bit is probably something more like 1.00000000000000001 bits rather than 1 bit.

27. basementcat ◴[] No.41876149[source]
A bit is a measure of information theoretical entropy. Specifically, one bit has been defined as the uncertainty of the outcome of a single fair coin flip. A single less than fair coin would have less than one bit of entropy; a coin that always lands heads up has zero bits, n fair coins have n bits of entropy and so on.

https://en.m.wikipedia.org/wiki/Information_theory

https://en.m.wikipedia.org/wiki/Entropy_(information_theory)
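In symbols, the definition above is the Shannon entropy of a source whose outcomes have probabilities p_i:

```latex
H(X) = -\sum_{i} p_i \log_2 p_i
```

For a fair coin, H = -(1/2 log2 1/2 + 1/2 log2 1/2) = 1 bit; for a coin that always lands heads (p = 1), H = 0; and entropy is additive over independent sources, so n fair coins carry n bits.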

replies(1): >>41876245 #
28. jeberle ◴[] No.41876204{3}[source]
Java strings are byte[]'s if their contents contain only Latin-1 values (the first 256 codepoints of Unicode). This shipped in Java 9.

JEP 254: Compact Strings

https://openjdk.org/jeps/254

29. dullcrisp ◴[] No.41876229{3}[source]
You think no one checks if their arithmetic overflows?
replies(1): >>41876357 #
30. dullcrisp ◴[] No.41876238[source]
At least 2 or 3
31. fourier54 ◴[] No.41876245{3}[source]
That is a bit in information theory. It has nothing to do with the computer/digital engineering term being discussed here.
replies(1): >>41876484 #
32. Spivak ◴[] No.41876357{4}[source]
I'm sure it's not literally no one but I bet the percent of additions that have explicit checks for overflow is for all practical purposes indistinguishable from 0.
replies(1): >>41876729 #
33. pezezin ◴[] No.41876423{6}[source]
Then just define a type alias, which is good practice if you want your types to be more descriptive: https://doc.rust-lang.org/reference/items/type-aliases.html
replies(1): >>41876975 #
34. davidgay ◴[] No.41876445{3}[source]
Java was just unlucky: it standardised its strings at the wrong time (when Unicode was 16-bit code points). Java was announced in May 1995, and the following comment from the Unicode history wiki page makes it clear what happened: "In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. ..."
35. notfed ◴[] No.41876480[source]
Same deal with Rust.
replies(1): >>41878431 #
36. sirsinsalot ◴[] No.41876484{4}[source]
This comment I feel sure would repulse Shannon in the deepest way. A (digital, stored) bit abstractly seeks to encode, and make useful through computation, the properties of information theory.

Your comment must be sarcasm or satire, surely.

replies(1): >>41880541 #
37. stkdump ◴[] No.41876632[source]
I mean practically speaking in C++ we have (it just hasn't made it to the standard):

1. char 8 bit

2. short 16 bit

3. int 32 bit

4. long long 64 bit

5. arithmetic is 2s complement

6. IEEE floating point (float is 32, double is 64 bit)

Along with other stuff like little endian, etc.

Some people just mistakenly think they can't rely on such stuff, because it isn't in the standard. But they forget that having an ISO standard comes on top of what most other languages have, which rely solely on the documentation.
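Those de facto guarantees can be pinned down at compile time rather than merely relied on; a minimal sketch (the specific checks are mine, dropped into any translation unit):

```c
#include <assert.h>   /* static_assert macro (C11) */
#include <limits.h>
#include <stdint.h>

/* Fail the build if the platform deviates from the "de facto" model
 * described above: 8-bit char, 16-bit short, 32-bit int, 64-bit
 * long long, two's complement arithmetic, 32/64-bit IEEE floats. */
static_assert(CHAR_BIT == 8, "char is 8 bits");
static_assert(sizeof(short) == 2, "short is 16 bits");
static_assert(sizeof(int) == 4, "int is 32 bits");
static_assert(sizeof(long long) == 8, "long long is 64 bits");
static_assert((-1 & 3) == 3, "two's complement");
static_assert(sizeof(float) == 4 && sizeof(double) == 8,
              "IEEE single and double sizes");
```

If a codebase only targets such platforms, these asserts turn the informal assumption into a hard build error on any exotic machine.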

replies(2): >>41876956 #>>41877929 #
38. nox101 ◴[] No.41876729{5}[source]
Lots of secure code checks for overflow

    fillBufferWithData(buffer, data, offset, size)
You want to know that offset + size don't wrap past 32bits (or 64) and end up with nonsense and a security vulnerability.
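A sketch of such a check in C; fillBufferWithData and its signature are hypothetical, the point being that the bound is tested without ever performing the add that could wrap:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bounds-checked copy: returns false instead of letting
 * offset + size wrap around and slip past a naive length check. */
static bool fillBufferWithData(uint8_t *buf, size_t buf_len,
                               const uint8_t *data,
                               size_t offset, size_t size) {
    /* Equivalent to offset + size <= buf_len, rearranged so no
     * intermediate result can overflow. */
    if (size > buf_len || offset > buf_len - size)
        return false;
    memcpy(buf + offset, data, size);
    return true;
}
```

Writing the comparison as `offset > buf_len - size` (after checking `size <= buf_len`) is the classic trick: the subtraction cannot underflow, so the test is exact even when `offset + size` would wrap.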
39. mort96 ◴[] No.41876942[source]
Eh I like the nice names. Byte=8, short=16, int=32, long=64 is my preferred scheme when implementing languages. But either is better than C and C++.
replies(1): >>41878505 #
40. mort96 ◴[] No.41876953{4}[source]
Every ISA I've ever used has used the term "word" to describe a 16- or 32-bit quantity, while having instructions to load and store individual bytes (8 bit quantities). I'm pretty sure you're straight up wrong here.
41. mort96 ◴[] No.41876956[source]
> (it just hasn't made it to the standard)

That's the problem

replies(1): >>41882601 #
42. kazinator ◴[] No.41876975{7}[source]
Nope! Because then you will also define an alias, and Suzy will define an alias, and Bob will define an alias, ...

We should all agree on int and uint; not some isize nonsense, and not bobint or suzyint.

replies(3): >>41877079 #>>41877189 #>>41884530 #
43. jclulow ◴[] No.41877079{8}[source]
Alas, it's pretty clear that we won't!
44. pezezin ◴[] No.41877189{8}[source]
Ok, it is obvious that you are looking for something to complain about and don't want to find a solution. That is not a productive attitude in life, but whatever floats your boat. Have a good day.
replies(1): >>41882105 #
45. Cloudef ◴[] No.41877281[source]
Zig allows any uX and iX in the range of 1 - 65,535, as well as u0
replies(1): >>41880263 #
46. seoulbigchris ◴[] No.41877360{3}[source]
So trinary and quaternary digits are trits and quits?
replies(1): >>41877866 #
47. zombot ◴[] No.41877432[source]
If your detector is sensitive enough, it could be just a single electron that's either present or absent.
48. pjmlp ◴[] No.41877440[source]
While I don't agree with not having unsigned as part of the primitive types, and look forward to Valhalla fixing that, it was based on the experience that most devs don't get unsigned arithmetic right.

"For me as a language designer, which I don't really count myself as these days, what "simple" really ended up meaning was could I expect J. Random Developer to hold the spec in his head. That definition says that, for instance, Java isn't -- and in fact a lot of these languages end up with a lot of corner cases, things that nobody really understands. Quiz any C developer about unsigned, and pretty soon you discover that almost no C developers actually understand what goes on with unsigned, what unsigned arithmetic is. Things like that made C complex. The language part of Java is, I think, pretty simple. The libraries you have to look up."

http://www.gotw.ca/publications/c_family_interview.htm

49. amelius ◴[] No.41877720[source]
Depends on your physical media.
50. heinrich5991 ◴[] No.41877751{5}[source]
Actually, if you don't care about the size of your small number, use `i32`. If it's a big number, use `i64`.

`isize`/`usize` should only be used for memory-related quantities — that's why they renamed from `int`/`uint`.

replies(1): >>41882127 #
51. eqvinox ◴[] No.41877866{4}[source]
Yes, trit is commonly used for ternary logic. "quit" I have never heard in such a context.
52. bregma ◴[] No.41877868{4}[source]
The difference between address A and address A+1 is one byte. By definition.

Some hardware may raise an exception if you attempt to retrieve a value at an address that is not a (greater than 1) multiple of a byte, but that has no bearing on the definition of a byte.

53. bregma ◴[] No.41877929[source]
I work every day with real-life systems where int can be 32 or 64 bits, long long can be 64 or 128 bits, long double can be 64 or 80 or 128 bits, some systems do not have IEEE 754 floating point (no denormals!) some are big endian and some are little endian. These things are not in the language standard because they are not standard in the real world.

Practically speaking, the language is the way it is, and has succeeded so well for so long, because it meets the requirements of its application.

replies(1): >>41882715 #
54. josefx ◴[] No.41878247{4}[source]
I would say anyone mentioning a specific encoding / size just wants to see the world burn. Unicode is variable-length at several levels: how many people want to deal with the fact that their text could be non-normalized, or want the ability to cut out individual "char" elements only to get a nonsensical result because the following elements were logically connected to that char? Give developers a decent high-level abstraction and don't force them to deal with the raw bits unless they ask for it.
replies(1): >>41881177 #
55. Someone ◴[] No.41878278[source]
LLVM has:

i1 is 1 bit

i2 is 2 bits

i3 is 3 bits

i8388608 is 2^23 bits

(https://llvm.org/docs/LangRef.html#integer-type)

On the other hand, it doesn’t make a distinction between signed and unsigned integers. Users must take care to use special signed versions of operations where needed.

56. loup-vaillant ◴[] No.41878431{3}[source]
I've heard that Rust wraps around by default?
replies(1): >>41878612 #
57. shiomiru ◴[] No.41878505{3}[source]
It would be "nice" if not for C setting a precedent for these names to have unpredictable sizes. Meaning you have to learn the meaning of every single type for every single language, then remember which language's semantics apply to the code you're reading. (Sure, I can, but why do I have to?)

[ui][0-9]+ (and similar schemes) on the other hand anybody can understand at the first glance.

58. Measter ◴[] No.41878612{4}[source]
Rust has two possible behaviours: panic or wrap. By default debug builds with panic, release builds with wrap. Both behaviours are 100% defined, so the compiler can't do any shenanigans.

There are also helper functions and types for unchecked/checked/wrapping/saturating arithmetic.
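For comparison, those three behaviors (checked, wrapping, saturating) can be approximated in C with the GCC/Clang `__builtin_add_overflow` builtin; a sketch, not a standard API:

```c
#include <limits.h>
#include <stdbool.h>

/* Checked: report overflow instead of producing a wrong value. */
static bool checked_add(int a, int b, int *out) {
    return !__builtin_add_overflow(a, b, out); /* true = no overflow */
}

/* Wrapping: two's complement wraparound via unsigned arithmetic.
 * The final conversion is implementation-defined pre-C23, but wraps
 * on all mainstream compilers. */
static int wrapping_add(int a, int b) {
    return (int)((unsigned)a + (unsigned)b);
}

/* Saturating: clamp to INT_MAX / INT_MIN on overflow. */
static int saturating_add(int a, int b) {
    int r;
    if (!__builtin_add_overflow(a, b, &r))
        return r;
    return a > 0 ? INT_MAX : INT_MIN;
}
```

The difference from Rust is that plain `+` on signed ints in C remains undefined on overflow; nothing forces the programmer to pick one of these explicit variants.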

59. bmacho ◴[] No.41878715[source]
> D made a great leap forward

> and a big chunk of wasted time trying to abstract these away and getting it wrong anyway was saved. Millions of people cried out in relief!

Nah. It is actually pretty bad. Type names with explicit sizes (u8, i32, etc) are way better in every way.

replies(1): >>41884134 #
60. itishappy ◴[] No.41879486{4}[source]
Except defining your types with arbitrary names is still hardware dependent, it's just now something you have to remember or guess.

Can you remember the name for a 128 bit integer in your preferred language off the top of your head? I can intuit it in Rust or Zig (and many others).

In D it's... oh... it's int128.

https://dlang.org/phobos/std_int128.html

https://github.com/dlang/phobos/blob/master/std/int128.d

replies(2): >>41885180 #>>41901115 #
61. renox ◴[] No.41880263{3}[source]
u0?? Why?
replies(3): >>41880597 #>>41885321 #>>41885460 #
62. fourier54 ◴[] No.41880541{5}[source]
I do not know or care what Mr. Shannon would think. What I do know is that the base you choose for the logarithm in the entropy equation has nothing to do with the number of bits you assign to a word on a digital architecture :)
63. xigoi ◴[] No.41880597{4}[source]
To avoid corner cases in auto-generated code?
64. consteval ◴[] No.41881177{5}[source]
I think this is what Rust does; if I remember correctly, it provides APIs on strings to enumerate the characters accurately, meaning not necessarily byte by byte.
replies(1): >>41882960 #
65. eps ◴[] No.41881672[source]
That's a bit self-pat-on-the-back-ish, isn't it, Mr. Bright, the author of D language? :)
replies(1): >>41885161 #
66. kazinator ◴[] No.41882105{9}[source]
> looking for something to complain about

You know, that describes pretty much everyone who has anything to do with Rust.

"My ls utility isn't written in Rust, yikes! Let's fix that!"

"The comments under this C++-related HN submission aren't talking about Rust enough, yikes! Let's fix that!"

I'm obviously pointing to a solution: have a standard module that any Rust program can depend on coming from the language, which has a few sanely named types. Rather than every program defining its own.

67. kazinator ◴[] No.41882127{6}[source]
If you use i32, it looks like you care. Without studying the code, I can't be sure that it could be changed to i16 or i64 without breaking something.

Usually, I just want the widest type that is efficient on the machine, and I don't want it to have an inappropriate name. I don't care about the wasted space, because it only matters in large arrays, and often not even then.

replies(1): >>41883443 #
68. kazinator ◴[] No.41882135{5}[source]
what you call things matters, so yes, it is better.
69. stkdump ◴[] No.41882601{3}[source]
You are aware that D and rust and all the other languages this is being compared to don't even have an ISO standard, right?
replies(1): >>41884683 #
70. stkdump ◴[] No.41882715{3}[source]
There are also people who write COBOL for a living. What you say is not relevant at all for 99.99% of C++ code written today. Also, all compilers can be configured to be non-standard compliant in many different ways, the classic example being -fno-exceptions. Nobody says all kinds of using a standardized language must be standard conformant.
71. speedyjay ◴[] No.41882960{6}[source]
https://pastebin.com/raw/D7p7mRLK

My comment in a pastebin. HN doesn't like unicode.

You need this crate to deal with it in Rust, it's not part of the base libraries:

https://crates.io/crates/unicode-segmentation

The languages that have this kind of feature built-in in the standard library, to my knowledge, are Swift, JavaScript, C# and Java. Swift is the only one, of those four, that treat operating on graphemes as the default. JavaScript requires Intl.Segmenter, C# requires StringInfo, Java requires BreakIterator.

By the way, Python, the language caused so much hurt with their 2.x->3.x transition promising better unicode support in return for this pain couldn't even do this right. There is no concept of graphemes in the standard library. So much for the batteries included bit.

    >>> test = " "
    >>> [char for char in test]
    ['', '\u200d', '', '\u200d', '', '\u200d', '']
    >>> len(test)
    7

In JavaScript REPL (nodejs):

    > let test = " "
    undefined
    > [...new Intl.Segmenter().segment(test)][0].segment;
    ' '
    > [...new Intl.Segmenter().segment(test)].length;
    1

Works as it should.

In python you would need a third party library.

Swift is truly the nicest of programming languages as far as strings are concerned. It just works as it always should have been.

    let test = " "

    for char in test {
        print(char)
    }

    print(test.count)

output:

    1

    [Execution complete with exit code 0]

I, as a non-Apple user, feel quite the Apple envy whenever I think about swift. It's such a nice language, but there's little ecosystem outside of Apple UIs.

But man, no using third party libraries, or working with a wrapper segmenter class or iterator. Just use the base string literals as is. It. Just. Works.

replies(1): >>41884017 #
72. heinrich5991 ◴[] No.41883443{7}[source]
> If you use i32, it looks like you care.

In Rust, that's not really the case. `i32` is the go-to integer type.

`isize` on the other hand would look really weird in code — it's an almost unused integer type. I also prefer having integers that don't depend on the machine I'm running them on.

replies(1): >>41884813 #
73. josephg ◴[] No.41884017{7}[source]
For context, it looks like you’re talking about iterating by grapheme clusters.

I understand how iterating through a string by grapheme clusters is convenient for some applications. But it’s far from obvious to me that doing so should be the language’s default. Dealing with grapheme clusters requires a Unicode database, which needs to live somewhere and needs to be updated continuously as Unicode grows. (Should rust statically link that library into every app that uses it?)

Generally there are 3 ways to iterate a string: by UTF8 bytes (or ucs2 code points like Java/js/c#), by Unicode codepoint or by grapheme clusters. UTF8 encoding comes up all the time when encoding / decoding strings - like, to json or when sending content over http. Codepoints are, in my opinion, the correct approach when doing collaborative editing or patching strings. And grapheme clusters are useful in frontend user interfaces - like when building a terminal.

Of those 3 iteration methods, I’ve personally used UTF8 encoding the most and grapheme clusters the least. Tell me - why should grapheme clusters be the default way to iterate over a string? I can see the argument in Swift, which is a language built for frontend UI. But in a systems language like rust? That seems like a terrible default to me. UTF8 bytes are by far the most useful representation for strings in systems code, since from the pov of systems code, strings are usually just data.

replies(1): >>41885770 #
74. WalterBright ◴[] No.41884134[source]
> Type names with explicit sizes (u8, i32, etc) are way better in every way

Until one realizes that the entire namespace of innn, unnn, fnnn, etc., is reserved.

replies(1): >>41895432 #
75. hahamaster ◴[] No.41884530{8}[source]
You insist that we should all agree on something but you don't specify what.
76. mort96 ◴[] No.41884683{4}[source]
Yeah, so their documentation serves as the authority on how you're supposed to write your code for it to be "correct D" or "correct Rust". The compiler implementors write their compilers against the documentation (and vice versa). That documentation is clear on these things.

In C, the ISO standard is the authority on how you're supposed to write your code for it to be "correct C". The compiler implementors write their compilers against the ISO standard. That standard is not clear on these things.

replies(1): >>41885528 #
77. kazinator ◴[] No.41884813{8}[source]
Some 32-bit thing being the go-to integer type flies in the face of software engineering and CS.

It's going to get expensive on a machine that has only 64 bit integers, which must be accessed on 8 byte aligned boundaries.

replies(1): >>41892216 #
78. throw16180339 ◴[] No.41884816{4}[source]
That's only true on a word-addressed machine; most CPUs are byte-addressed.
79. WalterBright ◴[] No.41885161[source]
Of course!

Over the years I've known some engineers who, as a side project, wrote some great software. Nobody was interested in it. They'd come to me and ask why that was. I suggested writing articles about their project, and being active on the forums; otherwise, who would ever know about it?

They said that was unseemly, and wouldn't do it.

They wound up sad and bitter.

The "build it and they will come" is a stupid Hollywood fraud.

BTW, the income I receive from D is $0. It's my gift. You'll also note that I've suggested many times improvements that could be made to C, copying proven ideas in D. Such as this one:

https://www.digitalmars.com/articles/C-biggest-mistake.html

C++ has already adopted many ideas from D.

replies(1): >>41889590 #
80. WalterBright ◴[] No.41885165{4}[source]
That doesn't really fix it, because of the integral promotion rules.
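A small illustration of what those promotion rules do (assuming the usual 32-bit int): fixed-width uint8_t operands are promoted to int before the addition, so the arithmetic silently happens at int width regardless of the declared type:

```c
#include <stdint.h>

/* With 32-bit int, both uint8_t operands promote to int, so the sum
 * is computed as int and can hold 300; no 8-bit wraparound occurs. */
static int promoted_sum(uint8_t a, uint8_t b) {
    return a + b;              /* computed at int width */
}

/* The 8-bit wrap only appears when converting back down. */
static uint8_t truncated_sum(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);   /* 300 mod 256 = 44 */
}
```

So <cstdint> names pin down storage width, but expressions still compute in promoted types, which is the point being made here.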
81. WalterBright ◴[] No.41885180{5}[source]
It was actually supposed to be `cent` and `ucent`, but we needed a library type to stand in for it for the moment.
82. Cloudef ◴[] No.41885321{4}[source]
To represent 0 without actually storing it in memory
83. whs ◴[] No.41885460{4}[source]
Sounds like zero-sized types in Rust, where they are used as marker types (e.g. this struct owns this lifetime). They can also be used to turn a HashMap into a HashSet by storing a zero-sized value. In Go, a struct member of type [0]func() (an array of functions with exactly 0 elements) is used to make a type uncomparable, as func() cannot be compared.
84. stkdump ◴[] No.41885528{5}[source]
I don't think this is true. The target audience of the ISO standard is the implementers of compilers and other tools around the language. Even the people involved in creating it make that clear by publishing other material like the core guidelines, conference talks, books, online articles, etc., which are targeted to the users of the language.
replies(1): >>41886833 #
85. speedyjay ◴[] No.41885770{8}[source]
This was not meant as criticism of Rust in particular (though, while it shouldn't be the default behavior of strings in a systems language, surely at least an official wrapper implementation should exist?). But high-level languages with a ton of baggage, like Python, should definitely provide the correct way to handle strings. I've seen so much software that is unable to properly handle strings because the language didn't provide the required grapheme handling and the developer was also not aware of the reality of graphemes and Unicode...

You mention terminals, yes, it's one of the area where graphemes are an absolute must, but pretty much any time you are going to do something to text like deciding "I am going to put a linebreak here so that the text doesn't overflow beyond the box, beyond this A4 page I want to print, beyond the browser's window" grapheme handling is involved.

Any time a user is asked to input something too. I've seen most software take the "iterate over characters" approach to real time user input and they break down things like those emojis into individual components whenever you paste something in.

For that matter, backspace doesn't work properly on software you would expect to do better than that. Put the emoji from my pastebin in Microsoft Edge's search/url bar, then hit backspace, see what happens. While the browser displays the emoji correctly, the input field treats it the way Python segments it in my example: you need to press backspace 7 times to delete it. 7 times! Windows Terminal on the other hand has the quirk of showing a lot of extra spaces after the emoji (despite displaying the emoji correctly too) and will also require 11 backspace to delete it.

Notepad handles it correctly: press backspace once, it's deleted, like any normal character.

> Of those 3 iteration methods, I’ve personally used UTF8 encoding the most and grapheme clusters the least.

This doesn't say anything about grapheme clusters being useless. I've cited examples of popular software doing the wrong thing precisely because, like you, they didn't iterate over grapheme clusters. That you never use grapheme iteration might say more about you than it says about grapheme iteration being unneeded.

The dismissiveness over more sane string handling as a standard is not unlike C++ developers pretending that developers are doing the right thing with memory management so we don't need a GC (or rust's ownership paradigm). Nonsense.

replies(1): >>41886399 #
86. josephg ◴[] No.41886399{9}[source]
Those are good examples! Notably, all of them are in reasonably low level, user-facing code.

Your examples are implementing custom text input boxes (Excel, Edge), line breaks while printing, and implementing a terminal application. I agree that in all of those cases, grapheme cluster segmentation is appropriate. But that doesn't make grapheme cluster based iteration "the correct way to handle strings". There's no "correct"! There are at least 3 different ways to iterate through a string, and different applications have different needs.

Good languages should make all of these options easy for programmers to use when they need them. Writing a custom input box? Use grapheme clusters. Writing a text based CRDT? Treat a string as a list of unicode codepoints. Writing an HTTP library? Treat the headers and HTML body as ASCII / opaque bytes. Etc.

I take the criticism that rust makes grapheme iteration harder than the others. But eh, rust has truly excellent crates for that within arms reach. I don't see any advantage in moving grapheme based segmentation into std. Well, maybe it would make it easier to educate idiot developers about this stuff. But there's no real technical reason. Its situationally useful - but less useful than lots of other 3rd party crates like rand, tokio and serde.

> like you, they didn't iterate over grapheme clusters. That you never use grapheme iteration might say more about you than it says about grapheme iteration being unneeded.

It says that in 30+ years of programming, I've never programmed a text input field from scratch. Why would I? That's the job of the operating system. Making my own sounds like a huge waste of time.

87. mort96 ◴[] No.41886833{6}[source]
Core guidelines, conference talks, books, online articles, etc. are not authoritative. If I really want to know if my C code is correct C, I consult the standard. If the standard and an online article disagrees, the article is wrong, definitionally.
replies(1): >>41901771 #
88. eps ◴[] No.41889590{3}[source]
> https://www.digitalmars.com/articles/C-biggest-mistake.html

To be fair, this one lies on the surface for anyone trying to come up with an improved C. It's one of the first things that gets corrected in nearly all C derivatives.

> C++ has already adopted many ideas from D.

Do you have a list?

Especially for the "adopted from D" bit rather than being a evolutionary and logical improvement to the language.

89. pezezin ◴[] No.41892216{9}[source]
And which machine is that? The only computers that I can think of with only 64-bit integers are the old Cray vector supercomputers, and they used word addressing to begin with.
replies(1): >>41901093 #
90. bmacho ◴[] No.41895432{3}[source]
You are right, they come with a cost.
91. euroderf ◴[] No.41896800{3}[source]
So shouldn't a two-state datum be a twit ?
92. kazinator ◴[] No.41901093{10}[source]
It will likely be common in another 25 to 30 years, as 32-bit systems fade into the past.

Therefore, declaring that int32 is the go-to integer type is myopic.

Forty years ago, a program like this could be run on a 16 bit machine (e.g. MS-DOS box):

  #include <stdio.h>

  int main(int argc, char **argv)
  {
    while (argc-- > 0)
      puts(*argv++);
    return 0;
  }
int was 16 bits. That was fine; you would never pass anywhere near 32000 arguments to a program.

Today, that same program does the same thing on a modern machine with a wider int.

Good thing that some int16 had not been declared the go-to integer type.

Rust's integer types are deliberately designed (by people who know better) in order to be appealing to people who know shit all about portability and whose brains cannot handle reasoning about types with a bit of uncertainty.

replies(1): >>41901792 #
93. kazinator ◴[] No.41901115{5}[source]
In C it will almost certainly be int128_t, when standardized. 128 bit support is currently a compiler extension (found in GCC, Clang and others).

A type that provides a 128 bit integer exactly should have 128 in its name.

That is not the argument at all.

The problem is only having types like that, and stipulating nonsense such as the primary "go-to" integer type being int32.

94. stkdump ◴[] No.41901771{7}[source]
Correction: if you want to know if your compiler is correct, you look at the ISO standard. But even as a compiler writer, the ISO standard is not exhaustive. For example the ISO standard doesn't define stuff like include directories, static or dynamic linking, etc.
95. pezezin ◴[] No.41901792{11}[source]
Sorry, but I fail to see where the problem is. Any general-purpose ISA designed in the past 40 years can handle 8/16/32-bit integers just fine regardless of the register size. That includes the 64-bit x86-64 or ARM64 on which you are typing.

There are a few historical architectures that couldn't handle smaller integers, like the first-generation Alpha, but:

    a) those are long dead.

    b) hardware engineers learnt from their mistake and no modern general purpose architecture has repeated it (specialized architectures like DSP and GPU are another story though).

    c) worst case scenario, you can simulate it in software.