288 points by Twirrim | 40 comments

WalterBright ◴[] No.41875254[source]
D made a great leap forward with the following:

1. bytes are 8 bits

2. shorts are 16 bits

3. ints are 32 bits

4. longs are 64 bits

5. arithmetic is 2's complement

6. IEEE floating point

and a big chunk of wasted time trying to abstract these away and getting it wrong anyway was saved. Millions of people cried out in relief!

Oh, and Unicode was the character set. Not EBCDIC, RADIX-50, etc.

replies(6): >>41875486 #>>41875539 #>>41875878 #>>41876632 #>>41878715 #>>41881672 #
1. cogman10 ◴[] No.41875539[source]
Yeah, this is something Java got right as well. It got "unsigned" wrong, but it got standardizing primitive bits correct

byte = 8 bits

short = 16

int = 32

long = 64

float = 32 bit IEEE

double = 64 bit IEEE

replies(3): >>41875597 #>>41875634 #>>41877440 #
2. jltsiren ◴[] No.41875597[source]
I like the Rust approach more: usize/isize are the native integer types, and with every other numeric type, you have to mention the size explicitly.

On the C++ side, I sometimes use an alias that contains the word "short" for 32-bit integers. When I use them, I'm explicitly assuming that the numbers are small enough to fit in a smaller than usual integer type, and that it's critical enough to performance that the assumption is worth making.
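
A rough sketch of that style in Rust (the names here are purely illustrative): widths are spelled out, usize is reserved for indexing, and the "deliberately small" assumption gets its own alias, much like the C++ "short" alias described above.

  // Illustrative only: explicit widths everywhere, plus an alias that
  // documents the "small on purpose" assumption.
  type SmallId = u16;                       // hypothetical "short"-style alias

  fn nth_byte(buf: &[u8], i: usize) -> u8 { // usize: the native index type
      buf[i]
  }

  fn main() {
      let id: SmallId = 42;                 // the type promises this stays small
      let data = [1u8, 2, 3];
      println!("{} {}", id, nth_byte(&data, 2));
  }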

replies(3): >>41875695 #>>41875827 #>>41875847 #
3. josephg ◴[] No.41875634[source]
Yep. Pity about getting chars / string encoding wrong though. (Java chars are 16 bits).

But it’s not alone in that mistake. All the languages invented in that era made the same mistake. (C#, JavaScript, etc).

replies(3): >>41875696 #>>41876204 #>>41876445 #
4. Jerrrrrrry ◴[] No.41875695[source]
hindsight has its advantages
5. paragraft ◴[] No.41875696[source]
What's the right way?
replies(3): >>41875771 #>>41875782 #>>41878247 #
6. WalterBright ◴[] No.41875771{3}[source]
UTF-8

When D was first implemented, circa 2000, it wasn't clear whether UTF-8, UTF-16, or UTF-32 was going to be the winner. So D supported all three.

7. Remnant44 ◴[] No.41875782{3}[source]
utf8, for essentially the reasons mentioned in this manifesto: https://utf8everywhere.org/
replies(1): >>41875952 #
8. jonstewart ◴[] No.41875827[source]
<cstdint> has int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, and uint64_t. I still go back and forth between uint64_t, size_t, and unsigned int, but am defaulting to uint64_t more and more, even if it doesn't matter.
replies(1): >>41885165 #
9. kazinator ◴[] No.41875847[source]
> you have to mention the size explicitly

It's unbelievably ugly. Every piece of code working with any kind of integer screams "I am hardware dependent in some way".

E.g. in a structure representing an automobile, the number of wheels has to be some i8 or i16, which looks ridiculous.

Why would you take a language in which you can write functional pipelines over collections of objects, and make it look like assembler?

replies(3): >>41875953 #>>41876035 #>>41879486 #
10. josephg ◴[] No.41875952{4}[source]
Yep. Notably supported by go, python3, rust and swift. And probably all new programming languages created from here on.
11. pezezin ◴[] No.41875953{3}[source]
If you don't care about the size of your number, just use isize or usize.

If you do care, then isn't it better to specify it explicitly than trying to guess it and having different compilers disagreeing on the size?

replies(2): >>41875968 #>>41877751 #
12. kazinator ◴[] No.41875968{4}[source]
A type called isize is some kind of size. It looks wrong for something that isn't a size.
replies(1): >>41876423 #
13. Spivak ◴[] No.41876035{3}[source]
Is it any better calling it an int where it's assumed to be an i32 and 30 of the bits are wasted?
replies(1): >>41882135 #
14. jeberle ◴[] No.41876204[source]
Java strings are byte[]'s if their contents contain only Latin-1 values (the first 256 codepoints of Unicode). This shipped in Java 9.

JEP 254: Compact Strings

https://openjdk.org/jeps/254

15. pezezin ◴[] No.41876423{5}[source]
Then just define a type alias, which is good practice if you want your types to be more descriptive: https://doc.rust-lang.org/reference/items/type-aliases.html
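
For the automobile example upthread, a minimal sketch of what such an alias could look like (names hypothetical):

  // Hypothetical alias: call sites talk about wheels, not bit widths.
  type WheelCount = u8;

  struct Car {
      wheels: WheelCount,
  }

  fn main() {
      let car = Car { wheels: 4 };
      println!("{}", car.wheels);
  }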
replies(1): >>41876975 #
16. davidgay ◴[] No.41876445[source]
Java was just unlucky; it standardised its strings at the wrong time (when Unicode was 16-bit code points). Java was announced in May 1995, and the following comment from the Unicode history wiki page makes it clear what happened: "In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. ..."
17. kazinator ◴[] No.41876975{6}[source]
Nope! Because then you will also define an alias, and Suzy will define an alias, and Bob will define an alias, ...

We should all agree on int and uint; not some isize nonsense, and not bobint or suzyint.

replies(3): >>41877079 #>>41877189 #>>41884530 #
18. jclulow ◴[] No.41877079{7}[source]
Alas, it's pretty clear that we won't!
19. pezezin ◴[] No.41877189{7}[source]
Ok, it is obvious that you are looking for something to complain about and don't want to find a solution. That is not a productive attitude in life, but whatever floats your boat. Have a good day.
replies(1): >>41882105 #
20. pjmlp ◴[] No.41877440[source]
While I don't agree with not having unsigned as part of the primitive types, and look forward to Valhalla fixing that, it was based on the experience that most devs don't get unsigned arithmetic right.

"For me as a language designer, which I don't really count myself as these days, what "simple" really ended up meaning was could I expect J. Random Developer to hold the spec in his head. That definition says that, for instance, Java isn't -- and in fact a lot of these languages end up with a lot of corner cases, things that nobody really understands. Quiz any C developer about unsigned, and pretty soon you discover that almost no C developers actually understand what goes on with unsigned, what unsigned arithmetic is. Things like that made C complex. The language part of Java is, I think, pretty simple. The libraries you have to look up."

http://www.gotw.ca/publications/c_family_interview.htm
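
The kind of trap that quote is referring to, sketched here in Rust terms (where the wraparound is at least loud in debug builds):

  fn main() {
      let len: u32 = 0;
      // In C, unsigned `len - 1` silently wraps around; in Rust a plain
      // `len - 1` panics in debug builds, and wrapping_sub makes the wrap explicit.
      let last_index = len.wrapping_sub(1);
      println!("{}", last_index); // 4294967295
  }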

21. heinrich5991 ◴[] No.41877751{4}[source]
Actually, if you don't care about the size of your small number, use `i32`. If it's a big number, use `i64`.

`isize`/`usize` should only be used for memory-related quantities — that's why they renamed from `int`/`uint`.
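
In code, that guideline looks roughly like this (a sketch; the field names are made up):

  struct Stats {
      temperature_c: i32, // small, machine-independent value: i32
      total_bytes: u64,   // big count: a wider fixed-width type
  }

  fn main() {
      let s = Stats { temperature_c: -40, total_bytes: 5_000_000_000 };
      let v = vec![10i32, 20, 30];
      let n: usize = v.len(); // memory-related quantity: usize
      println!("{} {} {} {}", s.temperature_c, s.total_bytes, n, v[n - 1]);
  }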

replies(1): >>41882127 #
22. josefx ◴[] No.41878247{3}[source]
I would say anyone mentioning a specific encoding / size just wants to see the world burn. Unicode is variable length on several levels: how many people want to deal with the fact that their text could be non-normalized, or want the ability to cut out individual "char" elements only to get a nonsensical result because the following elements were logically connected to that char? Give developers a decent high-level abstraction and don't force them to deal with the raw bits unless they ask for it.
replies(1): >>41881177 #
23. itishappy ◴[] No.41879486{3}[source]
Except defining your types with arbitrary names is still hardware dependent, it's just now something you have to remember or guess.

Can you remember the name for a 128 bit integer in your preferred language off the top of your head? I can intuit it in Rust or Zig (and many others).

In D it's... oh... it's int128.

https://dlang.org/phobos/std_int128.html

https://github.com/dlang/phobos/blob/master/std/int128.d
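
For comparison, the Rust spelling needs no import at all:

  fn main() {
      let big: i128 = i128::MAX; // built-in 128-bit signed type
      println!("{}", big);       // 170141183460469231731687303715884105727
  }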

replies(2): >>41885180 #>>41901115 #
24. consteval ◴[] No.41881177{4}[source]
I think this is what Rust does; if I remember correctly, it provides APIs on strings to enumerate the characters accurately. That is, not necessarily byte by byte.
replies(1): >>41882960 #
25. kazinator ◴[] No.41882105{8}[source]
> looking for something to complaint about

You know, that describes pretty much everyone who has anything to do with Rust.

"My ls utility isn't written in Rust, yikes! Let's fix that!"

"The comments under this C++-related HN submission aren't talking about Rust enough, yikes! Let's fix that!"

I'm obviously pointing to a solution: have a standard module, coming from the language, that any Rust program can depend on, with a few sanely named types, rather than every program defining its own.
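
A sketch of what such a shared module might look like (entirely hypothetical; these names are not part of Rust or its standard library):

  // Hypothetical "default integers" module, so every project doesn't
  // reinvent its own bobint/suzyint aliases.
  pub mod defint {
      pub type Int = i64;  // assumption: widest type that's cheap on current machines
      pub type UInt = u64;
  }

  fn main() {
      let wheels: defint::Int = 4;
      let doors: defint::UInt = 2;
      println!("{} {}", wheels, doors);
  }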

26. kazinator ◴[] No.41882127{5}[source]
If you use i32, it looks like you care. Without studying the code, I can't be sure that it could be changed to i16 or i64 without breaking something.

Usually, I just want the widest type that is efficient on the machine, and I don't want it to have an inappropriate name. I don't care about the wasted space, because it only matters in large arrays, and often not even then.

replies(1): >>41883443 #
27. kazinator ◴[] No.41882135{4}[source]
what you call things matters, so yes, it is better.
28. speedyjay ◴[] No.41882960{5}[source]
https://pastebin.com/raw/D7p7mRLK

My comment in a pastebin. HN doesn't like unicode.

You need this crate to deal with it in Rust, it's not part of the base libraries:

https://crates.io/crates/unicode-segmentation
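
A minimal sketch with that crate (assuming unicode-segmentation is listed in Cargo.toml); `graphemes(true)` asks for extended grapheme clusters:

  use unicode_segmentation::UnicodeSegmentation;

  fn main() {
      let s = "e\u{301}";                        // 'e' plus a combining acute accent
      println!("{}", s.chars().count());         // 2 code points
      println!("{}", s.graphemes(true).count()); // 1 grapheme cluster
  }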

The languages that have this kind of feature built-in in the standard library, to my knowledge, are Swift, JavaScript, C# and Java. Swift is the only one, of those four, that treat operating on graphemes as the default. JavaScript requires Intl.Segmenter, C# requires StringInfo, Java requires BreakIterator.

By the way, Python, the language that caused so much hurt with its 2.x->3.x transition while promising better unicode support in return for the pain, couldn't even do this right. There is no concept of graphemes in the standard library. So much for the batteries-included bit.

  >>> test = " "
  >>> [char for char in test]
  ['', '\u200d', '', '\u200d', '', '\u200d', '']
  >>> len(test)
  7

In JavaScript REPL (nodejs):

  > let test = " "
  undefined
  > [...new Intl.Segmenter().segment(test)][0].segment;
  ' '
  > [...new Intl.Segmenter().segment(test)].length;
  1

Works as it should.

In python you would need a third party library.

Swift is truly the nicest of programming languages as far as strings are concerned. It just works, the way it always should have.

  let test = " "
  for char in test {
      print(char)
  }
  print(test.count)

Output:

  1
  [Execution complete with exit code 0]

I, as a non-Apple user, feel quite the Apple envy whenever I think about Swift. It's such a nice language, but there's little ecosystem outside of Apple UIs.

But man, no using third party libraries, or working with a wrapper segmenter class or iterator. Just use the base string literals as is. It. Just. Works.

replies(1): >>41884017 #
29. heinrich5991 ◴[] No.41883443{6}[source]
> If you use i32, it looks like you care.

In Rust, that's not really the case. `i32` is the go-to integer type.

`isize` on the other hand would look really weird in code — it's an almost unused integer type. I also prefer having integers that don't depend on the machine I'm running them on.

replies(1): >>41884813 #
30. josephg ◴[] No.41884017{6}[source]
For context, it looks like you’re talking about iterating by grapheme clusters.

I understand how iterating through a string by grapheme clusters is convenient for some applications. But it’s far from obvious to me that doing so should be the language’s default. Dealing with grapheme clusters requires a Unicode database, which needs to live somewhere and needs to be updated continuously as Unicode grows. (Should rust statically link that library into every app that uses it?)

Generally there are 3 ways to iterate a string: by UTF-8 bytes (or UTF-16 code units, as in Java/JS/C#), by Unicode codepoint, or by grapheme clusters. UTF-8 encoding comes up all the time when encoding / decoding strings - like, to JSON or when sending content over HTTP. Codepoints are, in my opinion, the correct approach when doing collaborative editing or patching strings. And grapheme clusters are useful in frontend user interfaces - like when building a terminal.

Of those 3 iteration methods, I’ve personally used UTF8 encoding the most and grapheme clusters the least. Tell me - why should grapheme clusters be the default way to iterate over a string? I can see the argument in Swift, which is a language built for frontend UI. But in a systems language like rust? That seems like a terrible default to me. UTF8 bytes are by far the most useful representation for strings in systems code, since from the pov of systems code, strings are usually just data.
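
Those three views side by side, as a Rust sketch (the grapheme line assumes the unicode-segmentation crate mentioned upthread; the other two are std):

  use unicode_segmentation::UnicodeSegmentation;

  fn main() {
      let s = "cafe\u{301}";                     // "café" written with a combining accent
      println!("{}", s.bytes().count());         // 6 UTF-8 bytes
      println!("{}", s.chars().count());         // 5 Unicode code points
      println!("{}", s.graphemes(true).count()); // 4 grapheme clusters
  }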

replies(1): >>41885770 #
31. hahamaster ◴[] No.41884530{7}[source]
You insist that we should all agree on something but you don't specify what.
32. kazinator ◴[] No.41884813{7}[source]
Some 32 bit thing being the go to integer type flies against software engineering and CS.

It's going to get expensive on a machine that has only 64 bit integers, which must be accessed on 8 byte aligned boundaries.

replies(1): >>41892216 #
33. WalterBright ◴[] No.41885165{3}[source]
That doesn't really fix it, because of the integral promotion rules.
34. WalterBright ◴[] No.41885180{4}[source]
It was actually supposed to be `cent` and `ucent`, but we needed a library type to stand in for it at the moment.
35. speedyjay ◴[] No.41885770{7}[source]
This was not meant as criticism of Rust in particular (though, while it shouldn't be the default behavior of strings in a systems language, surely at least an official implementation of a wrapper should exist?), but high-level languages with a ton of baggage like Python should definitely provide the correct way to handle strings. I've seen so much software unable to properly handle strings because the language didn't provide the required grapheme handling and the developer was also not aware of the reality of graphemes and unicode...

You mention terminals; yes, it's one of the areas where graphemes are an absolute must, but pretty much any time you are going to do something to text, like deciding "I am going to put a linebreak here so that the text doesn't overflow beyond the box, beyond this A4 page I want to print, beyond the browser's window", grapheme handling is involved.

Any time a user is asked to input something too. I've seen most software take the "iterate over characters" approach to real time user input and they break down things like those emojis into individual components whenever you paste something in.

For that matter, backspace doesn't work properly in software you would expect to do better than that. Put the emoji from my pastebin in Microsoft Edge's search/url bar, then hit backspace and see what happens. While the browser displays the emoji correctly, the input field treats it the way Python segments it in my example: you need to press backspace 7 times to delete it. 7 times! Windows Terminal, on the other hand, has the quirk of showing a lot of extra spaces after the emoji (despite displaying the emoji correctly too) and will also require 11 backspaces to delete it.

Notepad handles it correctly: press backspace once, it's deleted, like any normal character.

> Of those 3 iteration methods, I’ve personally used UTF8 encoding the most and grapheme clusters the least.

This doesn't say anything about grapheme clusters being useless. I've cited examples of popular software doing the wrong thing precisely because, like you, they didn't iterate over grapheme clusters. That you never use grapheme iteration might say more about you than it says about grapheme iteration being unneeded.

The dismissiveness over more sane string handling as a standard is not unlike C++ developers pretending that developers are doing the right thing with memory management so we don't need a GC (or Rust's ownership paradigm). Nonsense.

replies(1): >>41886399 #
36. josephg ◴[] No.41886399{8}[source]
Those are good examples! Notably, all of them are in reasonably low level, user-facing code.

Your examples are implementing custom text input boxes (Excel, Edge), line breaks while printing, and implementing a terminal application. I agree that in all of those cases, grapheme cluster segmentation is appropriate. But that doesn't make grapheme cluster based iteration "the correct way to handle strings". There's no "correct"! There are at least 3 different ways to iterate through a string, and different applications have different needs.

Good languages should make all of these options easy for programmers to use when they need them. Writing a custom input box? Use grapheme clusters. Writing a text based CRDT? Treat a string as a list of unicode codepoints. Writing an HTTP library? Treat the headers and HTML body as ASCII / opaque bytes. Etc.

I take the criticism that rust makes grapheme iteration harder than the others. But eh, rust has truly excellent crates for that within arm's reach. I don't see any advantage in moving grapheme-based segmentation into std. Well, maybe it would make it easier to educate idiot developers about this stuff. But there's no real technical reason. It's situationally useful - but less useful than lots of other 3rd party crates like rand, tokio and serde.

> like you, they didn't iterate over grapheme clusters. That you never use grapheme iteration might say more about you than it says about grapheme iteration being unneeded.

It says that in 30+ years of programming, I've never programmed a text input field from scratch. Why would I? That's the job of the operating system. Making my own sounds like a huge waste of time.

37. pezezin ◴[] No.41892216{8}[source]
And which machine is that? The only computers that I can think of with only 64-bit integers are the old Cray vector supercomputers, and they used word addressing to begin with.
replies(1): >>41901093 #
38. kazinator ◴[] No.41901093{9}[source]
It will likely be common in another 25 to 30 years, as 32 bit systems fade into the past.

Therefore, declaring that int32 is the go to integer type is myopic.

Forty years ago, a program like this could be run on a 16 bit machine (e.g. MS-DOS box):

  #include <stdio.h>

  int main(int argc, char **argv)
  {
    while (argc-- > 0)
      puts(*argv++);
    return 0;
  }
int was 16 bits. That was fine; you would never pass anywhere near 32000 arguments to a program.

Today, that same program does the same thing on a modern machine with a wider int.

Good thing that some int16 had not been declared the go to integer type.

Rust's integer types are deliberately designed (by people who know better) in order to be appealing to people who know shit all about portability and whose brains cannot handle reasoning about types with a bit of uncertainty.

replies(1): >>41901792 #
39. kazinator ◴[] No.41901115{4}[source]
In C it will almost certainly be int128_t, when standardized. 128 bit support is currently a compiler extension (found in GCC, Clang and others).

A type that provides a 128 bit integer exactly should have 128 in its name.

That is not the argument at all.

The problem is only having types like that, and stipulating nonsense such as the primary "go to" integer type being int32.

40. pezezin ◴[] No.41901792{10}[source]
Sorry, but I fail to see where the problem is. Any general purpose ISA designed in the past 40 years can handle 8/16/32 bit integers just fine regardless of the register size. That includes the 64-bit x86-64 or ARM64 from which you are typing.

There are a few historical architectures that couldn't handle smaller integers, like the first generation Alpha, but:

    a) those are long dead.

    b) hardware engineers learnt from their mistake and no modern general purpose architecture has repeated it (specialized architectures like DSP and GPU are another story though).

    c) worst case scenario, you can simulate it in software.