For some DSP-ish sort of processors I think it doesn't make sense to have addressability at char level, and the gates to support it would be better spent on better 16 and 32 bit multipliers. ::shrugs::
I feel kind of ambivalent about the standards proposal. We already have fixed size types. If you want/need an exact type, that already exists. The non-fixed size types set minimums and allow platforms to set larger sizes for performance reasons.
Having no fast 8-bit level access is a perfectly reasonable decision for a small DSP.
Might it be better instead to migrate many users of char to (u)int8_t?
The proposed alternative of CHAR_BIT congruent to 0 mod 8 also sounds pretty reasonable, in that it captures the existing non-8-bit char platforms and also the justification for non-8-bit char platforms (that if you're not doing much string processing but instead doing all math processing, the additional hardware for efficient 8 bit access is a total waste).
so delegating such by-now very, very edge cases to non-standard C seems fine, i.e. IMHO it doesn't change much at all in practice
and C/C++ compilers are full of non-standard extensions anyway; it's not as if CHAR_BIT goes away, or that you couldn't, as a non-standard extension, assume it might not be 8
One fun fact I found the other day: ASCII is 7 bits, but when it was used with punch cards there was an 8th bit to make sure you didn't punch the wrong number of holes. https://rabbit.eng.miami.edu/info/ascii.html
Yes, indexing strings of 6-bit FIELDATA characters was a huge headache. UNIVAC had the unfortunate problem of having to settle on a character code in the early 1960s, before ASCII was standardized. At the time, a military 6-bit character set looked like the next big thing. It was better than IBM's code, which mapped to punch card holes and the letters weren't all in one block.
[1] https://www.unisys.com/siteassets/collateral/info-sheets/inf...
Honest question, haven't followed closely. rand() is broken, I'm told unfixable, and last I heard still wasn't deprecated.
Is this proposal a test? "Can we even drop support for a solution to a problem literally nobody has?"
https://man7.org/linux/man-pages/man3/fgetc.3.html
fgetc(3) and its companions always return character-by-character input as an int, and the reason is that EOF is represented as -1. An unsigned char is unable to represent EOF. If you store the return value in the wrong type (e.g. a plain char), you'll never reliably detect this condition.
However, if you don't receive an EOF, then it should be perfectly fine to cast the value to unsigned char without loss of precision.
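A minimal sketch of the usual pattern (nothing beyond standard C I/O is assumed):

    #include <cstdio>

    int main() {
        int c;  // must be int, not char: EOF (-1) has to stay distinguishable from valid bytes
        while ((c = std::fgetc(stdin)) != EOF) {
            unsigned char byte = static_cast<unsigned char>(c);  // safe once EOF is ruled out
            std::fputc(byte, stdout);
        }
        return 0;
    }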
Given that Wikipedia says UNIVAC was discontinued in 1986 I’m pretty sure the answer is no and no!
Or are you suggesting increasing the size of a byte until it's the same size as a word, and merging both concepts?
Parity is for paper tape, not punched cards. Paper tape parity was never standardized. Nor was parity for 8-bit ASCII communications. Which is why there were devices with settings for EVEN, ODD, ZERO, and ONE for the 8th bit.
Punched cards have their very own encodings, only of historical interest.
At the same time, a `byte` is already an "alias" for `char` since C++17 anyway[1].
std::cout << (int8_t)32 << std::endl; //should print 32 dang it
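On typical implementations int8_t is a typedef for signed char, so operator<< picks the character overload; a small sketch of the usual workarounds:

    #include <cstdint>
    #include <iostream>

    int main() {
        std::int8_t v = 32;
        std::cout << v << '\n';                     // prints a space (character 32), not "32"
        std::cout << static_cast<int>(v) << '\n';   // prints 32
        std::cout << +v << '\n';                    // unary plus promotes to int, also prints 32
    }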
https://thephd.dev/conformance-should-mean-something-fputc-a...
Me? I just dabble with documenting an unimplemented "50% more bits per byte than the competition!" 12-bit fantasy console of my own invention - replete with inventions such as "UTF-12" - for shits and giggles.
This is so old it predates ANSI C; it's in K&R C. I've seen copies of it on various academic sites over the years, but it's now obsolete enough to have finally scrolled off Google.
I think we can dispense with non 8-bit bytes at this point.
To my naive eye, it seems like moving to 10 bits per byte would be both logical and make learning the trade just a little bit easier?
The land of x86 goes to great pains to eliminate the concept of a word at a silicon cost.
https://github.com/torvalds/linux/blob/master/include/math-e...
Its OS, OS 2200, does have a C compiler. Not sure if there ever was a C++ compiler; if there once was, it is no longer around. But that C compiler is not being kept up to date with the latest standards, it only officially supports C89/C90 - this is a deeply legacy system, most application software is written in COBOL and the OS itself is mainly written in assembler and a proprietary Pascal-like language called “PLUS”. They might add some features from newer standards if particularly valuable, but formal compliance with C99/C11/C17/C23/etc is not a goal.
The OS does contain components written in C++, most notably the HotSpot JVM. However, from what I understand, the JVM actually runs in x86-64 Linux processes on the host system, outside of the emulated mainframe environment, but the mainframe emulator is integrated with those Linux processes so they can access mainframe files/data/apps.
Which is the real reason why 8-bits should be adopted as the standard byte size.
I didn't even realize that the byte was defined as anything other than 8-bits until recently. I have known, for decades, that there were non-8-bit character encodings (including ASCII) and word sizes were all over the map (including some where word size % 8 != 0). Enough thought about that last point should have helped me realize that there were machines where the byte was not 8-bits, yet the rarity of encountering such systems left me with the incorrect notion that a byte was defined as 8-bits.
Now if someone with enough background to figure it out doesn't figure it out, how can someone without that background figure it out? Someone who has only experienced systems with 8-bit bytes. Someone who has only read books that make the explicit assumption of 8-bit bytes (which virtually every book does). Anything they write has the potential of breaking on systems with a different byte size. The idea of writing portable code because the compiler itself is "standards compliant" breaks down. You probably should modify the standard to ensure the code remains portable, either by forcing the compiler for non-8-bit systems to handle the exceptions, or by simply admitting that the compiler does not produce portable code for non-8-bit systems.
1. bytes are 8 bits
2. shorts are 16 bits
3. ints are 32 bits
4. longs are 64 bits
5. arithmetic is 2's complement
6. IEEE floating point
and a big chunk of wasted time trying to abstract these away and getting it wrong anyway was saved. Millions of people cried out in relief!
Oh, and Unicode was the character set. Not EBCDIC, RADIX-50, etc.
On the other hand, if computing settled on a three-valued logic (e.g. 0/1/«something» where «something» has been proposed as -1, «undefined»/«unknown»/«undecided» or a «shade of grey»), we would have had 9 bit bytes (a power of three).
10 was tried numerous times at the dawn of computing and… it was found too unwieldy in the circuit design.
I found this snapshot of it, though it's not on the real Dilbert site: https://www.reddit.com/r/linux/comments/73in9/computer_holy_...
https://lscs-software.com/LsCs-Manifesto.html
https://news.ycombinator.com/item?id=41614949
Edit: Fixed typo pointed out by child.
I've only programmed in high level programming languages in 8-bit-byte machines. I can't understand what you mean by this sentence.
So in a 36-bit CPU a word is 36 bits. And a byte isn't a word. But what is a word and how does it differ from a byte?
If you asked me what 32-bit/64-bit means in a CPU, I'd say it's how large memory addresses can be. Is that true for 36-bit CPUs or does it mean something else? If it's something else, then that means 64-bit isn't the "word" of a 64-bit CPU, so what would the word be?
This is all very confusing.
Is this true? 4 ternary bits give you really convenient base 12 which has a lot of desirable properties for things like multiplication and fixed point. Though I have no idea what ternary building blocks would look like so it’s hard to visualize potential hardware.
I do believe you meant to write "cardinal sin," good sir. Unless Qt has not only become sentient but also corporeal when I wasn't looking and gotten close and personal with the C++ standard...
On the C++ side, I sometimes use an alias that contains the word "short" for 32-bit integers. When I use them, I'm explicitly assuming that the numbers are small enough to fit in a smaller than usual integer type, and that it's critical enough to performance that the assumption is worth making.
Another part of it is the fact that it's a lot easier to represent stuff with hex if the bytes line up.
I can represent "255" with "0xFF" which fits nice and neat in 1 byte. However, now if a byte is 10bits that hex no longer really works. You have 1024 values to represent. The max value would be 0x3FF which just looks funky.
Coming up with an alphanumeric system to represent 2^10 cleanly just ends up weird and unintuitive.
A quarter nybble.
I have certainly heard an argument that ternary logic would have been a better choice, if it won over, but it is history now, and we are left with the vestiges of the ternary logic in SQL (NULL values which are semantically «no value» / «undefined» values).
Don’t break perfection!! Just accumulate more perfection.
What we need is a new C++ symbol that reliably references eight bit bytes, without breaking compatibility, or wasting annnnnny opportunity to expand the kitchen sink once again.
I propose “unsigned byte8” and (2’s complement) “signed byte8”. And “byte8” with undefined sign behavior because we can always use some more spice.
“unsigned decimal byte8” and “signed decimal byte8”, would limit legal values to 0 to 10 and -10 to +10.
For the damn accountants.
“unsigned centimal byte8” and “signed centimal byte8”, would limit legal values to 0 to 100 and -100 to +100.
For the damn accountants who care about the cost of bytes.
Also for a statistically almost valid, good enough for your customer’s alpha, data type for “age” fields in databases.
And “float byte8” obviously.
Writing geophysical | military signal and image processing applications on custom DSP clusters is surprisingly straightforward and doesn't need C++.
It's a RISC architecture optimised for DSP | FFT | Array processing with the basic simplification that char text is for hosts, integers and floats are at least 32 bit and 32 bits (or 64) is the smallest addressable unit.
Fantastic architecture to work with for numerics and deep computational pipelines; once "primed" you push in raw acquisition samples in chunks every clock cycle and extract processed moving-window data chunks every clock cycle.
A single ASM instruction in a cycle can accumulate totals from a vector multiplication and modulo-update the indexes on three vectors (two inputs and one output).
Not your mama's brainfuck.
When D was first implemented, circa 2000, it wasn't clear whether UTF-8, UTF-16, or UTF-32 was going to be the winner. So D supported all three.
9-bit bytes are pretty common in block RAM though, with the extra bit being used either for ECC or for user storage.
- CHAR_BIT cannot go away; reams of code references it.
- You still need the constant 8. It's better if it has a name.
- Neither the C nor C++ standard will be simplified if CHAR_BIT is declared to be 8. Only a few passages will change. Just, certain possible implementations will be rendered nonconforming.
- There are specialized platforms with C compilers, such as DSP chips, that are not byte addressable machines. They are in current use; they are not museum pieces.
It's unbelievably ugly. Every piece of code working with any kind of integer screams "I am hardware dependent in some way".
E.g. in a structure representing an automobile, the number of wheels has to be some i8 or i16, which looks ridiculous.
Why would you take a language in which you can write functional pipelines over collections of objects, and make it look like assembler.
I wouldn't want to lose the Linux humor tho!
https://theminimumyouneedtoknow.com/
https://lscs-software.com/LsCs-Roadmap.html
"Many of us got our first exposure to Qt on OS/2 in or around 1987."
Uh huh.
> someone always has a use case;
No he doesn't. He's just unhinged. The machines this dude bitches about don't even have a modern C++ compiler nor do they support any kind of display system relevant to Qt. They're never going to be a target for Qt. Further irony is this dude proudly proclaims this fork will support nothing but Wayland and Vulkan on Linux.
"the smaller processors like those in sensors, are 1's complement for a reason."
The "reason" is never explained.
"Why? Because nothing is faster when it comes to straight addition and subtraction of financial values in scaled integers. (Possibly packed decimal too, but uncertain on that.)"
Is this a justification for using Unisys mainframes, or is the implication that they are fastest because of 1's complement? (not that this is even close to being true - as any dinosaurs are decommissioned they're fucking replaced with capable but not TOL commodity Xeon CPU based hardware running emulation, I don't think Unisys makes any non x86 hardware anymore) Anyway, may need to refresh that CS education.
There's some rambling about the justification being data conversion, but what serialization protocols mandate 1's complement anyway, and if those exist someone has already implemented 2's complement supporting libraries for the past 50 years since that has been the overwhelming status quo. We somehow manage to deal with endianness and decimal conversions as well.
"Passing 2's complement data to backend systems or front end sensors expecting 1's complement causes catastrophes."
99.999% of every system MIPS, ARM, x86, Power, etc for the last 40 years uses 2's complement, so this has been the normal state of the world since forever.
Also the enterpriseist of languages, Java somehow has survived mandating 2's complement.
This is all very unhinged.
I'm not holding my breath to see this ancient Qt fork fully converted to "modified" Barr spec but that will be a hoot.
1. u8 and i8 are 8 bits.
2. u16 and i16 are 16 bits.
3. u32 and i32 are 32 bits.
4. u64 and i64 are 64 bits.
5. Arithmetic is an explicit choice. '+' overflowing is illegal behavior (will crash in Debug and ReleaseSafe), '+%' is 2's complement wrapping, and '+|' is saturating arithmetic. Edit: forgot to mention @addWithOverflow(), which provides a tuple of the original type and a u1; there's also std.math.add(), which returns an error on overflow. (A rough C/C++ analogue is sketched below.)
6. f16, f32, f64, f80, and f128 are the respective bit-length IEEE floating point types.
The question of the length of a byte doesn't even matter. If someone wants to compile to machine whose bytes are 12 bits, just use u12 and i12.
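For comparison, roughly equivalent explicit-overflow handling is available to C/C++ code too; a hedged sketch assuming GCC/Clang's __builtin_add_overflow:

    #include <cstdint>
    #include <iostream>
    #include <limits>

    int main() {
        std::int32_t a = std::numeric_limits<std::int32_t>::max(), b = 1, sum;
        // Checked addition: __builtin_add_overflow returns true if the result doesn't fit.
        if (__builtin_add_overflow(a, b, &sum))
            std::cout << "overflow detected\n";
        // Wrapping addition: do the math in unsigned, where wraparound is well defined,
        // then convert back (a two's complement result, guaranteed since C++20).
        std::int32_t wrapped = static_cast<std::int32_t>(
            static_cast<std::uint32_t>(a) + static_cast<std::uint32_t>(b));
        std::cout << "wrapped = " << wrapped << '\n';  // INT32_MIN
    }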
* p2809 Trivial infinite loops are not Undefined Behavior
* p1152 Deprecating volatile
* p0907 Signed Integers are Two's Complement
* p2723 Zero-initialize objects of automatic storage duration
* p2186 Removing Garbage Collection Support
So it is possible to change things!
#define SCHAR_MIN -127
#define SCHAR_MAX 128
Is this two typos or am I missing the joke?

Lots of CISC architectures allow memory accesses in various units even if they call general-purpose-register-sized quantities "word".
Iirc the C standard specifies that all memory can be accessed via char*.
> A byte is 8 bits, which is at least large enough to contain the ordinary literal encoding of any element of the basic character set literal character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is bits in a byte.
But instead of the "and is composed" ending, it feels like you'd change the intro to say that "A byte is 8 contiguous bits, which is".
We can also remove the "at least", since that was there to imply a requirement on the number of bits being large enough for UTF-8.
Personally, I'd make a "A byte is 8 contiguous bits." a standalone sentence. Then explain as follow up that "A byte is large enough to contain...".
> It's a desktop on a Linux distro meant to create devices to better/save lives.
If you are creating life critical medical devices you should not be using linux.
ARM is similar, ARM processors define a word as 32-bits, even on 64-bit ARM processors, but they are also byte addressable.
As best as I can tell, it seems like a word is whatever the size of the arithmetic or general purpose register is at the time that the processor was introduced, and even if later a new processor is introduced with larger registers, for backwards compatibility the size of a word remains the same.
The latter value is clearly less common than 0 and 1, but how much less? I don't know, but we have to conclude that the true size of a bit is probably something more like 1.00000000000000001 bits rather than 1 bit.
And I was tasked with reading a tape with binary data in 8-bit format. Hilarity ensued.
On the other hand, it seems like some sort of concession to the idea that you are entitled to some sort of just world where things make sense and can be reasoned out given your own personal, deeply oversimplified model of what's going on inside the computer. This approach can take you pretty far, but it's a garden path that goes nowhere--eventually you must admit that you know nothing and the best you can do is a formal argument that conditional on the documentation being correct you have constructed a correct program.
This is a huge intellectual leap, and in my personal experience the further you go without being forced to acknowledge it the harder it will be to make the jump.
That said, there seems to be an increasing popularity of physical electronics projects among the novice set these days... hopefully read the damn spec sheet will become the new read the documentation
https://en.m.wikipedia.org/wiki/Information_theory
https://en.m.wikipedia.org/wiki/Entropy_(information_theory)
JEP 254: Compact Strings
You'd have to have some way of testing all previous code in the compilation (pardon my ignorance if this is somehow obvious) to make sure this macro isn't already used. You also risk forking the language with any kind of breaking changes like this. How difficult it would be to test if a previous code base uses a charbit macro and whether it can be updated to the new compiler sounds non obvious. What libraries would then be considered breaking? Would interacting with other compiled code (possibly stupid question) that used charbit also cause problems? Just off the top of my head.
I agree that it sounds nonintuitive. I'd suggest creating a conversion tool first and demonstrating it was safe to use even in extreme cases and then make the conversion. But that's just my unenlightened opinion.
This is because when you insert a value into the map, it has 80 bit precision, and that number of bits is used when comparing the value you are inserting during the traversal of the tree.
After the float is stored in the tree, it's clamped to 32 bits.
This can cause the element to be inserted into the wrong order in the tree, and this breaks the assumptions of the algorithm, leading to the crash or infinite loop.
Compiling for 64 bits or explicitly disabling x87 float math makes this problem go away.
I have actually had this bug in production and it was very hard to track down.
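A hedged sketch of the kind of code that can trigger this, assuming a 32-bit x86 build using x87 math (e.g. gcc -m32 without -mfpmath=sse); whether it actually misbehaves depends on compiler version and flags:

    #include <cmath>
    #include <map>

    int main() {
        std::map<float, int> m;
        for (int i = 0; i < 100000; ++i) {
            // The freshly computed key may be held in an 80-bit x87 register while
            // insert() compares it against keys already stored in the tree, which have
            // been rounded to 32 bits. The same logical value can then compare
            // inconsistently on different paths, violating the strict weak ordering the
            // red-black tree assumes (hence the corruption or infinite loops).
            float key = std::sin(i * 0.001f) / 3.0f;
            m[key] += 1;
        }
        return 0;
    }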
s/_BigInt/_BitInt/
Hmm, what do you mean?
Like, no you should not adopt some buggy or untested distro, instead choose each component carefully and disable all un-needed updates...
But that beats working on an unstable, randomly and capriciously deprecated and broken OS (windows/mac over the years), that you can perform zero practical review, casual or otherwise, legal or otherwise, and that insists upon updating and further breaking itself at regular intervals...
Unless you mean to talk maybe about some microkernel with a very simple graphical UI, which, sure yes, much less complexity...
One of the most surprising things about floating-point is that very little is actually IEEE 754; most things are merely IEEE 754-ish, and there's a long tail of fiddly things that are different that make it only -ish.
Your comment must be sarcasm or satire, surely.
Perhaps a way to fill some time would be gradually revealing parts of a convoluted Venn diagram or mind-map of the fiddling things. (That is, assuming there's any sane categorization.)
The question is "does any code need to care about CHAR_BIT > 8 platforms" and the answer of course is no, its just should we perform the occult standards ceremony to acknowledge this, or continue to ritually pretend to standards compliant 16 bit DSPs are a thing.
[1] I'm sure artifacts of 7, 9, 16, 32, etc[2] bit code & platforms exist, but they aren't targeting or implementing anything resembling modern ISO C++ and can continue to exist without anyone's permission.
[2] if we're going for unconventional bitness my money's on 53, which at least has practical uses in 2024
Buuuut then the Unisys thing. Like you say they dont make processors (for the market) and themselves just use Intel now...and even if they make some special secret processors I don't think the IRS is using top secret processors to crunch our taxes, even in the hundreds of millions of record realm with average hundreds of items per record, modern CPUs run at billions of ops per second...so I suspect we are talking some tens of seconds, and some modest amount of RAM (for a server).
The one point he does have is interoperability: if a lot of (especially medical) equipment uses 1s complement because it's cheaper (in terms of silicon), using "modern" tools is likely to be a bad fit.
Compatibility is King, and where medical devices are concerned I would be inclined to agree that not changing things is better than "upgrading" - it's all well and good to have two systems until a crisis hits and some doctor plugs the wrong sensor into the wrong device...
It'll be interesting if the "-ish" bits are still "-ish" with the current standard.
In fact, Rust has the Eq trait specifically to keep f32/f64s out of hash tables, because NaN breaks them really bad.
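The underlying issue is IEEE NaN comparison semantics, which are the same in C++ as in Rust; a tiny illustration:

    #include <iostream>
    #include <limits>

    int main() {
        double nan = std::numeric_limits<double>::quiet_NaN();
        std::cout << std::boolalpha
                  << (nan <  1.0) << ' '    // false
                  << (nan >  1.0) << ' '    // false
                  << (nan == nan) << '\n';  // false: NaN is not even equal to itself,
                                            // which is what breaks hashing and ordering
    }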
1. char 8 bit
2. short 16 bit
3. int 32 bit
4. long long 64 bit
5. arithmetic is 2s complement
6. IEEE floating point (float is 32, double is 64 bit)
Along with other stuff like little endian, etc.
Some people just mistakenly think they can't rely on such stuff, because it isn't in the standard. But they forget that having an ISO standard comes on top of what most other languages have, which rely solely on the documentation.
Powers of two are more natural in a binary computer. Then add the fact that 8 is the smallest power of two that allows you to fit the Latin alphabet plus most common symbols as a character encoding.
We're all about building towers of abstractions. It does make sense to aim for designs that are natural for humans when you're closer to the top of the stack. Bytes are fairly low down the stack, so it makes more sense for them to be natural to computers.
No it’s completely loony. Note that even the devices he claims to work with for medical devices are off the shelf ARM processors (ie what everybody uses). No commonly used commodity processors for embedded have used 1’s complement in the last 50 years.
> equipment uses 1s complement because it's cheaper (in terms of silicon)
Yeah that makes no sense. If you need an ALU at all, 2s complement requires no more silicon and is simpler to work with. That’s why it was recommended by von Neumann in 1945. 1s complement is only simpler if you don’t have an adder of any kind, which is then not a CPU, certainly not a C/C++ target.
Even the shittiest low end PIC microcontroller from the 70s uses 2s complement.
It is possible that a sensing device with no microprocessor or computation of any kind (i.e. a bare ADC) may generate values in sign-mag or 1's complement (and it's usually the former, which again shows how stupid this is) - but this has nothing to do with the C implementation of whatever host connects to it, which is certainly 2's. I guarantee you no embedded processor this dude ever worked with in the medical industry used anything other than 2's complement - you would have always needed to do a conversion.
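For illustration, a hedged sketch of what such a host-side conversion looks like, assuming a hypothetical 16-bit sensor word; the function names are made up:

    #include <cstdint>

    // Convert a raw 16-bit sensor word to the host's native (two's complement) integer.
    std::int16_t from_sign_magnitude(std::uint16_t raw) {
        std::int16_t mag = static_cast<std::int16_t>(raw & 0x7FFF);
        return (raw & 0x8000) ? static_cast<std::int16_t>(-mag) : mag;
    }

    std::int16_t from_ones_complement(std::uint16_t raw) {
        if (raw & 0x8000)  // negative: the magnitude is the bitwise inverse of the low 15 bits
            return static_cast<std::int16_t>(-static_cast<std::int16_t>(~raw & 0x7FFF));
        return static_cast<std::int16_t>(raw);
    }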
This truly is one of the most absurd issues to get wrapped up on. It might be dementia, sadly.
https://github.com/RolandHughes/ls-cs/blob/master/README.md
Maintaining a fork of a large C++ framework (well of another obscure fork) where the top most selling point is a fixation on avoiding C++20 all because they dropped support for integer representations that have no extant hardware with recent C++ compilers - and any theoretical hardware wouldn’t run this framework anyway, that doesn’t seem well attached to reality.
I don't think there is a C++ compiler for the PDP-10. One of the C compilers does have a 36-bit char type.
I kinda like the idea of 6-bit byte retro-microcomputer (resp. 24-bit, that would be a word). Because microcomputers typically deal with small number of objects (and prefer arrays to pointers), it would save memory.
VGA was 6-bit per color, you can have a readable alphabet in 6x4 bit matrix, you can stuff basic LISP or Forth language into 6-bit alphabet, and the original System/360 only had 24-bit addresses.
What's there not to love? 12MiB of memory, with independently addressable 6-bits, should be enough for anyone. And if it's not enough, you can naturally extend FAT-12 to FAT-24 for external storage. Or you can use 48-bit pointers, which are pretty much as useful as 64-bit pointers.
Sure, it is selfish to expect C and C++ do the dirty work, while more modern languages get away skimping on it. On the other hand I think especially C++ is doing itself a disservice trying to become a kind of half-baked Rust.
Wow, 11 years for such a banal, minimal code trigger. I really don't quite understand how we can have this scale of infrastructure in operation when this kind of infrastructure software bug exists. This is not just gcc. That the whole house of cards works at all is an achievement by itself, and also a reminder that good enough is all that is needed.
I also highly doubt you could get 1 in 1000 developers to successfully debug this issue were it happening in the wild, and a much smaller fraction to actually fix it.
Detecting and filtering out NaNs is both trivial and reliable as long as nobody instructs the compiler to break basic floating point operations (so no ffast-math). Dealing with a compiler that randomly changes the values of your variables is much harder.
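A minimal sketch of such filtering (std::isnan is the reliable check, provided -ffast-math hasn't disabled NaN handling):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Drop NaNs before using the values as keys in ordered or hashed containers.
    void drop_nans(std::vector<double>& v) {
        v.erase(std::remove_if(v.begin(), v.end(),
                               [](double x) { return std::isnan(x); }),
                v.end());
    }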
Such machines are not byte-addressable. They have partial word accesses, instead.
Machines have been built with 4, 8, 12, 16, 24, 32, 36, 48, 56, 60, and 64 bit word lengths.
Many "scientific" computers were built with 36-bit words and a 36-bit arithmetic unit. This started with the IBM 701 (1952), although an FPU came later, and continued through the IBM 7094. The byte-oriented IBM System/360 machines replaced those, and made byte-addressable architecture the standard. UNIVAC followed along with the UNIVAC 1103 (1953), which continued through the 1103A and 1105 vacuum tube machines, the later transistorized machines 1107 and 1108, and well into the 21st century. Unisys will still sell you a 36-bit machine, although it's really an emulator running on Intel Xeon CPUs.
The main argument for 36 bits was that 36-bit floats have four more bits of precision, or one more decimal digit, than 32-bit floats. 1 bit of sign, 8 bits of exponent and 27 bits of mantissa gives you a full 8 decimal digits of precision, while standard 32-bit floats with a 1-bit sign, 7-bit exponent and a 24-bit mantissa only give you 7 full decimal digits. Double precision floating point came years later; it takes 4x as much hardware.
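A quick sanity check of those digit counts (decimal digits ≈ mantissa bits × log10 2), as a small sketch:

    #include <cmath>
    #include <cstdio>

    int main() {
        std::printf("27-bit mantissa: %.2f decimal digits\n", 27 * std::log10(2.0));  // ~8.13
        std::printf("24-bit mantissa: %.2f decimal digits\n", 24 * std::log10(2.0));  // ~7.22
    }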
I'm mostly discussing from the sake of it because I don't really mind as a C/C++ user. We could just use "octet" and call it a day, but now there is an ambiguity with the past definition and potential in the future definition (in which case I hope the term "byte" will just disappear).
"For me as a language designer, which I don't really count myself as these days, what "simple" really ended up meaning was could I expect J. Random Developer to hold the spec in his head. That definition says that, for instance, Java isn't -- and in fact a lot of these languages end up with a lot of corner cases, things that nobody really understands. Quiz any C developer about unsigned, and pretty soon you discover that almost no C developers actually understand what goes on with unsigned, what unsigned arithmetic is. Things like that made C complex. The language part of Java is, I think, pretty simple. The libraries you have to look up."
8 bit tape? Probably the format the hardware worked in... not actually sure, I haven't used real tapes, but it's plausible.
36 bit per word computer? Sometimes 0..~4Billion isn't enough. 4 more bits would get someone to 64 billion, or +/- 32 billion.
As it turns out, my guess was ALMOST correct
https://en.wikipedia.org/wiki/36-bit_computing
Paraphrasing: legacy keying systems were based on records of up to 10 printed decimal digits of accuracy for input. 35 bits would be required to match the +/- input, but 36 works better as a machine word and for operations on 6 x 6-bit (yuck?) characters; or some 'smaller' machines used a 36-bit large word and 12- or 18-bit small words. Why the yuck? That's only 64 characters total, so these systems only supported UPPERCASE ALWAYS, numeric digits, and some other characters.
Exception specifications have been removed, although some want them back for value type exceptions, if that ever happens.
auto_ptr has been removed, given its broken design.
Now on the simplifying side, not really, as the old ways still need to be understood.
https://en.wikipedia.org/wiki/Unisys_OS_2200_programming_lan...
As consolation, big-endian will likely live on forever as the network byte order.
Don't worry, most people complaining about C++ complexity don't.
`isize`/`usize` should only be used for memory-related quantities — that's why they were renamed from `int`/`uint`.
And this has happened again and again on this enormous codebase that started before it was even called 'C++'.
However, Ord is an ordinary safe trait. So while we're claiming to be totally ordered, we're allowed to be lying; the resulting type is crap, but it's not unsafe. So, as with sorting, the algorithms inside these container types, unlike in C or C++, actually must not blow up horribly when we were lying (or, as is common in real software, simply clumsy and mistaken).
The infinite loop would be legal (but I haven't seen it) because that's not unsafe, but if we end up with Undefined Behaviour that's a fault in the container type.
This is another place where in theory C++ gives itself license to deliver better performance at the cost of reduced safety but the reality in existing software is that you get no safety but also worse performance. The popular C++ compilers are drifting towards tacit acceptance that Rust made the right choice here and so as a QoI decision they should ship the Rust-style algorithms.
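A hedged C++ sketch of the failure mode being described: a comparator that violates strict weak ordering is undefined behaviour for std::sort, whereas a lying Ord in Rust may give a garbage order but must stay memory-safe:

    #include <algorithm>
    #include <vector>

    int main() {
        std::vector<int> v(64, 42);  // many equal elements
        // Using <= breaks strict weak ordering: equal elements compare "less" than each other.
        // In C++ this is undefined behaviour; real std::sort implementations have been
        // observed to read out of bounds or loop forever on comparators like this.
        std::sort(v.begin(), v.end(), [](int a, int b) { return a <= b; });
        return 0;
    }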
Some hardware may raise an exception if you attempt to retrieve a value at an address that is not a (greater than 1) multiple of a byte, but that has no bearing on the definition of a byte.
It had to be an unaligned memmove and using a 32 bit binary on a 64 bit system, but still! memmove!
And this bug existed for years.
This caused our database replicas to crash every week or so for a long time.
Practically speaking, the language is the way it is, and has succeeded so well for so long, because it meets the requirements of its application.
Like making over "auto". Or adding "start_lifetime_as" and declaring most existing code that uses mmap non-conformant.
But then someone asks for a thing that would require to stop pretending that C++ can be parsed top down in a single pass. Immediate rejection!
Basically, loading a 0-bit byte from memory gets you a 0. Depositing a 0-bit byte will not alter memory, but may do an ineffective read-modify-write cycle. Incrementing a 0-bit byte pointer will leave it unchanged.
i1 is 1 bit
i2 is 2 bits
i3 is 3 bits
…
i8388608 is 2^23 bits
(https://llvm.org/docs/LangRef.html#integer-type)
On the other hand, it doesn’t make a distinction between signed and unsigned integers. Users must take care to use special signed versions of operations where needed.
sizeof(uint32_t) == 1
[ui][0-9]+ (and similar schemes) on the other hand anybody can understand at the first glance.
There are also helper functions and types for unchecked/checked/wrapping/saturating arithmetic.
> and a big chunk of wasted time trying to abstract these away and getting it wrong anyway was saved. Millions of people cried out in relief!
Nah. It is actually pretty bad. Type names with explicit sizes (u8, i32, etc) are way better in every way.
The next novel chip design is going to have a C++ compiler too. No, we don't yet know what its architecture will be.
> Implementations should support the extended format corresponding to the widest basic format supported.
_if_ it exists, it is required to have at least as many bits as the x87 long double type.¹
The language around extended formats changed in the 2008 standard, but the meaning didn't:
> Language standards or implementations should support an extended precision format that extends the widest basic format that is supported in that radix.
That language is still present in the 2019 standard. So nothing has ever really changed here. Double-extended is recommended, but not required. If it exists, the significand and exponent must be at least as large as those of the Intel 80-bit format, but they may also be larger.
---
¹ At the beginning of the standardization process, Kahan and Intel engineers still hoped that the x87 format would gradually expand in subsequent CPU generations until it became what is now the standard 128b quad format; they didn't understand the inertia of binary compatibility yet. So the text only set out minimum precision and exponent range. By the time the standard was published in 1985, it was understood internally that they would never change the type, but by then other companies had introduced different extended-precision types (e.g. the 96-bit type in Apple's SANE), so it was never pinned down.
Can you remember the name for a 128 bit integer in your preferred language off the top of your head? I can intuit it in Rust or Zig (and many others).
In D it's... oh... it's int128.
As I read it, this link may be describing a hypothetical rather than real compiler. But I did not parse that on initial scan of the Google result.
After digging, I think this is the kind of thing I'm referring to:
https://people.eecs.berkeley.edu/~wkahan/JAVAhurt.pdf
https://news.ycombinator.com/item?id=37028310
I've seen other course notes, I think also from Kahan, extolling 80-bit hardware.
Personally I am starting to think that, if I'm really thinking about precision, I had maybe better just use fixed point, but this again is just a "lean" that could prove wrong over time. Somehow we use floats everywhere and it seems to work pretty well, almost unreasonably so.
I am assuming it relates to the kinds of "variable precision floating point with bounds" methods used in CGAL and the like; Googling turns up this survey paper:
https://inria.hal.science/inria-00344355/PDF/p.pdf
Any additional references welcome!
The goal for this guy seems to be a Linux distro primarily to serve as a reproducible dev environment that must include his own in-progress EDT editor clone, but can include others as long as they're not vim and don't use Qt. Ironically Qt closed-source targets vxWorks and QNX. Dräger ventilators use it for their frontend.
https://www.logikalsolutions.com/wordpress/information-techn...
Like the general idea of a medical device linux distros (for both dev host and targets) is not a bad one. But the thinking and execution in this case is totally derailed due to outsized and unfocused reactions to details that don't matter (ancient IRS tax computers), QtQuick having some growing pains over a decade ago, personal hatred of vim, conflating a hatred of Agile with CI/CD.
(*) For some value of "obviously".
Thanks a lot for your explanation, but does that mean "byte" is any amount of data that can be fetched in a given mode in such machines?
e.g. you have 6-bit, 9-bit, 12-bit, and 18-bit bytes in a 36-bit machine in sixth-word mode, quarter-word mode, third-word mode, and half-word mode, respectively? Which means in full-word mode the "byte" would be 36 bits?
At least, if I understood all of this...
You know, that describes pretty much everyone who has anything to do with Rust.
"My ls utility isn't written in Rust, yikes! Let's fix that!"
"The comments under this C++-related HN submission aren't talking about Rust enough, yikes! Let's fix that!"
I'm obviously pointing to a solution: have a standard module that any Rust program can depend on coming from the language, which has a few sanely named types. Rather than every program defining its own.
Usually, I just want the widest type that is efficient on the machine, and I don't want it to have an inappropriate name. I don't care about the wasted space, because it only matters in large arrays, and often not even then.
References for the actual methods used in Triangle: http://www.cs.cmu.edu/~quake/robust.html
My comment in a pastebin. HN doesn't like unicode.
You need this crate to deal with it in Rust, it's not part of the base libraries:
https://crates.io/crates/unicode-segmentation
The languages that have this kind of feature built-in in the standard library, to my knowledge, are Swift, JavaScript, C# and Java. Swift is the only one, of those four, that treat operating on graphemes as the default. JavaScript requires Intl.Segmenter, C# requires StringInfo, Java requires BreakIterator.
By the way, Python, the language that caused so much hurt with their 2.x->3.x transition promising better unicode support in return for this pain, couldn't even do this right. There is no concept of graphemes in the standard library. So much for the batteries-included bit.
>>> test = " "
>>> [char for char in test]
['', '\u200d', '', '\u200d', '', '\u200d', '']
>>> len(test)
7
In JavaScript REPL (nodejs):
> let test = " "
undefined
> [...new Intl.Segmenter().segment(test)][0].segment;
' '
> [...new Intl.Segmenter().segment(test)].length;
1
Works as it should.
In python you would need a third party library.
Swift is truly the nicest of programming languages as far as strings are concerned. It just works as it always should have been.
let test = " "
for char in test {
print(char)
}print(test.count)
output :
1
[Execution complete with exit code 0]
I, as a non-Apple user, feel quite the Apple envy whenever I think about swift. It's such a nice language, but there's little ecosystem outside of Apple UIs.
But man, no using third party libraries, or working with a wrapper segmenter class or iterator. Just use the base string literals as is. It. Just. Works.
In Rust, that's not really the case. `i32` is the go-to integer type.
`isize` on the other hand would look really weird in code — it's an almost unused integer type. I also prefer having integers that don't depend on the machine I'm running them on.
I understand how iterating through a string by grapheme clusters is convenient for some applications. But it’s far from obvious to me that doing so should be the language’s default. Dealing with grapheme clusters requires a Unicode database, which needs to live somewhere and needs to be updated continuously as Unicode grows. (Should rust statically link that library into every app that uses it?)
Generally there are 3 ways to iterate a string: by UTF8 bytes (or ucs2 code points like Java/js/c#), by Unicode codepoint or by grapheme clusters. UTF8 encoding comes up all the time when encoding / decoding strings - like, to json or when sending content over http. Codepoints are, in my opinion, the correct approach when doing collaborative editing or patching strings. And grapheme clusters are useful in frontend user interfaces - like when building a terminal.
Of those 3 iteration methods, I’ve personally used UTF8 encoding the most and grapheme clusters the least. Tell me - why should grapheme clusters be the default way to iterate over a string? I can see the argument in Swift, which is a language built for frontend UI. But in a systems language like rust? That seems like a terrible default to me. UTF8 bytes are by far the most useful representation for strings in systems code, since from the pov of systems code, strings are usually just data.
Until one realizes that the entire namespace of innn, unnn, fnnn, etc., is reserved.
Modern floating-point is much more reproducible than fixed-point, FWIW, since it has an actual standard that’s widely adopted, and these excess-precision issues do not apply to SSE or ARM FPUs.
In C, the ISO standard is the authority on how you're supposed to write your code for it to be "correct C". The compiler implementors write their compilers against the ISO standard. That standard is not clear on these things.
Over the years I've known some engineers who, as a side project, wrote some great software. Nobody was interested in it. They'd come to me and ask why that is? I suggest writing articles about their project, and being active on the forums. Otherwise, who would ever know about it?
They said that was unseemly, and wouldn't do it.
They wound up sad and bitter.
The "build it and they will come" is a stupid Hollywood fraud.
BTW, the income I receive from D is $0. It's my gift. You'll also note that I've suggested many times improvements that could be made to C, copying proven ideas in D. Such as this one:
https://www.digitalmars.com/articles/C-biggest-mistake.html
C++ has already adopted many ideas from D.
You mention terminals, yes, it's one of the area where graphemes are an absolute must, but pretty much any time you are going to do something to text like deciding "I am going to put a linebreak here so that the text doesn't overflow beyond the box, beyond this A4 page I want to print, beyond the browser's window" grapheme handling is involved.
Any time a user is asked to input something too. I've seen most software take the "iterate over characters" approach to real time user input and they break down things like those emojis into individual components whenever you paste something in.
For that matter, backspace doesn't work properly on software you would expect to do better than that. Put the emoji from my pastebin in Microsoft Edge's search/url bar, then hit backspace, see what happens. While the browser displays the emoji correctly, the input field treats it the way Python segments it in my example: you need to press backspace 7 times to delete it. 7 times! Windows Terminal on the other hand has the quirk of showing a lot of extra spaces after the emoji (despite displaying the emoji correctly too) and will also require 11 backspace to delete it.
Notepad handles it correctly: press backspace once, it's deleted, like any normal character.
> Of those 3 iteration methods, I’ve personally used UTF8 encoding the most and grapheme clusters the least.
This doesn't say anything about grapheme clusters being useless. I've cited examples of popular software doing the wrong thing precisely because, like you, they didn't iterate over grapheme clusters. That you never use grapheme iteration might say more about you than it says about grapheme iteration being unneeded.
The dismissiveness over more sane string handling as a standard is not unlike C++ developers pretending that developers are doing the right thing with memory management so we don't need a GC (or rust's ownership paradigm). Nonsense.
Their words, not mine. If lives are on the line you probably shouldn’t be using linux in your medical device. And I hope my life never depends on a medical device running linux.
Intel 80387 was made compliant with the final standard and by that time there were competing FPUs also compliant with the final standard, e.g. Motorola 68881.
The Google BF16 format is useful strictly only for machine learning/AI applications, because its low precision is insufficient for anything else. BF16 has very low precision, but an exponent range equal to FP32, which makes overflows and underflows less likely.
Your examples are implementing custom text input boxes (Excel, Edge), line breaks while printing, and implementing a terminal application. I agree that in all of those cases, grapheme cluster segmentation is appropriate. But that doesn't make grapheme cluster based iteration "the correct way to handle strings". There's no "correct"! There are at least 3 different ways to iterate through a string, and different applications have different needs.
Good languages should make all of these options easy for programmers to use when they need them. Writing a custom input box? Use grapheme clusters. Writing a text based CRDT? Treat a string as a list of unicode codepoints. Writing an HTTP library? Treat the headers and HTML body as ASCII / opaque bytes. Etc.
I take the criticism that rust makes grapheme iteration harder than the others. But eh, rust has truly excellent crates for that within arms reach. I don't see any advantage in moving grapheme based segmentation into std. Well, maybe it would make it easier to educate idiot developers about this stuff. But there's no real technical reason. Its situationally useful - but less useful than lots of other 3rd party crates like rand, tokio and serde.
> like you, they didn't iterate over grapheme clusters. That you never use grapheme iteration might say more about you than it says about grapheme iteration being unneeded.
It says that in 30+ years of programming, I've never programmed a text input field from scratch. Why would I? That's the job of the operating system. Making my own sounds like a huge waste of time.
Is it though? With two's complement, ADD and SUB are the same hardware for unsigned and signed. MUL/IMUL is also the same for the lower half of the result (i.e. 32 bit × 32 bit = 32 bit). So your ALU and ISA are simple and flexible by design.
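A small sketch of that property: in two's complement the low half of a product is the same whether the operands are treated as signed or unsigned:

    #include <cassert>
    #include <cstdint>

    int main() {
        std::int32_t a = -7, b = 123456;
        std::uint32_t ua = static_cast<std::uint32_t>(a);
        std::uint32_t ub = static_cast<std::uint32_t>(b);
        // The low 32 bits of the signed product equal the unsigned product mod 2^32.
        assert(static_cast<std::uint32_t>(a * b) == ua * ub);
        return 0;
    }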
To be fair, this one lies on the surface for anyone trying to come up with an improved C. It's one of the first things that gets corrected in nearly all C derivatives.
> C++ has already adopted many ideas from D.
Do you have a list?
Especially for the "adopted from D" bit rather than being a evolutionary and logical improvement to the language.
One that lectures on the importance of college you would think would demonstrate the critical thinking skills to ask themselves why the top supercomputers use 2’s complement like everyone else.
The only aspect of 1’s or sign mag that is simpler is in generation. If you have a simple ADC that gives you a magnitude based on a count and a direction, it is trivial to just output that directly. 1’s I guess is not too much harder with XORs (but what’s the point?). 2’s requires some kind of ripple carry logic, the add 1 is one way, there are some other methods you can work out but still more logic than sign-mag. This is pretty much the only place where non 2’s complement has any advantage. Finally for an I2C or SPI sensor like a temp sensor it is more likely you will get none of the above and have some asymmetric scale. Anybody in embedded bloviating on this ought to know.
In his ramblings the mentions of packed decimal (BCD) are a nice touch. C and C++ have never supported that to begin with, so I have no idea why that must also be “considered”.
Therefore, declaring that int32 is the go to integer type is myopic.
Forty years ago, a program like this could be run on a 16 bit machine (e.g. MS-DOS box):
#include <stdio.h>
int main(int argc, char **argv)
{
    while (argc-- > 0)
        puts(*argv++);
    return 0;
}
int was 16 bits. That was fine; you would never pass anywhere near 32000 arguments to a program.

Today, that same program does the same thing on a modern machine with a wider int.
Good thing that some int16 had not been declared the go to integer type.
Rust's integer types are deliberately designed (by people who know better) in order to be appealing to people who know shit all about portability and whose brains cannot handle reasoning about types with a bit of uncertainty.
A type that provides a 128 bit integer exactly should have 128 in its name.
That is not the argument at all.
The problem is only having types like that, and stipulating nonsense like that the primary "go to" integer type is int32.
There are a few historical architectures that couldn't handle smaller integers, like the first generation Alpha, but:
a) those are long dead.
b) hardware engineers learnt from their mistake and no modern general purpose architecture has repeated it (specialized architectures like DSP and GPU are another story though).
c) worst case scenario, you can simulate it in software.
I would point out that for any language, if one has to follow the standards committee closely to be an effective programmer in that language, complexity is likely to be an issue. Fortunately in this case it probably isn't required.
I see garbage collection came in c++11 and has now gone. Would following that debacle make many or most c++ programmers more effective?