However, for high-performance systems software specifically, objects often have intrinsically ambiguous ownership and lifetimes that are only resolvable at runtime. Rust has a pretty rigid view of such things. In these cases C++ is much more ergonomic because objects with these properties are essentially outside the Rust model.
In my own mental model, Rust is what Java maybe should have been. It makes too many compromises for low-level systems code such that it has poor ergonomics for that use case.
What is the evidence for this? Plenty of high-performance systems software (browsers, kernels, web servers, you name it) has been written in Rust. Also Rust does support runtime borrow-checking with Rc<RefCell<_>>. It's just less ergonomic than references, but it works just fine.
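For illustration, a minimal sketch of what that runtime checking looks like (the Node type here is hypothetical, just to show shared ownership plus runtime-checked mutation):

    use std::cell::RefCell;
    use std::rc::Rc;

    // Hypothetical node whose ownership is decided at runtime: several parents
    // may keep it alive, and the aliasing rules are enforced dynamically.
    struct Node {
        value: i32,
        children: Vec<Rc<RefCell<Node>>>,
    }

    fn main() {
        let shared = Rc::new(RefCell::new(Node { value: 1, children: vec![] }));

        // Two owners of the same child; reference counting handles the lifetime.
        let a = Rc::new(RefCell::new(Node { value: 2, children: vec![shared.clone()] }));
        let b = Rc::new(RefCell::new(Node { value: 3, children: vec![shared.clone()] }));

        // Aliasing is checked at runtime: borrow_mut() panics (or try_borrow_mut()
        // returns Err) if a conflicting borrow is still live.
        shared.borrow_mut().value += 10;
        println!("{} {} {}", a.borrow().value, b.borrow().value, shared.borrow().value);
    }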
A trivial example is multiplication of large square matrices. An implementation needs to leverage all available CPU cores, and the traditional way to do that, found in many BLAS libraries, is to compute different tiles of the output matrix on different CPU cores. A tile is not a contiguous slice of memory; it's a rectangular segment of a dense 2D array. Writing different tiles of the same matrix in parallel is trivial in C++, very hard in Rust.
The near impossibility of building a competitive high-performance I/O scheduler in safe Rust is almost a trope at this point in serious performance-engineering circles.
To be clear, C++ is not exactly comfortable with this either, but it acknowledges that these cases exist and provides tools to manage them. Rust, not so much.
Thankfully C#, the other language I enjoy using, has mostly caught up with those languages.
Beyond that, there's the usual human factor in programming-language adoption.
Most of my applications are written in C#.
C# provides memory-safety guarantees very comparable to Rust's, and some of its other safety guarantees are better (one example is the compiler option to convert integer overflows into runtime exceptions). It's a higher-level language with a great, feature-rich standard library; even large projects compile in a few seconds; async IO is usable; there are good-quality GUI frameworks… Replacing C# with Rust would not be a benefit.
For your concrete example of subdividing matrices, that seems like it should be fairly straightforward in Rust too, if you convert your mutable reference to the data into a pointer, wrap your pointer-arithmetic shenanigans in an unsafe block, and add a comment at the top saying more or less "this is safe because the different subprograms are always operating on disjoint subsets of the data, and therefore no mutable aliasing can occur"?
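A minimal sketch of that approach, assuming a row-major square matrix stored in a Vec<f64> and a hypothetical fill_tiles_in_parallel helper; the justification for the unsafe block is exactly the disjoint-tiles argument above:

    // Sketch only: fill disjoint tiles of a row-major n x n matrix from several threads.
    // Safety rests entirely on the tiles never overlapping, which the compiler can't see.
    fn fill_tiles_in_parallel(matrix: &mut [f64], n: usize, tile: usize) {
        let base = matrix.as_mut_ptr() as usize; // pass the pointer as usize so the closures are Send

        std::thread::scope(|s| {
            for tile_row in (0..n).step_by(tile) {
                for tile_col in (0..n).step_by(tile) {
                    s.spawn(move || {
                        let ptr = base as *mut f64;
                        for r in tile_row..(tile_row + tile).min(n) {
                            for c in tile_col..(tile_col + tile).min(n) {
                                // SAFETY: each (r, c) belongs to exactly one tile, so no two
                                // threads ever touch the same element - no mutable aliasing.
                                unsafe { *ptr.add(r * n + c) = (r * n + c) as f64 };
                            }
                        }
                    });
                }
            }
        });
    }

    fn main() {
        let n = 8;
        let mut m = vec![0.0; n * n];
        fill_tiles_in_parallel(&mut m, n, 4);
        assert!(m.iter().enumerate().all(|(i, &v)| v == i as f64));
    }

Scoped tasks from rayon or crossbeam would work the same way; the point is just that the disjointness proof lives in the comment rather than in the type system.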
FWIW, in the case where you're not separating code via a dynamic library boundary, you give the compiler an opportunity to optimise across those unsafe usages, e.g. to inline the unsafe code into its callers.
Yeah, and that model is rather old: https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule

In practice, complex software systems have been written in multiple languages for decades. The requirements of performance-critical low-level components and of high-level logic are too different, and they are in conflict.
> you give the compiler an opportunity to optimise across those unsafe usages
One workaround is better design of the DLL API. Instead of implementing performance-critical outer layers in C#, do so on the C++ side of the interop, possibly injecting C# dependencies via function pointers or an abstract interface.
Another option is to re-implement these smaller functions in C#. The modern .NET runtime is not terribly slow; it even supports SIMD intrinsics. You are unlikely to match the performance of an optimised C++ release build with LTO, but the C# version is unlikely to fall significantly short.
On some workloads (think calls that can't be inlined within a hot loop), I found LTO to be a requirement for C code to match C# performance, not the other way around. We've come a long way!
(If you ask whether there are any caveats - yes: the JIT is able to win additional perf points by not being constrained to SSE2/SSE4.2, and by shipping more heavily vectorized primitives out of the box, which allow single-line changes that outpace what the average C library has access to.)
Yeah, I observed that too. As far as I remember, that code did many small memory allocations, and .NET GC was faster than malloc.
However, last time I tested (on .NET 6 back then), for code which crunches numbers with AVX, my C++ with SIMD intrinsics was faster than C# with SIMD intrinsics. Not by much, but noticeably - like 20%. The code generator was just better in C++. I suspect the main reason is that the .NET JIT compiler doesn't have time for expensive optimisations.
Yeah, there are heavy constraints on how many phases there are and how much work each phase can do. Besides the inlining budget, there are many hidden "limits" within the compiler which reduce the risk of throughput loss.
For example, the JIT can only track so many assertions about local variables at a time, and if a method has too many blocks, it may not track them perfectly across the full span of the method.
GCC and LLVM can leisurely repeat optimization phases, whereas RyuJIT avoids it (even if some phases replicate optimizations that happened earlier). This will change once the "Opt Repeat" feature gets productized[0]; we will most likely see it in NativeAOT first, as you'd expect.
On matching the codegen quality GCC produces for vectorized code - I'm usually able to replicate it by iteratively refactoring the implementation and quickly checking its disasm with the Disasmo extension. The main catch with this type of code is that GCC, LLVM and ILC/RyuJIT each have their own quirks around SIMD (e.g. does the compiler mistakenly rematerialize a vector constant's construction inside the loop body, undoing your hoisting of its load?). I used to think this was a weakness unique to .NET, but then I learned that GCC and LLVM tend to be vulnerable to it too, and even regress across updates, as sometimes happens in SIMD edge cases in .NET. But it is certainly not as common. What GCC/LLVM are better at is abstraction: once you start abstracting away your SIMD code, .NET may need more help. If you exhaust the available registers - sometimes due to less-than-optimal register allocation - you start getting spills, or you can run into technically-correct behavior around vector shuffles, where the JIT has to replicate portable semantics but fails to see that your constant doesn't need them, so you have to reach for platform-specific intrinsics to work around it.
This is the opposite of what I was suggesting, though; those function pointers or abstract interfaces inhibit exactly the optimisations I had in mind (e.g. inlining that enables dead-code elimination of bounds checks, or inlining comparison functions into sort implementations - the classics).
EDIT: that said, it's definitely still possible to keep it from hurting performance; it just takes being somewhat careful when designing the interface, which you don't have to think about if it's all in the same compiler/link step.