451 points | birdculture | 3 comments

sesm ◴[] No.43979679[source]
Is there a concise document that explains the major decisions behind Rust's language design for those who know C++? Not a newbie tutorial, just straight to the point: why in-place mutability instead of other options, why encourage stack allocation, which problems with C++ it solves and at what cost, etc.
replies(5): >>43979717 #>>43979806 #>>43980063 #>>43982558 #>>43984758 #
jandrewrogers ◴[] No.43980063[source]
Rust has better defaults for types than C++, largely because the C++ defaults came from C. Rust is more ergonomic in this regard. If C++ were designed today, it would likely adopt many of the same defaults.

However, for high-performance systems software specifically, objects often have intrinsically ambiguous ownership and lifetimes that are only resolvable at runtime. Rust has a pretty rigid view of such things. In these cases C++ is much more ergonomic because objects with these properties are essentially outside the Rust model.

In my own mental model, Rust is what Java maybe should have been. It makes too many compromises for low-level systems code such that it has poor ergonomics for that use case.

replies(3): >>43980292 #>>43980421 #>>43981221 #
Const-me ◴[] No.43980421[source]
Interestingly, CPU-bound high-performance systems are also incompatible with Rust’s model. Ownership for them is unambiguous, but Rust has another issue: it doesn’t support multiple writable references to the same memory being accessed by multiple CPU cores in parallel.

A trivial example is multiplication of large square matrices. An implementation needs to leverage all available CPU cores, and a traditional way to do that, found in many BLAS libraries, is to compute different tiles of the output matrix on different CPU cores. A tile is not a contiguous slice of memory; it’s a rectangular segment of a dense 2D array. Writing different tiles of the same matrix in parallel is trivial in C++ and very hard in Rust.

replies(2): >>43981043 #>>43982067 #
winrid ◴[] No.43981043[source]
Hard in safe Rust. You can just use unsafe in that one area and still benefit from safe Rust in most of your application.
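A minimal sketch of that pattern (illustrative only, assuming a row-major n×n output matrix of f32): each scoped thread fills one rectangular tile through a copy of a raw pointer, and the unsafe is confined to the single element write.

    use std::thread;

    // Raw pointers aren't Send, so wrap one in a newtype we vouch for.
    #[derive(Clone, Copy)]
    struct SendPtr(*mut f32);
    unsafe impl Send for SendPtr {}

    // Fill an n x n row-major matrix tile by tile, one scoped thread per tile.
    // Soundness argument: the tiles are pairwise disjoint, so no element is
    // ever written by two threads.
    fn fill_tiles(out: &mut [f32], n: usize, tile: usize) {
        assert_eq!(out.len(), n * n);
        let ptr = SendPtr(out.as_mut_ptr());
        thread::scope(|s| {
            for ti in (0..n).step_by(tile) {
                for tj in (0..n).step_by(tile) {
                    s.spawn(move || {
                        // Bind the whole wrapper so the closure captures SendPtr
                        // (which is Send), not the bare *mut f32 field.
                        let p = ptr;
                        for r in ti..(ti + tile).min(n) {
                            for c in tj..(tj + tile).min(n) {
                                // The one unsafe spot: write this tile's element.
                                unsafe { *p.0.add(r * n + c) = (r * n + c) as f32 };
                            }
                        }
                    });
                }
            }
        });
    }

    fn main() {
        let n = 8;
        let mut m = vec![0.0_f32; n * n];
        fill_tiles(&mut m, n, 4);
        assert!(m.iter().enumerate().all(|(i, &v)| v == i as f32));
    }

Safe slice APIs like chunks_mut only hand out contiguous row bands, which is why the rectangular tiles above need the raw pointer; a real implementation would hide this behind a safe tile-splitting helper (or use a crate that already provides one).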
replies(1): >>43981264 #
Const-me ◴[] No.43981264[source]
I don’t use C++ for most of my applications. I only use C++ to build DLLs that implement CPU-bound, performance-sensitive numeric stuff, and sometimes to consume C++ APIs and third-party libraries.

Most of my applications are written in C#.

C# provides memory-safety guarantees very comparable to Rust’s, and some of its other safety guarantees are better (one example is the compiler option that converts integer overflows into runtime exceptions). It is a higher-level language with a great, feature-rich standard library; even large projects compile in a few seconds; async IO is usable; and there are good-quality GUI frameworks… Replacing C# with Rust would not be a benefit.

replies(3): >>43981870 #>>43983490 #>>43986327 #
dwattttt ◴[] No.43983490[source]
It does sound like quite a similar model: unsafe Rust in self-contained regions, safe Rust in the majority of areas.

FWIW, in the case where you're not separating code via a dynamic library boundary, you give the compiler an opportunity to optimise across those unsafe usages, e.g. opportunities to inline the unsafe code into its callers.
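As a toy illustration of that (my sketch, with made-up names): a safe wrapper over a small unsafe kernel built in the same crate can collapse entirely at the call site, whereas the same function exported from a cdylib and called through FFI cannot.

    // A tiny unsafe kernel behind a safe wrapper. Built in the same crate
    // (or with LTO), the call can inline straight into callers; exported
    // across a dynamic-library boundary, that opportunity is lost.
    #[inline]
    unsafe fn sum_unchecked(data: &[f32], idx: &[usize]) -> f32 {
        // Caller guarantees every index is in bounds.
        idx.iter().map(|&i| unsafe { *data.get_unchecked(i) }).sum()
    }

    pub fn sum_selected(data: &[f32], idx: &[usize]) -> f32 {
        // Validate once here, then let the optimiser see through the unsafe call.
        assert!(idx.iter().all(|&i| i < data.len()));
        unsafe { sum_unchecked(data, idx) }
    }

    fn main() {
        let data = [1.0_f32, 2.0, 3.0, 4.0];
        println!("{}", sum_selected(&data, &[0, 2, 3])); // prints 8
    }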

replies(1): >>43987535 #
Const-me ◴[] No.43987535[source]
> quite a similar model

Yeah, and that model is rather old (https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule). In practice, complex software systems have been written in multiple languages for decades. The requirements of performance-critical low-level components and of high-level logic are too different, and they conflict.

> you give the compiler an opportunity to optimise across those unsafe usages

One workaround is better design of the DLL API. Instead of implementing performance-critical outer layers in C#, do so on the C++ side of the interop, possibly injecting C# dependencies via function pointers or an abstract interface.
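A sketch of that shape (hypothetical names, compiled as a cdylib, and Rust standing in for the C++ side of the interop purely to keep these examples in one language): the native export owns the hot loop and receives the host’s logic as a C-ABI function pointer.

    /// Callback supplied by the managed host, e.g. a C# delegate marshalled
    /// as an unmanaged function pointer. Name and shape are illustrative.
    pub type ReportFn = extern "C" fn(done: u32, total: u32);

    /// The performance-critical outer loop lives on the native side of the
    /// interop boundary and only calls back into the host occasionally.
    ///
    /// Safety: `data` must point to `len` valid, initialized f32 values.
    #[no_mangle]
    pub unsafe extern "C" fn process_buffer(data: *mut f32, len: usize, report: ReportFn) {
        let data = unsafe { std::slice::from_raw_parts_mut(data, len) };
        let total = data.len() as u32;
        for (i, x) in data.iter_mut().enumerate() {
            *x = x.sqrt(); // stand-in for the real numeric kernel
            if i % 4096 == 0 {
                report(i as u32, total); // injected host dependency
            }
        }
    }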

Another option is to re-implement these smaller functions in C#. The modern .NET runtime is not terribly slow; it even supports SIMD intrinsics. You are unlikely to match the performance of an optimised C++ release build with LTO, but the C# version is unlikely to fall significantly short.

replies(2): >>43987903 #>>43990873 #
1. neonsunset ◴[] No.43987903[source]
> LTO

On some workloads (think calls not possible to inline within a hot loop), I found LTO to be a requirement for C code to match C# performance, not the other way around. We've come a long way!

(If you ask whether there are any caveats: yes, the JIT is able to win additional perf points by not being constrained to SSE2/4.2 and by shipping more heavily vectorized primitives out of the box, which allow single-line changes that outpace what the average C library has access to.)

replies(1): >>43988370 #
2. Const-me ◴[] No.43988370[source]
> on some workloads, I found LTO to be a requirement for C code to match C# performance

Yeah, I observed that too. As far as I remember, that code did many small memory allocations, and .NET GC was faster than malloc.

However, the last time I tested (with .NET 6 back then), for code that churns numbers with AVX, my C++ with SIMD intrinsics was faster than C# with SIMD intrinsics. Not by much, but noticeably, like 20%. The code generator was just better in C++. I suspect the main reason is the .NET JIT compiler doesn’t have time for expensive optimisations.

replies(1): >>43988785 #
3. neonsunset ◴[] No.43988785[source]
> The code generator was just better in C++. I suspect the main reason is the .NET JIT compiler doesn’t have time for expensive optimisations.

Yeah, there are heavy constraints on how many phases there are and how much work each phase can do. Besides the inlining budget, there are many hidden "limits" within the compiler that reduce the risk of throughput loss.

For example, the JIT will only be able to track so many assertions about local variables at the same time, and if the method has too many blocks, it may not track them perfectly across their full span.

GCC and LLVM are able to leisurely repeat optimization phases, whereas RyuJIT avoids it (even if some phases replicate optimizations that happened earlier). This will change once the "Opt Repeat" feature gets productized [0]; we will most likely see it in NativeAOT first, as you'd expect.

On matching the codegen quality produced by GCC for vectorized code: I'm usually able to replicate it by iteratively refactoring the implementation and quickly checking its disassembly with the Disasmo extension. The main catch with this type of code is that GCC, LLVM and ILC/RyuJIT each have their own quirks around SIMD (e.g. does the compiler mistakenly rematerialize vector constant construction inside the loop body, undoing your hoisting of its load?). Previously I thought this was a weakness unique to .NET, but then I learned that GCC and LLVM tend to be vulnerable to it as well, and even regress across updates, as sometimes happens in SIMD edge cases in .NET. But it is certainly not as common.

Where GCC/LLVM are better is when you start abstracting away your SIMD code. In that case .NET may need more help: once you start exhausting the available registers, due to sometimes less-than-optimal register allocation, you start getting spills; or you may run into technically correct behavior around vector shuffles, where the JIT needs to replicate portable behavior but fails to see that your particular constant doesn't need it, so you have to reach for platform-specific intrinsics to work around it.
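To make the vector-constant point concrete, here is the pattern in question sketched with AVX intrinsics in Rust (illustrative only; the same shape applies with C++ or C# intrinsics): the splat constant is built once before the loop, and the quirk above is whether the backend keeps it in a register or rebuilds/reloads it on every iteration.

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx")]
    unsafe fn scale_by_half(data: &mut [f32]) {
        use std::arch::x86_64::*;
        unsafe {
            // Built once; the intent is for this splat to stay in a register
            // across iterations rather than be rematerialized inside the loop.
            let half = _mm256_set1_ps(0.5);
            for chunk in data.chunks_exact_mut(8) {
                let v = _mm256_loadu_ps(chunk.as_ptr());
                _mm256_storeu_ps(chunk.as_mut_ptr(), _mm256_mul_ps(v, half));
            }
        }
    }

    fn main() {
        let mut v: Vec<f32> = (0..16).map(|i| i as f32).collect();
        #[cfg(target_arch = "x86_64")]
        if is_x86_feature_detected!("avx") {
            unsafe { scale_by_half(&mut v) };
        }
        println!("{:?}", &v[..4]);
    }

Whether the compiler honors that intent is exactly the per-backend quirk described above; checking the disassembly (e.g. with Disasmo on the .NET side) is the only reliable way to know.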

[0]: https://github.com/dotnet/runtime/issues/108902