Most active commenters

pizlonator(3)

Popular/hot comments

>>42477633 #
>>42477337 #

←back to thread

Compiling C to Safe Rust, Formalized

(arxiv.org)

Show context

pizlonator ◴[21 Dec 24 01:12 UTC] No.42476714[source]▶

>>42476192 (OP) #

Compiling a tiny subset of C, that is. It might be so tiny as to be useless in practice.

I have low hopes for this kind of approach; it’s sure to hit the limits of what’s possible with static analysis of C code. Also, choosing Rust as the target makes the problem unnecessarily hard because Rust’s ownership model is so foreign to how real C programs work.

replies(4): >>42476809 #>>42476961 #>>42477085 #>>42477236 #

pornel ◴[21 Dec 24 02:08 UTC] No.42476961[source]▶

>>42476714 #

Rust's ownership model is close enough for translating C. It's just more explicit and strongly typed, so the translation needs to figure out what a more free-form C code is trying to do, and map that to Rust's idioms.

For example, C's buffers obviously have lengths, but in C the length isn't explicitly tied to a pointer, so the translator has to deduce how the C program tracks the length to convert that into a slice. It's non-trivial even if the length is an explicit variable, and even trickier if it's calculated or changes representations (e.g. sometimes used in the form of one-past-the-end pointer).

Other C patterns like `bool should_free_this_pointer` can be translated to Rust's enum of `Owned`/`Borrowed`, but again it requires deducing which allocation is tied to which boolean, and what's the true safe scope of the borrowed variant.

replies(4): >>42477145 #>>42477151 #>>42477477 #>>42477822 #

pizlonator ◴[21 Dec 24 02:47 UTC] No.42477145[source]▶

>>42476961 #

Rust’s ownership model forbids things like doubly linked lists, which C programs use a lot.

That’s just one example of how C code is nowhere near meeting Rust’s requirements. There are lots of others.

replies(3): >>42477256 #>>42477615 #>>42482450 #

1. orf ◴[21 Dec 24 03:10 UTC] No.42477256[source]▶

>>42477145 #

> Rust’s ownership model forbids things like doubly linked lists, which C programs use a lot.

It’s literally in the standard library

https://doc.rust-lang.org/std/collections/struct.LinkedList....

replies(4): >>42477296 #>>42477337 #>>42477424 #>>42477565 #

2. ◴[21 Dec 24 03:22 UTC] No.42477296[source]▶

>>42477256 (TP) #

3. singron ◴[21 Dec 24 03:33 UTC] No.42477337[source]▶

>>42477256 (TP) #

This implementation uses unsafe. You can write a linked list in safe rust (e.g. using Rc), but it probably wouldn't resemble the one you write in C.

In practice, a little unsafe is usually fine. I only bring it up since the article is about translating to safe rust.

replies(3): >>42477355 #>>42477984 #>>42479070 #

4. orf ◴[21 Dec 24 03:36 UTC] No.42477355[source]▶

>>42477337 #

Safe rust isn’t “rust code with absolutely 0 unsafe blocks in any possible code path, ever”. Rc uses unsafe code every time you construct one, for example.

Unsafe blocks are an escape hatch where you promise that some invariants the compiler cannot verify are in fact true. If the translated code were to use that collection, via its safe interfaces, it would still be “safe rust”.

More generally: it’s incorrect to say that the rust ownership model forbids X when it ships with an implementation of X, regardless of if and how it uses “unsafe” - especially if “unsafe” is a feature of the ownership model that helps implement it.

replies(1): >>42477462 #

5. quuxplusone ◴[21 Dec 24 03:49 UTC] No.42477424[source]▶

>>42477256 (TP) #

But it's not in C's standard library. So the exercise isn't merely to auto-translate one language's standard library to another language's standard library (say, replacing C++ std::list with Rust LinkedList) — which would already be very hard. The exercise here is to auto-identify-and-refactor idioms open-coded in one language, into idioms suited for the other language's already-written standard library.

Imagine refactoring your average C program to use GLib for all (all!) of its data structures. Now imagine doing that, but also translating it into Rust at the same time.

replies(1): >>42477633 #

6. andrewflnr ◴[21 Dec 24 04:01 UTC] No.42477462{3}[source]▶

>>42477355 #

No one here is confused about what unsafe means. The point is, they're not implemented by following Rust's ownership model, because Rust's ownership model does in fact forbid that kind of thing.

You can nitpick the meaning of "forbids", but as far as the current context is concerned, if you translate code that implements a doubly linked list (as opposed to using one from a library) into Rust, it's not going to work without unsafe. Or an index-based graph or something.

replies(1): >>42477635 #

7. pizlonator ◴[21 Dec 24 04:32 UTC] No.42477565[source]▶

>>42477256 (TP) #

Good luck inferring how to use that from some C programmer’s deranged custom linked list.

C programmers don’t do linked lists by using libraries, they hand roll them, and often they are more complex than “just” a linked list. Lots of complex stuff out there.

8. Animats ◴[21 Dec 24 04:55 UTC] No.42477633[source]▶

>>42477424 #

> The exercise here is to auto-identify-and-refactor idioms open-coded in one language, into idioms suited for the other language's already-written standard library.

That's what LLMs are for - idiom translation. You can't trust them to do it right, though.

[Pan et al . 2024] find that while GPT-4 generates code that is more idiomatic than C2Rust, only 61% of it is correct (i.e., compiles and produces the expected result), compared to 95% for C2Rust.

This problem needs both AI-type methods to help with the idioms and formal methods to insure that the guessed idioms correctly capture the semantics.

A big advance in this project is that they can usually translate C pointer arithmetic into Rust slices. That's progress on of one of the hardest parts of the problem. C2Rust did not do that. That system just generates unsafe raw pointer arithmetic, yielding ugly Rust code that replicates C pointer semantics using function calls.

DARPA is funding research in this area under the TRACTOR program. Program awards in April 2025, so this is just getting started. It's encouraging to see so much progress already. This looks do-able.

replies(4): >>42477825 #>>42477849 #>>42478009 #>>42479395 #

9. oneshtein ◴[21 Dec 24 04:56 UTC] No.42477635{4}[source]▶

>>42477462 #

It's easy to implement doubly linked lists in safe Rust. Just ensure that every element has one OWNER, to avoid «use after free» bugs, or use a garbage collector, like a reference counter.

Unlike C++ or Rust, C has no references, only pointers, so developer must release memory manually at some arbitrary point. This is the problem and source of bugs.

replies(1): >>42477850 #

10. saghm ◴[21 Dec 24 06:05 UTC] No.42477825{3}[source]▶

>>42477633 #

Oh god, I can't even imagine trying to have formally-verified LLM-generated code. It's not surprising that even incremental progress for that would require quite a lot of ingenuity.

replies(1): >>42477989 #

11. fuhsnn ◴[21 Dec 24 06:14 UTC] No.42477849{3}[source]▶

>>42477633 #

>That's what LLMs are for - idiom translation. You can't trust them to do it right, though.

Optimizing C compilers also happened to be good at idiom recognition, and we can probably trust them a little more. The OP paper does mention future plan to use clang as well: >We have plans for a libclang-based frontend that consume actual C syntax.

If such transformation can be done at IR level it might be more efficient to be to C-IR > idiom transform to Rust-IR > run safe-checks in Rust-IR > continue compilation in C-IR or Rust-IR or combining both for better optimization properties.

replies(1): >>42478090 #

12. saghm ◴[21 Dec 24 06:15 UTC] No.42477850{5}[source]▶

>>42477635 #

While I might agree that it's easy if you use a reference counter, this is not going to be as performant as the typical linked list written in C, which is why the standard library uses unsafe for its implementation of stuff like this. If it were "easy" to just write correct `unsafe`, then it would be easy to do it in C as well.

Note that the converse to this isn't necessarily true! People I trust way more to write unsafe Rust code than me than me have argued that unsafe Rust can be harder than writing C in some ways due to having to uphold certain invariants that don't come up in C. While there are a number of blog posts on the topic that anyone interested can probably find fairly easily by googling "unsafe Rust harder than C", I'll break my usual rule of strongly preferring articles to video content to link a talk from youtube because the speaker is one of those people I mention who I'd trust more than me to write unsafe code and I remember seeing him give this talk at the meetup: https://www.youtube.com/watch?v=QAz-maaH0KM

replies(1): >>42478984 #

13. oconnor663 ◴[21 Dec 24 07:02 UTC] No.42477984[source]▶

>>42477337 #

More important than whether you use a little unsafe or a lot, is whether you can find a clean boundary above which everything can be safe. Something like a hash function or a block cipher can be piles and piles of assembly under the covers, but since the API is bytes-in-bytes-out, the safety concerns are minimal. On the other hand, memory-mapping a file is just one FFI function call, but the uncontrollable mutability of the whole thing tends to poison everything above it with unsafety.

14. ◴[21 Dec 24 07:03 UTC] No.42477989{4}[source]▶

>>42477825 #

15. immibis ◴[21 Dec 24 07:10 UTC] No.42478009{3}[source]▶

>>42477633 #

Actually, LLMs are for generating humorous nonsense. Putting them in charge of the world economy was not intended, but we did it anyway.

replies(1): >>42478092 #

16. swiftcoder ◴[21 Dec 24 07:40 UTC] No.42478090{4}[source]▶

>>42477849 #

I'm definitely bullish on this angle of compiling C down to LLVM assembly, and then "decompiling" it back to Rust (with some reference to the original C to reconstruct high-level idioms like for loops)

17. dhosek ◴[21 Dec 24 07:40 UTC] No.42478092{4}[source]▶

>>42478009 #

Given that in my (small, employer-mandated) explorations with Copilot autocompletions it’s offered incorrect suggestions about a third of the time and seems to like to also suggest deprecated APIs, I’m skeptical about the current generation’s ability to be useful at even this small task.

replies(2): >>42480380 #>>42480910 #

18. bonzini ◴[21 Dec 24 11:17 UTC] No.42478984{6}[source]▶

>>42477850 #

> unsafe Rust can be harder than writing C in some ways due to having to uphold certain invariants that don't come up in C.

Yes, this is absolutely correct and on top of this you sometimes have to employ tricks to make the compiler infer the right lifetime or type for the abstraction you're providing. On the other hand, again thanks to the abstraction power of Rust compared to C, you can test the resulting code way more easily using for example Miri.

19. imtringued ◴[21 Dec 24 11:34 UTC] No.42479070[source]▶

>>42477337 #

I don't really see it as a big "owning" of Rust that a complex pointer heavy structure with runtime defined ownership cannot be checked statically. Almost every language that people use doubly linked lists in has a GC, making the discussion kind of meaningless.

So C and C++ are the exceptions to the rule, but how do they make it easy to write doubly linked lists? Obviously, the key assumption is that that the developer makes sure that node->next->prev = node->prev->next = node (Ignoring nullptr).

With this restriction, you can safely write a doubly linked list even without reference counting.

However, this isn't true on the pointer level. The prev pointers could be pointing at the elements in a completely random order. For example tail->prev = head, head->prev = second_last and so on. So that going backwards from the tail is actually going forwards again!

Then there is also the problem of having a pointer from the outside of the linked list pointing directly at a node. You would need a weak pointer, because another pointer could have requested deletion from the linked list, while you're still holding a reference.

If you wanted to support this generic datastructure, rather than the doubly linked list you have in your head, then you would need reference counting in C/C++ as well!

What this tells you, is that Rust isn't restrictive enough to enforce these memory safe contracts. Anyone with access to the individual nodes could break the contract and make the code unsafe.

20. CodesInChaos ◴[21 Dec 24 12:56 UTC] No.42479395{3}[source]▶

>>42477633 #

Why does C2Rust produce so much incorrect code? Getting 5% wrong sounds terrible, for a 1:1 translation to unsafe Rust. What does it mis-translate?

https://dl.acm.org/doi/pdf/10.1145/3597503.3639226

> As for C2Rust, the 5% unsuccessful translations were due to compilation errors, the majority of them caused by unused imports.

I'm rather confused by what that's supposed to mean, since unused imports cause warnings, not errors in Rust.

21. glouwbug ◴[21 Dec 24 15:59 UTC] No.42480380{5}[source]▶

>>42478092 #

Sure but it takes two copilots to fly a plane

22. LeFantome ◴[21 Dec 24 17:23 UTC] No.42480910{5}[source]▶

>>42478092 #

Have you seen O3?

If your experience with something less than half as good as state-of-the-art is that it worked 66% of the time, I am not sure why you would be so dismissive about the future potential.

↑