Compiling C to Safe Rust, Formalized

Show context

pizlonator ◴[21 Dec 24 01:12 UTC] No.42476714[source]▶

Compiling a tiny subset of C, that is. It might be so tiny as to be useless in practice.

I have low hopes for this kind of approach; it’s sure to hit the limits of what’s possible with static analysis of C code. Also, choosing Rust as the target makes the problem unnecessarily hard because Rust’s ownership model is so foreign to how real C programs work.

replies(4): >>42476809 #>>42476961 #>>42477085 #>>42477236 #

pornel ◴[21 Dec 24 02:08 UTC] No.42476961[source]▶

>>42476714 #

Rust's ownership model is close enough for translating C. It's just more explicit and strongly typed, so the translation needs to figure out what a more free-form C code is trying to do, and map that to Rust's idioms.

For example, C's buffers obviously have lengths, but in C the length isn't explicitly tied to a pointer, so the translator has to deduce how the C program tracks the length to convert that into a slice. It's non-trivial even if the length is an explicit variable, and even trickier if it's calculated or changes representations (e.g. sometimes used in the form of one-past-the-end pointer).

Other C patterns like `bool should_free_this_pointer` can be translated to Rust's enum of `Owned`/`Borrowed`, but again it requires deducing which allocation is tied to which boolean, and what's the true safe scope of the borrowed variant.

replies(4): >>42477145 #>>42477151 #>>42477477 #>>42477822 #

pizlonator ◴[21 Dec 24 02:47 UTC] No.42477145[source]▶

>>42476961 #

Rust’s ownership model forbids things like doubly linked lists, which C programs use a lot.

That’s just one example of how C code is nowhere near meeting Rust’s requirements. There are lots of others.

replies(3): >>42477256 #>>42477615 #>>42482450 #

orf ◴[21 Dec 24 03:10 UTC] No.42477256[source]▶

>>42477145 #

> Rust’s ownership model forbids things like doubly linked lists, which C programs use a lot.

It’s literally in the standard library

https://doc.rust-lang.org/std/collections/struct.LinkedList....

replies(4): >>42477296 #>>42477337 #>>42477424 #>>42477565 #

quuxplusone ◴[21 Dec 24 03:49 UTC] No.42477424[source]▶

>>42477256 #

But it's not in C's standard library. So the exercise isn't merely to auto-translate one language's standard library to another language's standard library (say, replacing C++ std::list with Rust LinkedList) — which would already be very hard. The exercise here is to auto-identify-and-refactor idioms open-coded in one language, into idioms suited for the other language's already-written standard library.

Imagine refactoring your average C program to use GLib for all (all!) of its data structures. Now imagine doing that, but also translating it into Rust at the same time.

replies(1): >>42477633 #

Animats ◴[21 Dec 24 04:55 UTC] No.42477633[source]▶

>>42477424 #

> The exercise here is to auto-identify-and-refactor idioms open-coded in one language, into idioms suited for the other language's already-written standard library.

That's what LLMs are for - idiom translation. You can't trust them to do it right, though.

[Pan et al . 2024] find that while GPT-4 generates code that is more idiomatic than C2Rust, only 61% of it is correct (i.e., compiles and produces the expected result), compared to 95% for C2Rust.

This problem needs both AI-type methods to help with the idioms and formal methods to insure that the guessed idioms correctly capture the semantics.

A big advance in this project is that they can usually translate C pointer arithmetic into Rust slices. That's progress on of one of the hardest parts of the problem. C2Rust did not do that. That system just generates unsafe raw pointer arithmetic, yielding ugly Rust code that replicates C pointer semantics using function calls.

DARPA is funding research in this area under the TRACTOR program. Program awards in April 2025, so this is just getting started. It's encouraging to see so much progress already. This looks do-able.

replies(4): >>42477825 #>>42477849 #>>42478009 #>>42479395 #

saghm ◴[21 Dec 24 06:05 UTC] No.42477825[source]▶

>>42477633 #

Oh god, I can't even imagine trying to have formally-verified LLM-generated code. It's not surprising that even incremental progress for that would require quite a lot of ingenuity.

replies(1): >>42477989 #