Compiling C to Safe Rust, Formalized

(arxiv.org)

291 points love2read | 3 comments | 20 Dec 24 23:30 UTC

wffurr ◴[21 Dec 24 00:33 UTC] No.42476523[source]▶

Note that this is done for “existing formally verified C codebases” which is a lot different from typical systems C code which is not formally verified.

replies(8): >>42476623 #>>42477360 #>>42478051 #>>42478440 #>>42478560 #>>42478707 #>>42479358 #>>42479797 #

1. safercplusplus ◴[21 Dec 24 07:27 UTC] No.42478051[source]▶

>>42476523 #

And even then, not completely reliably it seems (from Section 2.2):

> The coercions introduced by conversion rules can however lead to subtle semantic differences

The example they give is this C code:

    1 uint8_t x[1] = { 0 };
    2 uint8_t *y = x;
    3 *y = 1;
    4 assert(*x == 1); /* SUCCESS */

getting translated to this (safe) Rust code:

    1 let x: [u8; 1] = [0; 1];
    2 let mut y: Box<[u8]> = Box::new(x);
    3 y[0] = 1;
    4 assert!(x[0] == 1) /* failure */

So the pointer (iterator) declared on line 2, which targets an existing (stack-allocated) array, gets translated to an owning pointer (Box) targeting a new (heap-allocated) copy of the array. So if the original code was somehow counting on the pointer actually targeting the array it was assigned from, the translated code may (quietly) not behave correctly.
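
To make the copy visible, here is a minimal runnable sketch of the translated snippet. The function name `boxed_copy_demo` is made up for illustration; the body follows the Rust lines quoted above, returning the two values instead of asserting on them.

```rust
/// Mirrors the translated snippet: `y` owns a heap copy of `x`.
fn boxed_copy_demo() -> (u8, u8) {
    let x: [u8; 1] = [0; 1];
    // `[u8; 1]` is `Copy`, so `Box::new(x)` copies the array to the heap
    // and leaves `x` usable; the Box<[u8; 1]> then coerces to Box<[u8]>.
    let mut y: Box<[u8]> = Box::new(x);
    y[0] = 1; // writes to the heap copy only
    (x[0], y[0])
}

fn main() {
    let (x0, y0) = boxed_copy_demo();
    println!("x[0] = {x0}, y[0] = {y0}");
    // The write through `y` is not visible through `x`.
    assert_eq!((x0, y0), (0, 1));
}
```

The compiler accepts this without complaint because copying a `Copy` array is perfectly legal, which is why the divergence from the C behavior is silent.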

For comparison, the scpptool (my project) auto-translation (to a memory safe subset of C++) feature would translate it to something like:

    1 mse::lh::TNativeArrayReplacement<uint8_t, 1> x = { 0 };
    2 mse::lh::TNativeArrayReplacement<uint8_t, 1>::iterator y = x; // implicit conversion from array to iterator
    3 *y = 1;
    4 assert(*x == 1); /* SUCCESS */ // dereferencing of array supported for compatibility

or if y is subsequently retargeted at another type of array, then line 2 may end up as something like:

    2 mse::TAnyRandomAccessIterator<uint8_t> y = x; // implicit conversion from array to iterator

So the OP project may only be converting C code that is already amenable to being converted to safe Rust. But given the challenge of the problem, I can respect the accomplishment and see some potential utility in it.

edit: added translation for line 2 in an alternate hypothetical situation.

replies(1): >>42481336 #

2. lambda ◴[21 Dec 24 18:40 UTC] No.42481336[source]▶

>>42478051 (TP) #

  the translated code may (quietly) not behave correctly.

The whole point of them showing that example is that they say they catch this case and bring it to the attention of the programmer:

  If the original C program further relies on x, our translation will error out, and will ask the
  programmer to fix their source code. This is another area where we adopt a “semi-active” approach
  to verification, and declare that some patterns are poor enough, even for C, that they ought to be
  touched up before the translation takes place.

replies(1): >>42484570 #

3. safercplusplus ◴[22 Dec 24 05:27 UTC] No.42484570[source]▶

>>42481336 #

Thanks for clarifying. The issue is what code would be rejected for auto-translation, not the correctness of an "accepted" translation (as my comment may have implied).

The point of noting that the example translation quietly does the wrong thing is that this is the reason it would have to be ("unconditionally") rejected.

While the paper does suggest that their example translation would be rejected:

> If the original C program further relies on x, our translation will error out

note that precisely determining whether or not the program "further relies on x" statically (at compile/translation-time) is, in general, undecidable, in the same way the Halting Problem is. (I.e. no analysis can answer it exactly for every program.) So they would presumably have to be conservative and reject any cases where they cannot prove that the program does not "further rely on x". So it's notable that they choose to use a (provisional) translation that has to be rejected in a significant set of false positive cases.

And at least on initial consideration, it seems to me that an alternative translation could have, for example, used RefCell<>s or whatever and avoided the possibility of "quietly doing the wrong thing". (And thus, depending on your/their requirements, avoided the need for unconditional rejection.) Now, one might be in a situation where they'd want to avoid the run-time overhead and/or potential unreliability of RefCell<>s, but even then it seems to me that their translation choice does not technically avoid either of those things. Their solution allocates on the heap, which has at least some theoretical run-time overhead, and could theoretically fail/panic.
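
As a rough sketch of what a RefCell-based translation of the same four lines could look like: this is not the paper's output, and the use of `Rc<RefCell<[u8; 1]>>` and both function names are assumptions made for illustration.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical alternative translation: the array lives in a shared
/// `Rc<RefCell<..>>`, so `y` aliases `x` instead of owning a copy.
fn shared_array_demo() -> (u8, u8) {
    let x: Rc<RefCell<[u8; 1]>> = Rc::new(RefCell::new([0; 1]));
    let y = Rc::clone(&x); // second handle to the same array, no data copy
    y.borrow_mut()[0] = 1; // a write through `y` ...
    let x0 = x.borrow()[0]; // ... is visible through `x`
    let y0 = y.borrow()[0];
    (x0, y0)
}

/// The "potential unreliability": overlapping borrows are only detected
/// at run time.
fn overlapping_borrow_is_rejected() -> bool {
    let x = Rc::new(RefCell::new([0u8; 1]));
    let y = Rc::clone(&x);
    let _reader = x.borrow(); // shared borrow is still alive here
    // `try_borrow_mut` reports the conflict; a plain `borrow_mut()`
    // would panic at this point instead.
    let rejected = y.try_borrow_mut().is_err();
    rejected
}

fn main() {
    // Matches the original C assertion: the write through `y` is seen via `x`.
    assert_eq!(shared_array_demo(), (1, 1));
    assert!(overlapping_borrow_is_rejected());
    println!("aliasing preserved; conflicting borrow detected at run time");
}
```

This keeps the aliasing of the C original, at the price of a heap allocation for the `Rc`, a reference count, and a borrow check on every access.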

Now I'm not concluding here that their choice is not the right one for their undertaking. I'm just suggesting that choosing a (provisional) translation that has to be rejected with significant false positives (because it might quietly do the wrong thing) is at least initially notable. And that there are other solutions out there that demonstrate translation of C to a (high-performance, deterministic) memory-safe language/dialect that don't have the same limitations.
