Compiling C to Safe Rust, Formalized

Indeed, yes. Someone tried converting the C version of OpenJPEG to low-level unsafe Rust using c2rust. OpenJPEG was known to segfault on a test case. I tried that test case on the Rust version. It segfaulted in the equivalent place in the Rust code.

At least it's compatible. But that approach is a dead end. To make any progress, translation must recognize the common idioms of the source language and upgrade them to the idiomatic forms of the target language. Compiling C straight into Rust generates awful Rust, full of calls to functions that do unsafe C-type pointer manipulation.

The big upgrading problems mostly involve pointers. The most promising result in this paper is that they figured out how to convert C pointer arithmetic into Rust slices. Slices can do most of the things C pointer arithmetic can do, and now someone automated the translation. Pointer arithmetic that can't be translated has to be looked at with deep suspicion.
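To make the pointer-to-slice upgrade concrete, here is a rough sketch of the two styles side by side. The C function and all the Rust names are made up for illustration; this is not output from c2rust or from the paper's tool, just the general shape of the difference.

```rust
// Hypothetical C original, shown for reference:
//   int sum(const int *p, int n) {
//       int s = 0;
//       for (const int *end = p + n; p < end; p++) s += *p;
//       return s;
//   }

/// Literal translation: the same pointer walk, with the same lack of checking.
/// The caller must guarantee that `p` points to at least `n` valid i32 values.
unsafe fn sum_raw(mut p: *const i32, n: usize) -> i32 {
    let mut s = 0;
    unsafe {
        let end = p.add(n);
        while p < end {
            s += *p;
            p = p.add(1);
        }
    }
    s
}

/// Upgraded translation: pointer plus length becomes one slice.
fn sum_slice(data: &[i32]) -> i32 {
    let mut s = 0;
    for &x in data {
        s += x;
    }
    s
}

/// `p + k` in C becomes re-slicing. If `k` is past the end, this panics
/// instead of producing a pointer into unrelated memory.
fn skip(data: &[i32], k: usize) -> &[i32] {
    &data[k..]
}
```

The slice versions carry their length with them, so the caller can't pass a mismatched `n`.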

A useful way to think about this is that raw pointers in C which point to arrays implicitly have a length associated with them. That length is not visible in C source code, but exists somewhere, as a function of the program state. It might be a constant. It might be the size requested back at a "malloc" call. It might be a parameter to a function. It's usually not too hard for maintenance programmers to find array lengths.
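As a rough sketch of where that hidden length ends up once it is made explicit, here are the three cases above in Rust. The C declarations in the comments and all the names are hypothetical, chosen only to illustrate the mapping.

```rust
// Case 1: the length is a constant.
// C: `int table[16];`  ->  a fixed-size array whose type carries the length.
const TABLE_LEN: usize = 16;

fn make_table() -> [i32; TABLE_LEN] {
    [0; TABLE_LEN]
}

// Case 2: the length is the size requested at a malloc call.
// C: `int *buf = malloc(n * sizeof(int));`  ->  a Vec that remembers `n`.
// Note: this zero-fills, so it behaves more like calloc than malloc.
fn make_buffer(n: usize) -> Vec<i32> {
    vec![0; n]
}

// Case 3: the length is a function parameter.
// C: `void fill(int *dst, int n, int v)`  ->  `dst` and `n` merge into one slice.
fn fill(dst: &mut [i32], v: i32) {
    for x in dst.iter_mut() {
        *x = v;
    }
}
```

In each case the length the maintenance programmer would have had to hunt for is now part of the value's type or representation.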

That could be an LLM kind of problem. Ask an LLM, "Examine this code. What is the length of array foo?" Then use that to guide translation to Rust by a non-LLM translator. If the LLM is wrong, the resulting Rust will get subscript errors or have an oversize array, but will not be unsafe. Array size info idioms are stylized enough in C that it should be possible to get it right most of the time. Especially since LLMs can read comments.
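A small sketch of why a wrong guess is survivable, assuming the translator allocates storage from a guessed length. `guessed_len`, `translate_buffer`, and `store` are hypothetical names for illustration, not part of any real tool.

```rust
/// Allocate the translated array using a length guess supplied from outside
/// (for example by an LLM). The guess may be wrong.
fn translate_buffer(guessed_len: usize) -> Vec<u8> {
    vec![0; guessed_len]
}

/// Checked equivalent of the C statement `buf[i] = v;`.
/// A plain `buf[i] = v` in Rust would panic on a bad index; here the
/// failure is returned as an error so it can be inspected.
fn store(buf: &mut [u8], i: usize, v: u8) -> Result<(), String> {
    let len = buf.len();
    match buf.get_mut(i) {
        Some(slot) => {
            *slot = v;
            Ok(())
        }
        None => Err(format!("index {} out of bounds for length {}", i, len)),
    }
}
```

If the guess is too small, the write at a valid C index fails with a subscript error. If the guess is too large, the write succeeds and some memory is wasted. Neither case scribbles on unrelated memory.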