
184 points hhs | 3 comments
youoy ◴[] No.41844306[source]
From the abstract:

> Proof assistants like Lean have revolutionized mathematical proof verification...

> Both the synthetic dataset and the model will be made available to facilitate further research in this promising field.

Please don't use "revolutionised" or "promising" in a scientific abstract... If it has revolutionised the field, tell me instead what the key change was that made that happen. If it is promising, tell me instead why.

Apart from that, the second sentence says:

> Although large language models (LLMs) show promise in mathematical reasoning, their advancement in formal theorem proving is hindered by a lack of training data.

Is it really? Or hindered by an unsuitable model architecture? Because you could perfectly well say that Lean itself is an AI capable of theorem proving... I feel that sentence should be the potential conclusion of the paper, not its first observation. Something like: "we used more training data and it works better; if we scale up theorem-proving training data, we get the performance we want, so a lack of training data really is what hinders the advancement of LLMs in formal theorem proving".

replies(2): >>41844619 #>>41847136 #
1. versteegen ◴[] No.41847136[source]
> their advancement in formal theorem proving is hindered by a lack of training data

'Formal' here means inputs to a proof verifier/proof assistant, and I think most people would accept it as a given that, by the extremely data-hungry standards of LLMs, there is a severe lack of such data, and that this is a severe hindrance to getting an LLM to generalise. The question is how they might scale with more training data.
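To make "such data" concrete (a toy example of mine, not from the paper): one formal training example is a statement/proof pair that a machine can verify, e.g. in Lean 4:

    -- Both statement and proof are machine-checkable text; data in this
    -- form is scarce compared to informal mathematical prose.
    theorem zero_add' (n : Nat) : 0 + n = n := by
      induction n with
      | zero => rfl
      | succ k ih => rw [Nat.add_succ, ih]

Informal proofs exist at web scale; proofs in this checkable form do not.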

replies(1): >>41848923 #
2. Loic ◴[] No.41848923[source]
Formal proving with a system based on probabilities (temperature, etc.) is, to me, an oxymoron. But I am not a mathematician.

Edit: Ok, they generate proofs with the LLM, then they validate them with the Lean 4 environment. I am not sure whether this is better than generating cooking recipes with an LLM and asking a cook if they make sense or not.
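A minimal sketch of that generate-then-check loop (ask_llm is a hypothetical stand-in for the model call; the only real assumption about tooling is that the lean command exits non-zero when a file fails to check):

    import subprocess
    import tempfile
    from typing import Optional

    def ask_llm(statement: str) -> str:
        """Hypothetical stand-in: sample a candidate tactic proof from a model."""
        raise NotImplementedError

    def lean_accepts(source: str) -> bool:
        """Hand a candidate Lean 4 file to the checker and return its verdict."""
        with tempfile.NamedTemporaryFile(suffix=".lean", mode="w", delete=False) as f:
            f.write(source)
            path = f.name
        # Exit status 0 means every proof in the file type-checks.
        return subprocess.run(["lean", path], capture_output=True).returncode == 0

    def prove(statement: str, attempts: int = 8) -> Optional[str]:
        """Sample candidate proofs; keep only ones the verifier accepts."""
        for _ in range(attempts):
            candidate = ask_llm(statement)
            if lean_accepts(f"theorem goal : {statement} := by\n{candidate}"):
                return candidate  # verified by Lean, not merely plausible
        return None

The candidate proofs are free-form model output; acceptance is decided entirely by the checker.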

replies(1): >>41850067 #
3. nrfulton ◴[] No.41850067[source]
> Formally proving with a system based on probabilities (temperature, etc.) is for me an oxymoron. But I am not a mathematician.

In the context of automated theorem proving, proof search and proof checking could not be more different.

Proof search is undecidable for most logics. For that reason, heuristics are commonly used to guide proof search. This has always been the case; heuristic proof search predates the use of LLMs for theorem proving by decades.
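As an illustration (a generic skeleton of mine, not any particular prover's code), heuristic search over proof states is often just best-first search, where only the scoring function is heuristic; an LLM is simply one more way to implement score or expand:

    import heapq
    from typing import Callable, Iterable, Optional

    def best_first_search(
        start,
        is_proved: Callable[[object], bool],
        expand: Callable[[object], Iterable[object]],   # e.g. apply tactics
        score: Callable[[object], float],               # hand-written or learned
        budget: int = 10_000,
    ) -> Optional[object]:
        # Entries are (priority, tie-breaker, state); the counter keeps
        # heapq from ever comparing two proof states directly.
        frontier = [(score(start), 0, start)]
        counter = 1
        while frontier and budget > 0:
            _, _, state = heapq.heappop(frontier)
            budget -= 1
            if is_proved(state):
                return state
            for nxt in expand(state):
                heapq.heappush(frontier, (score(nxt), counter, nxt))
                counter += 1
        return None  # budget exhausted; says nothing about provability

A better heuristic changes where the search goes, never whether an accepted proof is valid -- that part is the checker's job.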

Proof checking, on the other hand, is not just decidable but usually quite tractable and cheap. Therefore, proof checkers are typically deterministic and fast. Ideally, the checker is also implemented in a small amount of relatively simple and heavily audited code.
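Concretely (a toy example of mine): in Lean, what the kernel ultimately checks is a plain term whose type is the theorem statement, which is part of why checking stays cheap and deterministic:

    -- Tactic scripts are elaborated away before the kernel runs; it just
    -- type-checks this term against the stated proposition.
    theorem and_swap (p q : Prop) : p ∧ q → q ∧ p :=
      fun h => ⟨h.right, h.left⟩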

For a high-level description of how provers are typically architected, see https://www.pls-lab.org/en/de_Bruijn_criterion (or https://lawrencecpaulson.github.io/2022/01/05/LCF.html which is linked to at the bottom of the above page and goes into a bit more historical detail and nuance).

> I am not sure if this is better than generating cooking recipes with a LLM and asking a cook if they make sense or not.

Conceptually it's not at all similar to asking a cook if a recipe is "correct". For Lean proofs, "correct" has an entirely precise meaning. The "cook" (i.e., the proof checker) is basically free (much less than $0.01/hr). The cook (proof checker) takes at most a few seconds to read and respond to any particular recipe. And the cook (i.e., the checker) never makes mistakes.
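For instance (toy statements of mine), the verdict is binary and exact:

    example : 2 + 2 = 4 := rfl     -- accepted: checked in well under a second
    -- example : 2 + 2 = 5 := rfl  -- would be rejected, every single time;
    --                                left commented out so the file compiles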

The relationship between asking a cook if a recipe is good and asking Lean to check a proof is approximately the difference between asking a mediocre human computer to compute a derivative and asking a modern mechanical computer to do the same (using a classical algorithm, not an LLM; no one is using LLMs to check proofs -- there's really no point in doing so, because the checking problem is easy).