
Use Prolog to improve LLM's reasoning

(shchegrikovich.substack.com)
379 points shchegrikovich | 1 comment
YeGoblynQueenne ◴[] No.41874200[source]
That's not going to work. Garbage in - Garbage out is success-set equivalent to Garbage in - Prolog out.

Garbage is garbage and failure to reason is failure to reason no matter the language. If your LLM can't translate your problem to a Prolog program that solves your problem, Prolog can't solve your problem.
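To make that concrete, here's a toy sketch (hypothetical, not taken from any paper): suppose the intended constraint is "after 12 and before 17" and the LLM drops half of it in translation. The resulting Prolog is perfectly valid and perfectly wrong:

    % Hypothetical mistranslation: the spec says a valid meeting hour
    % is after 12 AND before 17, but the generated program lost the
    % upper bound.
    valid_slot(H) :- H > 12.
    % intended: valid_slot(H) :- H > 12, H < 17.

    ?- valid_slot(20).
    true.    % Prolog dutifully proves the garbage it was given

Prolog did its job; the garbage was in the translation.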

Philpax ◴[] No.41874322[source]
This is a shallow critique that does not engage with the core idea. Specifying the problem is not the same as solving the problem.
YeGoblynQueenne ◴[] No.41874902[source]
I've programmed in Prolog for ~13 years and my PhD thesis is in machine learning of Prolog programs. How deep would you like me to go?
Philpax ◴[] No.41875047[source]
As deep as is required to actually make your argument!
YeGoblynQueenne ◴[] No.41878138[source]
You'll have to be more specific than that. For me what I point out is obvious: Prolog is not magick. Your program won't magickally reason if you write it in Prolog, much less reason correctly. If an LLM translates a problem into the wrong Prolog program, Prolog won't magickally turn it into a correct program. And that's just rephrasing what I said in my comment above. There's really not much more to say.

Here's just one more observation: the problems where translating reasoning to Prolog will work best are the ones with lots of example Prolog code on the web, e.g. wolf-goat-cabbage puzzles and the like. With problems like that it is much easier for an LLM to generate a correct translation of the problem to Prolog and get a correct solution, simply because there are plenty of examples to imitate. But choose a problem that's rarely attacked with Prolog code, say some mathematical problem that comes up in nuclear physics, and an LLM will be much more likely to generate garbage Prolog, whereas e.g. Fortran would be a better target language. From what I can see, the papers linked in the article above concentrate on the Prolog-friendly kind of problem, logical puzzles and the like. That smells like cherry-picking to me, or just good old confirmation bias.
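For contrast, this is roughly what those web examples look like; a sketch of the standard wolf-goat-cabbage formulation, of which near-identical versions abound:

    % State: state(Farmer, Wolf, Goat, Cabbage), each on west or east.
    opposite(west, east).
    opposite(east, west).

    % The goat can't be left with the wolf or the cabbage unless the
    % farmer is on the same bank.
    safe(state(F, W, G, C)) :-
        (G \= W ; G = F),
        (G \= C ; G = F).

    % The farmer crosses alone or with exactly one passenger.
    move(state(F, W, G, C), state(F1, W, G, C)) :- opposite(F, F1).
    move(state(F, F, G, C), state(F1, F1, G, C)) :- opposite(F, F1).
    move(state(F, W, F, C), state(F1, W, F1, C)) :- opposite(F, F1).
    move(state(F, W, G, F), state(F1, W, G, F1)) :- opposite(F, F1).

    % Depth-first search over safe, unvisited states.
    solve(state(east, east, east, east), _, []).
    solve(S, Visited, [S1 | Path]) :-
        move(S, S1),
        safe(S1),
        \+ member(S1, Visited),
        solve(S1, [S1 | Visited], Path).

    % ?- solve(state(west, west, west, west), [], Path).

An LLM that has seen a hundred variants of this can reproduce it; that says nothing about its ability to formalise a problem it has never seen formalised.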

Again, Prolog is not magick. The article above and the papers it links to seem to take this attitude of "just add Prolog" and that will make LLMs suddenly magickally reason with fairy dust on top. Ain't gonna happen.

tannhaeuser ◴[] No.41886337[source]
Of course Prolog is no magic, but I'm sure you know the arguments in favor of having LLMs generate Prolog are based on the observation that prompting or otherwise getting an LLM to perform chain-of-thought reasoning yields demonstrable experimental improvements, with [1] the canonical paper; using Prolog, with its unique characteristics and roots in NLP and logic, is an extension of that idea, termed program-as-thought. OpenAI's latest o1 model makes heavy use of CoT internally before it returns an answer to the user.
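(For anyone who hasn't read [1]: the idea is to prepend worked examples that spell out their intermediate steps, so the model imitates the step-by-step format. Roughly the paper's well-known arithmetic exemplar:

    Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
       Each can has 3 tennis balls. How many tennis balls does he
       have now?
    A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
       6 tennis balls. 5 + 6 = 11. The answer is 11.

)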

[1]: https://arxiv.org/abs/2201.11903

YeGoblynQueenne ◴[] No.41888284[source]
My thoughts on CoT and the extent to which it "elicits reasoning" align almost perfectly with the critique by Rao Kambhampati and his students:

https://arxiv.org/abs/2405.04776

Their argument is that CoT can only improve performance of LLMs in reasoning tasks when the prompter already knows the answer and can somehow embed it in their prompt. The paper I link above supports this intuition with empirical results, summarised in the abstract as follows:

While our problems are very simple, we only find meaningful performance improvements from chain of thought prompts when those prompts are exceedingly specific to their problem class, and that those improvements quickly deteriorate as the size n of the query-specified stack grows past the size of stacks shown in the examples. We also create scalable variants of three domains commonly studied in previous CoT papers and demonstrate the existence of similar failure modes. Our results hint that, contrary to previous claims in the literature, CoT's performance improvements do not stem from the model learning general algorithmic procedures via demonstrations but depend on carefully engineering highly problem specific prompts.

And if that's the case, and I need to know the solution to a reasoning task before I can prompt an LLM to solve it, then why do I need to prompt an LLM to solve it? Or, if I'm just asking an LLM to generate Prolog code I could write myself, then what's the point of that? As I argue in another comment, an LLM will only do well at generating correct code if it has seen a sufficient number of examples of the code I'm asking it to generate anyway. So I don't think that CoT, used to generate Prolog, really adds anything to my ability to solve problems by coding in Prolog.

I have no idea how o1 works internally and I prefer not to speculate but it doesn't seem to be some silver bullet that will make LLMs capable of reasoning.