361 points by mseri | 17 comments
1. Y_Y ◴[] No.46002975[source]
I asked it if giraffes were kosher to eat and it told me:

> Giraffes are not kosher because they do not chew their cud, even though they have split hooves. Both requirements must be satisfied for an animal to be permissible.

HN will have removed the extraneous emojis.

This is at odds with my interpretation of giraffe anatomy and behaviour and of Talmudic law.

Luckily old sycophant GPT5.1 agrees with me:

> Yes. They have split hooves and chew cud, so they meet the anatomical criteria. Ritual slaughter is technically feasible though impractical.

replies(3): >>46004171 #>>46005088 #>>46006063 #
2. embedding-shape ◴[] No.46004171[source]
How many times did you retry (so it's not just down to chance), and what were the parameters, specifically temperature and top_p?
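
For reference, those knobs are just request parameters on whatever endpoint you're hitting. A minimal sketch, assuming an OpenAI-compatible local server (the URL, model name, and values here are placeholders, not what OP used):

    import requests

    # Hypothetical OpenAI-compatible endpoint (llama.cpp server, vLLM, etc.).
    URL = "http://localhost:8080/v1/chat/completions"

    payload = {
        "model": "local-32b",  # placeholder model name
        "messages": [{"role": "user", "content": "Are giraffes kosher to eat?"}],
        "temperature": 0.7,    # per-token randomness
        "top_p": 0.9,          # nucleus sampling cutoff
    }
    reply = requests.post(URL, json=payload, timeout=120).json()
    print(reply["choices"][0]["message"]["content"])
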
replies(2): >>46005252 #>>46005308 #
3. mistrial9 ◴[] No.46005088[source]
Due to reforms around the first centuries of the Common Era, trivia questions put to certain tribal priests are no longer a litmus test for acceptable public goods in the marketplace.
4. latexr ◴[] No.46005252[source]
> How many times did you retry (so it's not just up to chance)

If you don't know the answer to a question, retrying multiple times only serves to amplify your bias; you have no basis to know whether the answer is correct.

replies(3): >>46005264 #>>46005329 #>>46005903 #
5. embedding-shape ◴[] No.46005264{3}[source]
Well, it seems in this case the parent did know the answer, so I'm not sure what your point is.

I'm asking for the sake of reproducibility, and to clarify whether they ran the text-by-chance generator more than once or just hit one bad case out of ten on a single try.

replies(1): >>46006117 #
6. Y_Y ◴[] No.46005308[source]
Sorry, I lost the chat, but it was default parameters on the 32B model. It cited some books saying that giraffes have three stomachs and don't ruminate, but after I pressed on these points it admitted it had left out the fourth stomach because it was small, and claimed the rumination wasn't "true" in some sense.
7. zamadatix ◴[] No.46005329{3}[source]
If you retry until it gives the answer you want, then yes, that only amplifies your bias. If you retry and see how often it agrees with itself, it shows whether there is any real confidence in the answer at all.

It's a bit of a crutch for the fact that LLMs can't just say "I'm not sure", because doing so goes against how they are rewarded in training.
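
A minimal sketch of that kind of self-agreement check, with a random stand-in for the model call so the snippet runs on its own (swap in a real call to whatever endpoint you're testing):

    import random
    from collections import Counter

    def ask_model(prompt: str) -> str:
        # Stand-in so the sketch runs by itself; replace with a real model call.
        return random.choice(["yes", "no"])

    def self_agreement(prompt: str, n: int = 10) -> Counter:
        # Ask the same question n times and tally the answers.
        # A lopsided tally (e.g. 9/1) suggests some internal consistency;
        # a near-even split suggests the answer is basically a coin flip.
        return Counter(ask_model(prompt) for _ in range(n))

    print(self_agreement("Are giraffes kosher to eat?"))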

replies(1): >>46005793 #
8. oivey ◴[] No.46005793{4}[source]
You're still likely to just amplify your own bias unless you apply some basic experimental controls, like preselecting how many retries you're going to do and how many agreeing trials count as statistically significant.
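
For a sense of scale, here is a rough back-of-the-envelope check, assuming a two-answer question and a coin-flip null hypothesis (the 8-of-10 figure is just an example):

    from math import comb

    def p_value_at_least(k: int, n: int, p: float = 0.5) -> float:
        # One-sided binomial test: probability of seeing at least k matching
        # answers out of n if the model were really just flipping a coin.
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    print(round(p_value_at_least(8, 10), 3))  # ~0.055, so 8/10 agreement is still borderline
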
9. observationist ◴[] No.46005903{3}[source]
https://en.wikipedia.org/wiki/Monte_Carlo_method

If a question is out of distribution, you're more likely to get a chaotic spread of answers, whereas if it's merely not known well you'll get something closer to a normal distribution, flatter the less well modeled the concept is.

There are all sorts of techniques and methods you can use to get a probabilistically valid assessment of outputs from LLMs, they're just expensive and/or tedious.

Repeated sampling gives you the basis for a Bayesian model of the output, and you can even work out rigorous numbers specific to the model and your prompt framework: sample things you know the model has in distribution and compare those curves against your test case, which gives you a measure of relative certainty.
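
A rough sketch of that calibration idea, again with a random stand-in for the model call and made-up calibration prompts (none of this is specific to the models discussed here):

    import random
    from collections import Counter

    def ask_model(prompt: str) -> str:
        # Stand-in so the sketch runs by itself; replace with a real model call.
        return random.choice(["yes", "no"])

    def agreement_rate(prompt: str, n: int = 20) -> float:
        # Fraction of samples that match the most common answer; 1.0 = unanimous.
        counts = Counter(ask_model(prompt) for _ in range(n))
        return max(counts.values()) / n

    # Questions the model is assumed to have solidly in distribution (illustrative).
    calibration = ["Is water wet?", "Is 2 + 2 equal to 4?"]
    baseline = sum(agreement_rate(q) for q in calibration) / len(calibration)

    # Compare the test prompt's self-agreement against that baseline.
    print(agreement_rate("Are giraffes kosher to eat?"), "vs baseline", baseline)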

replies(1): >>46006084 #
10. Flere-Imsaho ◴[] No.46006063[source]
Models should not have memorised whether animals are kosher to eat or not. This is information that should be retrieved from RAG or whatever.

If a model responded with "I don't know the answer to that", then that would be far more useful. Is anyone actually working on models that are trained to admit when they don't know an answer?

replies(4): >>46006191 #>>46009037 #>>46009499 #>>46010963 #
11. latexr ◴[] No.46006084{4}[source]
Sounds like just not using an LLM would take considerably less effort and waste fewer resources.
replies(1): >>46006204 #
12. latexr ◴[] No.46006117{4}[source]
> so I'm not sure what your point is.

That your suggestion would not correspond to real use by real regular people. OP posted the message as noteworthy because they knew it was wrong. Anyone who didn’t and trusts LLMs blindly (which is not a small number) would’ve left it at that and gone about their day with wrong information.

replies(1): >>46006285 #
13. spmurrayzzz ◴[] No.46006191[source]
There is an older paper on something related to this [1], where the model outputs reflection tokens that trigger either retrieval or critique steps. The idea is that the model recognizes it needs to fetch some grounding after generating factual content, then reviews what it previously generated against the retrieved grounding.

The problem with this approach is that it does not generalize well at all out of distribution. I'm not aware of any follow up to this, but I do think it's an interesting area of research nonetheless.

[1] https://arxiv.org/abs/2310.11511
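
The general shape of that loop, as a loose sketch: the marker string and helper functions below are illustrative stand-ins, not the paper's actual reflection-token vocabulary or training setup.

    # Loose sketch of a reflect-then-retrieve loop; everything here is a stand-in.

    def generate(prompt: str) -> str:
        # Stand-in for the language model; a trained model would emit special
        # reflection tokens when it wants grounding for a factual claim.
        return "Giraffes chew cud and have split hooves. [RETRIEVE]"

    def retrieve(query: str) -> list[str]:
        # Stand-in for a retriever over some corpus.
        return ["Giraffes are ruminants with fully cloven hooves."]

    def critique(draft: str, docs: list[str]) -> str:
        # Stand-in for the critique step that revises the draft against the
        # retrieved grounding.
        return draft.replace("[RETRIEVE]", f"(checked against: {docs[0]})")

    def answer(prompt: str) -> str:
        draft = generate(prompt)
        if "[RETRIEVE]" in draft:  # reflection marker triggers retrieval + critique
            draft = critique(draft, retrieve(prompt))
        return draft

    print(answer("Are giraffes kosher to eat?"))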

14. dicknuckle ◴[] No.46006204{5}[source]
It's a way to validate the LLM output in a test scenario.
15. embedding-shape ◴[] No.46006285{5}[source]
> That your suggestion would not correspond to real use by real regular people.

Which wasn't the point either. The point was just to ask "did you run the prompt once, or many times?", since that obviously affects how seriously you can take whatever outcome you get.

16. anonym29 ◴[] No.46009037[source]
>Models should not have memorised whether animals are kosher to eat or not.

Agreed. Humans do not perform rote memorization for all possibilities of rules-based classifications like "kosher or not kosher".

>This is information that should be retrieved from RAG or whatever.

Firm disagreement here. An intelligent model should either know (general model) or retrieve via RAG (non-general model) the criteria for evaluating whether an animal is kosher, and then infer, from its knowledge of the animal (again either general knowledge or RAG retrieval), whether the animal matches those criteria.
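
As a toy illustration of that split between rules, animal facts, and the inference on top (the data here is hand-written, not pulled from any real knowledge base):

    from dataclasses import dataclass

    @dataclass
    class Animal:
        name: str
        has_split_hooves: bool
        chews_cud: bool

    def is_kosher_land_animal(a: Animal) -> bool:
        # Both criteria must hold; this is the inference step the model should
        # perform once it has the facts, however it obtained them.
        return a.has_split_hooves and a.chews_cud

    # Facts either recalled by a general model or fetched by a retriever.
    giraffe = Animal("giraffe", has_split_hooves=True, chews_cud=True)
    pig = Animal("pig", has_split_hooves=True, chews_cud=False)

    print(is_kosher_land_animal(giraffe))  # True
    print(is_kosher_land_animal(pig))      # False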

>If a model responded with "I don't know the answer to that", then that would be far more useful.

Again, firm disagreement here. "I don't know" is not a useful answer to a question that can be easily answered by cross-referencing easily-verifiable animal properties against the classification rules. At the very least, an intelligent model should explain which piece of information it is missing (properties of the animal in question OR the details of the classification rules), rather than returning a zero-value response.

To wit: if you were interviewing a developer candidate and asked them whether Python supports functions, methods, both, or neither, would "I don't know" ever be an appropriate answer, even if the candidate genuinely didn't know off the top of their head? Of course not - you'd want a candidate who didn't know to say something more along the lines of "I don't know, but here's what I would do to figure out the answer for you".

A plain and simple "I don't know" adds zero value to the conversation. While it doesn't necessarily add negative value the way a confidently incorrect answer does, the goal for intelligent models should never be to produce zero value; it should be to produce nonzero positive value, even when they lack required information.

17. robrenaud ◴[] No.46009499[source]
Benchmarks need to change.

Say there is a four-choice question. Your best guess is that the answer is B, with about a 35% chance of being right. If you are graded on the fraction of questions answered correctly, the optimization pressure is simply to answer B.

If a model could get half credit for answering "I don't know", we'd have a lot more models saying that when they are not confident.
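
The incentive is easy to see in expected score per question, using the 35% figure above and a hypothetical half-credit rule:

    # Expected score per question under the two grading schemes (illustrative numbers).
    p_correct = 0.35                                       # confidence that "B" is right

    guess_score = p_correct * 1.0 + (1 - p_correct) * 0.0  # always guess B
    idk_score = 0.5                                        # half credit for abstaining

    print(guess_score, idk_score)  # 0.35 vs 0.5: abstaining wins whenever p_correct < 0.5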