
361 points by mseri
Y_Y No.46002975
I asked it if giraffes were kosher to eat and it told me:

> Giraffes are not kosher because they do not chew their cud, even though they have split hooves. Both requirements must be satisfied for an animal to be permissible.

HN will have removed the extraneous emojis.

This is at odds with my interpretation of giraffe anatomy and behaviour and of Talmudic law.

Luckily old sycophant GPT5.1 agrees with me:

> Yes. They have split hooves and chew cud, so they meet the anatomical criteria. Ritual slaughter is technically feasible though impractical.

replies(3): >>46004171 #>>46005088 #>>46006063 #
embedding-shape No.46004171
How many times did you retry (so it's not just up to chance), and what were the parameters, specifically temperature and top_p?
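
For reference, a minimal sketch of pinning those parameters across retries, assuming the official openai Python client; the model name, prompt, and values are illustrative, not what GP actually used:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Pin the sampling parameters explicitly, so repeated runs differ only
    # by sampling chance, not by hidden defaults; n=10 draws ten samples
    # from the same prompt in one call.
    resp = client.chat.completions.create(
        model="gpt-5.1",  # illustrative model name
        messages=[{"role": "user", "content": "Are giraffes kosher to eat?"}],
        temperature=1.0,
        top_p=1.0,
        n=10,
    )
    answers = [choice.message.content for choice in resp.choices]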
replies(2): >>46005252 #>>46005308 #
latexr No.46005252
> How many times did you retry (so it's not just up to chance)

If you don’t know the answer to a question, retrying multiple times only serves to amplify your bias; you have no basis for knowing whether the answer is correct.

replies(3): >>46005264 #>>46005329 #>>46005903 #
observationist No.46005903
https://en.wikipedia.org/wiki/Monte_Carlo_method

If a question is out of distribution, you're more likely to get a chaotic scatter around the answer, whereas if it's just not known well, you'll get something closer to a normal distribution, flatter the less well modeled the concept is.

There are all sorts of techniques you can use to get a probabilistically valid assessment of LLM outputs; they're just expensive and/or tedious.

Repeated sampling gives you the basis for a Bayesian model of the output. You can even work out rigorous numbers specific to the model and your prompt framework by sampling things you know the model has in distribution and comparing those curves against your test case, giving you a measure of relative certainty.
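
A minimal sketch of that comparison, assuming you already have repeated samples in hand; the answer lists below are made-up placeholders, not real model output, and in practice you'd first normalize each free-text answer to a canonical label:

    from collections import Counter
    from math import log2

    def empirical_entropy(samples):
        """Shannon entropy (bits) of the empirical answer distribution."""
        counts = Counter(samples)
        total = len(samples)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    # Placeholder samples, each normalized to a canonical label.
    in_distribution = ["kosher"] * 9 + ["not kosher"]       # known-good baseline prompt
    test_case = ["kosher"] * 4 + ["not kosher"] * 6         # prompt under test

    baseline_h = empirical_entropy(in_distribution)
    test_h = empirical_entropy(test_case)

    # A much flatter (higher-entropy) distribution on the test prompt than on
    # the baseline suggests the concept is poorly modeled or out of distribution.
    print(f"baseline: {baseline_h:.2f} bits, test: {test_h:.2f} bits")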

replies(1): >>46006084 #
latexr No.46006084
Sounds like just not using an LLM would take considerably less effort and waste fewer resources.
replies(1): >>46006204 #
dicknuckle No.46006204
It's a way to validate the LLM output in a test scenario.