    137 points ignoramous | 16 comments
    1. kcorbitt ◴[] No.41911422[source]
    I saw this when it was making the rounds on X a few days ago. Fair warning: it seems like at least some sections are AI-generated, and there isn't much insight to be gained from reading the actual sections compared to, e.g., reading the relevant category pages on Huggingface.
    replies(4): >>41911596 #>>41911605 #>>41912253 #>>41912270 #
    2. daghamm ◴[] No.41911596[source]
    I would not say that, as long as it is a good summary there is a value in having everything in the same document.

    Obviously they should have stated that this is partially generated, but at least they are dog fooding it :)

    3. YetAnotherNick ◴[] No.41911605[source]
    Not only does it seem to be AI-generated, it seems these guys don't even know the best practices or what actually works. E.g. it contains an archaic comparison of optimizers and their pros and cons, but for LLMs no optimizer other than Adam and new ones like Lion works.
    replies(1): >>41912039 #
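    As a hedged illustration of the optimizer point above: a minimal PyTorch sketch of how AdamW and Lion are typically set up, assuming the community lion-pytorch package; the hyperparameters are placeholders, not recommendations from the paper or the commenter.

      # Illustrative sketch only; lion-pytorch and all hyperparameters are assumptions.
      import torch
      from torch import nn
      from lion_pytorch import Lion  # pip install lion-pytorch

      model = nn.Linear(4096, 4096)  # stand-in for an LLM's parameters

      # The long-standing default for LLM training and fine-tuning:
      adamw = torch.optim.AdamW(model.parameters(), lr=1e-4,
                                betas=(0.9, 0.95), weight_decay=0.1)

      # Lion keeps a single momentum buffer per parameter instead of Adam's two;
      # the usual (hedged) rule of thumb is a smaller lr and larger weight decay.
      lion = Lion(model.parameters(), lr=1e-5, weight_decay=1.0)
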
    4. anothername12 ◴[] No.41911748[source]
    Well, it sucks that we’re at the “best practices” phase already
    replies(3): >>41911886 #>>41912332 #>>41912970 #
    5. ◴[] No.41911804[source]
    6. p1esk ◴[] No.41911886[source]
    It sucks that we’re still at the “best practices” phase. We’ve been in this phase for the last three decades [1], and I really hope we enter the “good theory” phase soon.

    [1] https://cseweb.ucsd.edu/classes/wi08/cse253/Handouts/lecun-9...

    7. abc-1 ◴[] No.41912039{3}[source]
    Is there a paper on this? Why do no other optimizers give good results? Adam requires insane amounts of memory so alternatives would be welcome.
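    On the memory point: Adam/AdamW keeps two fp32 moment buffers (m and v) per parameter, so for a 7B-parameter model the optimizer state alone is roughly 7e9 * 2 * 4 bytes ≈ 56 GB, on top of weights and gradients. A minimal, hedged sketch of one commonly cited mitigation, 8-bit Adam from bitsandbytes (still Adam, just with quantized state; the package and settings below are assumptions, not taken from the thread):

      # Drop-in 8-bit AdamW from bitsandbytes; stores optimizer state in 8 bits,
      # roughly a 4x reduction in optimizer-state memory versus fp32 Adam.
      import torch
      from torch import nn
      import bitsandbytes as bnb

      model = nn.Linear(4096, 4096)  # stand-in for an LLM's parameters
      optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4, weight_decay=0.1)
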
    8. worstspotgain ◴[] No.41912253[source]
    Glancing at the authors' names, it's possible that none of them are native English speakers. Any chance that the sections you're referring to were just AI-polished rather than AI-generated?
    replies(1): >>41912649 #
    9. danielhanchen ◴[] No.41912270[source]
    I took a skim through it in the morning - I like the LoRA Learns Less and Forgets Less paper more https://openreview.net/forum?id=aloEru2qCG - it has much more signal in a few pages - also the original QLoRA paper from Dettmers https://arxiv.org/abs/2305.14314 has so many more important morsels.

    But all in all, the review is a reasonable "manual", I guess. I would have liked more instructive, comprehensive practical examples, and maybe more mention of other OSS packages for finetuning :))
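    In the spirit of the practical examples the comment wishes for, a minimal, hedged LoRA fine-tuning skeleton with Hugging Face PEFT; the checkpoint name, target modules and hyperparameters are placeholder assumptions, not settings from the review or the linked papers.

      # Minimal LoRA setup with Hugging Face PEFT; all values below are placeholders.
      from transformers import AutoModelForCausalLM, AutoTokenizer
      from peft import LoraConfig, get_peft_model

      base = "meta-llama/Llama-2-7b-hf"   # any causal-LM checkpoint
      tokenizer = AutoTokenizer.from_pretrained(base)
      model = AutoModelForCausalLM.from_pretrained(base)

      lora_cfg = LoraConfig(
          r=16,                 # rank of the low-rank update matrices
          lora_alpha=32,        # scaling factor applied to the update
          lora_dropout=0.05,
          target_modules=["q_proj", "v_proj"],  # attention projections to adapt
          task_type="CAUSAL_LM",
      )
      model = get_peft_model(model, lora_cfg)
      model.print_trainable_parameters()  # typically well under 1% of the base model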

    10. kleiba ◴[] No.41912332[source]
    Why is that?
    11. qeternity ◴[] No.41912649{3}[source]
    No, this paper was edited yesterday. The original (you can verify on arxiv) contained this incredible section: "6.10 Optimised Routing and Pruning Operations (ORPO)"

    The actual ORPO paper is "Odds Ratio Preference Optimisation" and it has nothing to do with pruning. This goes way beyond native language preference.

    replies(1): >>41913061 #
    12. make3 ◴[] No.41912970[source]
    There's likely still an infinite amount of things to figure out; transformers haven't even been out for 10 years yet.
    13. raymond_goo ◴[] No.41913043[source]
    Ctrl-F: Unsloth --> no results == bad paper
    replies(1): >>41913074 #
    14. cubefox ◴[] No.41913061{4}[source]
    Wow, so significant parts of the paper could still be LLM confabulation.
    15. youoy ◴[] No.41913074[source]
    But you can find "delve"