Multi-Token Attention (arxiv.org)
152 points by fzliu | 1 comment | 02 Apr 25 22:20 UTC
bigdict | 02 Apr 25 22:56 UTC | No. 43562732
>>43562384 (OP)
Sure, you can get better model performance by throwing more compute at the problem in different places. Does it improve perf on an isoFLOP basis?
replies(4): >>43562773 >>43563245 >>43563544 >>43564050
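
(A minimal sketch of the isoFLOP accounting being asked about, assuming the standard C ≈ 6·N·D approximation for dense-transformer training FLOPs; the 5% per-token overhead for the attention variant is a hypothetical placeholder, not a number from the paper.)

    # IsoFLOP comparison sketch: hold total training compute fixed and
    # trade training tokens against per-token architecture overhead.
    # C ~= 6 * N * D (N = parameters, D = tokens) is the usual
    # dense-transformer approximation; overhead=0.05 is hypothetical.

    def train_flops(n_params, n_tokens, overhead=0.0):
        """Approximate training FLOPs, inflated by a per-token overhead factor."""
        return 6.0 * n_params * n_tokens * (1.0 + overhead)

    def iso_flop_tokens(budget, n_params, overhead=0.0):
        """Tokens affordable under a fixed FLOP budget for a given architecture."""
        return budget / (6.0 * n_params * (1.0 + overhead))

    N = 1e9                          # 1B-parameter model
    budget = train_flops(N, 20e9)    # budget set by a baseline trained on 20B tokens

    print(iso_flop_tokens(budget, N) / 1e9)                  # 20.00 (B tokens)
    print(iso_flop_tokens(budget, N, overhead=0.05) / 1e9)   # ~19.05 (B tokens)

    # A fair isoFLOP comparison evaluates both models at this shared
    # budget: the heavier variant must win while seeing ~5% fewer tokens.

The point is that "better at equal total compute" is a stronger claim than "better at equal parameter count or token count", which is what the question above is probing.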
eightysixfour | 02 Apr 25 23:58 UTC | No. 43563245
>>43562732
That's... not always a given for SOTA-sized models. When the ROI on more training stops, it is nice to have alternatives, whether that is RL-tuned reasoning models or alternative architectures that improve specific areas of weakness.