Multi-Token Attention (arxiv.org)
152 points by fzliu | 1 comment | 02 Apr 25 22:20 UTC
bigdict | 02 Apr 25 22:56 UTC | No. 43562732
>>43562384 (OP)
Sure, you can get better model performance by throwing more compute at the problem in different places. Does it improve perf on an isoFLOP basis?
replies(4): >>43562773 >>43563245 >>43563544 >>43564050
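
(A minimal sketch of the isoFLOP accounting being asked about, assuming the standard C ≈ 6·N·D approximation for dense-transformer training FLOPs; the 5% per-token overhead for the attention variant is a hypothetical placeholder, not a number from the paper.)

    # IsoFLOP comparison sketch: hold total training compute fixed and
    # trade training tokens against per-token architecture overhead.
    # C ~= 6 * N * D (N = parameters, D = tokens) is the usual
    # dense-transformer approximation; overhead=0.05 is hypothetical.

    def train_flops(n_params, n_tokens, overhead=0.0):
        """Approximate training FLOPs, inflated by a per-token overhead factor."""
        return 6.0 * n_params * n_tokens * (1.0 + overhead)

    def iso_flop_tokens(budget, n_params, overhead=0.0):
        """Tokens affordable under a fixed FLOP budget for a given architecture."""
        return budget / (6.0 * n_params * (1.0 + overhead))

    N = 1e9                          # 1B-parameter model
    budget = train_flops(N, 20e9)    # budget set by a baseline trained on 20B tokens

    print(iso_flop_tokens(budget, N) / 1e9)                  # 20.00 (B tokens)
    print(iso_flop_tokens(budget, N, overhead=0.05) / 1e9)   # ~19.05 (B tokens)

    # A fair isoFLOP comparison evaluates both models at this shared
    # budget: the heavier variant must win while seeing ~5% fewer tokens.

The point is that "better at equal total compute" is a stronger claim than "better at equal parameter count or token count", which is what the question above is probing.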
eightysixfour | 02 Apr 25 23:58 UTC | No. 43563245
>>43562732
That's... not always a given for SOTA-sized models. When the ROI on more training stops, it is nice to have alternatives, whether that is RL-tuned reasoning models or alternative architectures that improve specific areas of weakness.