Multi-Token Attention
(arxiv.org)
152 points by fzliu | 1 comment | 02 Apr 25 22:20 UTC
bigdict | 02 Apr 25 22:56 UTC | No. 43562732
>>43562384 (OP)
Sure, you can get better model performance by throwing more compute at the problem in different places. But does it improve performance on an isoflop basis?
replies(4): >>43562773 >>43563245 >>43563544 >>43564050
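To make the isoflop question concrete: the added mechanism (the paper applies convolutions over attention scores) costs extra FLOPs per head, and a fair comparison gives the baseline the same total compute budget. A rough back-of-the-envelope sketch, using standard FLOP estimates for attention; the kernel size and dimensions here are illustrative assumptions, not figures from the paper:

```python
# Hedged sketch of what "comparing on an isoflop basis" means.
# All formulas are standard rough estimates; nothing here is from the paper.

def attention_flops(n: int, d: int) -> int:
    """Approximate FLOPs for one standard attention head over a length-n
    sequence: QK^T scores (2*n*n*d) plus the weighted sum over V (2*n*n*d)."""
    return 4 * n * n * d

def conv_over_scores_flops(n: int, k: int) -> int:
    """Extra FLOPs if a k x k convolution is applied over the (n x n)
    attention-score map -- a hypothetical estimate of the added cost."""
    return 2 * n * n * k * k

n, d, k = 1024, 64, 5      # assumed sequence length, head dim, kernel size
base = attention_flops(n, d)
extra = conv_over_scores_flops(n, k)
overhead = extra / base     # = k*k / (2*d), independent of n

# The isoflop question: if the enhanced head costs (1 + overhead) x the
# compute, does it still win against a baseline that spends that same
# extra budget on, e.g., more heads, width, or training tokens?
print(f"extra FLOPs per head: {100 * overhead:.1f}%")
```

Under these assumptions the convolution adds roughly k²/(2d) relative FLOPs per head, so "better at equal compute" is a stricter bar than "better at equal parameter count".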
fabmilo | 03 Apr 25 02:27 UTC | No. 43564050
>>43562732
I read the paper and the results don't really convince me that is the case. But the problem remains: how to use information from different parts of the model without squishing it into a single value with the softmax.
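The "squishing" being described: in standard attention, everything about how query i relates to key j is collapsed into one scalar before the softmax. A minimal sketch with NumPy (shapes are illustrative assumptions, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 5, 8                      # assumed sequence length and head dimension
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

# One scalar per query-key pair: the d-dimensional interaction between
# query i and key j is reduced to the single value scores[i, j].
scores = Q @ K.T / np.sqrt(d)    # shape (n, n)
weights = softmax(scores, axis=-1)
out = weights @ V                # shape (n, d)
```

Each row of `weights` sums to 1, so however rich the query-key interaction was, its entire influence on the output is carried by that one normalized scalar per pair; that is the bottleneck the comment points at.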