←back to thread

152 points fzliu | 1 comments | | HN request time: 0.22s | source
1. curiousfiddler ◴[] No.43563854[source]
So, why would this extract more semantic meaning than multi-head attention? Isn't the whole point of multiple heads similar to how CNNs use multiple types of filters to extract different semantic relationships?