
295 points rttti | 1 comment
CGMthrowaway No.45119871
Honest feedback - I was really excited when I read the opening. However, I did not come away from this with a greater understanding than I already had.

For reference, my initial understanding was somewhat low: basically I know a) what an embedding is, b) that transformers work by matrix multiplication, and c) that it's something like a multi-threaded Markov chain generator with the benefit of pre-trained embeddings.

onename No.45120200
Have you checked out this video from 3Blue1Brown that talks a bit about transformers?

https://youtu.be/wjZofJX0v4M

1. imtringued No.45125756
I personally would rather recommend that people just look at these architectural diagrams [0] and try to understand them. There is the caveat that they do not show how attention works. For that you need to understand softmax(QK^T)V, and that multi-head attention is just this same operation repeated several times (a minimal sketch follows below). GQA, MQA, etc. just mess around with reusing Q, K, or V in clever ways.

[0] https://huggingface.co/blog/vtabbott/mixtral
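
For anyone who wants softmax(QK^T)V spelled out, here is a minimal NumPy sketch of scaled dot-product attention, with multi-head attention as that same operation repeated once per head. The weight matrices, head count, and sizes are made up for illustration, and I've included the 1/sqrt(d_k) scaling from the standard formulation even though the comment above omits it.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Q, K, V: (seq_len, d_head) -> softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len)
        return softmax(scores, axis=-1) @ V   # (seq_len, d_head)

    def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
        # x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)
        seq_len, d_model = x.shape
        d_head = d_model // n_heads
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        heads = []
        for h in range(n_heads):              # the same attention op, repeated per head
            sl = slice(h * d_head, (h + 1) * d_head)
            heads.append(attention(Q[:, sl], K[:, sl], V[:, sl]))
        return np.concatenate(heads, axis=-1) @ Wo  # (seq_len, d_model)

    # tiny usage example with random weights (illustrative only)
    rng = np.random.default_rng(0)
    seq_len, d_model, n_heads = 4, 8, 2
    x = rng.normal(size=(seq_len, d_model))
    Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
    out = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
    print(out.shape)  # (4, 8)

The variants mentioned above mostly change the per-head slicing: MQA shares one K/V projection across all query heads, and GQA shares K/V within groups of heads, while the core softmax(QK^T)V step stays the same.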