
295 points rttti | 1 comment
CGMthrowaway No.45119871
Honest feedback - I was really excited when I read the opening. However, I did not come away from this with a greater understanding than I already had.

For reference, my initial understanding was somewhat low: basically I know a) what an embedding is, b) that transformers work by matrix multiplication, and c) that it's something like a multi-threaded Markov chain generator with the benefit of pre-trained embeddings.

onename No.45120200
Have you checked out this video from 3Blue1Brown that talks a bit about transformers?

https://youtu.be/wjZofJX0v4M

1. imtringued No.45125756
I personally would rather recommend that people just look at these architectural diagrams [0] and try to understand them. There is the caveat that they do not show how attention works. For that you need to understand softmax(QK^T)V, and that multi-head attention is just this same operation repeated several times (a minimal sketch follows below). GQA, MQA, etc. just mess around with reusing Q, K, or V in clever ways.

[0] https://huggingface.co/blog/vtabbott/mixtral
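
For anyone who wants softmax(QK^T)V spelled out, here is a minimal NumPy sketch of scaled dot-product attention, with multi-head attention as that same operation repeated once per head. The weight matrices, head count, and sizes are made up for illustration, and I've included the 1/sqrt(d_k) scaling from the standard formulation even though the comment above omits it.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Q, K, V: (seq_len, d_head) -> softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len)
        return softmax(scores, axis=-1) @ V   # (seq_len, d_head)

    def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
        # x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)
        seq_len, d_model = x.shape
        d_head = d_model // n_heads
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        heads = []
        for h in range(n_heads):              # the same attention op, repeated per head
            sl = slice(h * d_head, (h + 1) * d_head)
            heads.append(attention(Q[:, sl], K[:, sl], V[:, sl]))
        return np.concatenate(heads, axis=-1) @ Wo  # (seq_len, d_model)

    # tiny usage example with random weights (illustrative only)
    rng = np.random.default_rng(0)
    seq_len, d_model, n_heads = 4, 8, 2
    x = rng.normal(size=(seq_len, d_model))
    Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
    out = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
    print(out.shape)  # (4, 8)

The variants mentioned above mostly change the per-head slicing: MQA shares one K/V projection across all query heads, and GQA shares K/V within groups of heads, while the core softmax(QK^T)V step stays the same.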