
425 points | karimf | 3 comments
daxfohl ◴[] No.45657955[source]
I wonder if a linear-space, constant-time model like RWKV or S4 would work better here. For audio, I wouldn't think you'd need long-range context, and all-to-all attention seems like overkill.

Maybe a transformer could run in parallel, but at a much lower frequency, with the linear model feeding it "summary" tokens once per second. Their information would mostly be "text", but also some hint of emotion and other cues. The output of the transformer would then be fed back to the linear model, so that it would know what it was saying and with what emotion. Basically the transformer would be the low-frequency, long-range context thinker (and feeler), and the linear model would translate that to and from phonetics.

They'd be trained in parallel, so those transformer tokens would acquire their meaning at training time rather than having to be pre-defined. So it'd still be purely phonetic end-to-end, with no direct translation to text. It could even end up being a good way to compress text for LLMs, since low-value words might get a smaller share of a token's representation.

Probably would never reach the level of text based LLMs for logic and code and such, but that somewhat parallels humans anyway; it's pretty hard to explain an algorithm in detail in plain conversation.
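The two-rate architecture described above can be sketched in a few lines. This is a toy illustration, not a real RWKV/S4 or transformer: the fast path is a plain linear state-space recurrence with O(1) cost per audio step, and the slow "thinker" is stubbed out as a mean over the summary tokens seen so far. All dimensions, matrices, and the `summary_every` rate are made-up assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_fast, d_ctx = 16, 8      # fast state size, slow-context size (illustrative)
summary_every = 4          # fast steps per summary token (stand-in for "once per second")

A = 0.9 * np.eye(d_fast)               # stable linear state transition
B = rng.normal(size=(d_fast, 1))       # input projection: audio sample -> state
C = rng.normal(size=(d_ctx, d_fast))   # readout: state -> summary token
W = rng.normal(size=(d_fast, d_ctx))   # feedback: slow context -> fast state

def slow_module(summaries):
    # Placeholder for the low-frequency transformer: here just an average
    # over all summary tokens seen so far (the long-range "context").
    return np.mean(summaries, axis=0)

def run(audio):
    h = np.zeros(d_fast)       # fast linear-model state
    ctx = np.zeros(d_ctx)      # feedback context from the slow module
    summaries = []
    for t, x in enumerate(audio, 1):
        # O(1) per-step update: no attention over the full history.
        h = A @ h + B @ np.array([x]) + 0.1 * (W @ ctx)
        if t % summary_every == 0:
            summaries.append(C @ h)       # hand a summary token upward
            ctx = slow_module(summaries)  # slow module refreshes the context
    return h, len(summaries)

h, n_tokens = run(rng.normal(size=32))
print(n_tokens)  # 32 steps / 4 = 8 summary tokens
```

In a real system both paths would be trained jointly, so the summary tokens would learn whatever mix of "text", emotion, and other cues is useful, as the comment suggests; here they're just a fixed linear readout.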

replies(3): >>45658525 #>>45660560 #>>45663963 #
tehnub ◴[] No.45658525[source]
Write this paper please!
replies(1): >>45658872 #
1. daxfohl ◴[] No.45658872[source]
If anyone wants to buy me some GPU time, I'd be happy to try it out! Fair warning: my only deep-learning experience so far is training a CNN to count dots in an image, which worked semi-reliably up to 8 dots, provided the image was perfectly square black "dots" on a perfectly white background.
replies(2): >>45659594 #>>45662126 #
2. smokel ◴[] No.45659594[source]
Off-topic, but it would be great if everyone who voiced their opinion on something would add a small disclaimer with their actual knowledge about the subject. Thanks for sharing :)
3. fragmede ◴[] No.45662126[source]
Sure. What's your Venmo?