ilaksh:
One thing that has always seemed important to these discussions is that the serial structure of language is probably not an optimization but simply a consequence of the fact that we can only utter or hear one sound at a time.

In my mind there should be some kind of parallel/hierarchical model that sits behind the language layers and can optionally be converted back into a series of tokens: middle layers trained on world models (such as from video), intermediate layers trained on the mapping between them, and outer layers trained on text, including quite a lot of transcripts etc., so that the middle layers fully ground the outer ones.
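
To make the shape of that concrete, here is a toy PyTorch sketch of roughly what I am picturing; every module name, size, and wiring choice is invented purely for illustration, not a claim about how such a model would actually be built:

    import torch
    import torch.nn as nn

    # Toy sketch only: a shared, modality-agnostic "middle" representation fed by
    # separate text and video encoders, with an optional projection back to tokens.
    class GroundedCore(nn.Module):
        def __init__(self, d_model=256, vocab_size=32000, patch_dim=3 * 16 * 16):
            super().__init__()
            layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.text_encoder = nn.Sequential(nn.Embedding(vocab_size, d_model), layer())
            self.video_encoder = nn.Sequential(nn.Linear(patch_dim, d_model), layer())
            # Shared "world model" layers that see embeddings from either modality.
            self.core = nn.TransformerEncoder(layer(), num_layers=2)
            # Optional serialization step: project the latent state back to token logits.
            self.to_tokens = nn.Linear(d_model, vocab_size)

        def forward(self, text_ids=None, video_patches=None):
            parts = []
            if text_ids is not None:
                parts.append(self.text_encoder(text_ids))
            if video_patches is not None:
                parts.append(self.video_encoder(video_patches))
            h = self.core(torch.cat(parts, dim=1))   # shared latent "world" state
            return self.to_tokens(h)                 # back to a series of tokens

    model = GroundedCore()
    text = torch.randint(0, 32000, (1, 12))      # a short token sequence
    video = torch.randn(1, 8, 3 * 16 * 16)       # 8 flattened toy video "patches"
    print(model(text_ids=text, video_patches=video).shape)   # torch.Size([1, 20, 32000])

The point is just that the token series is an optional interface at the edges, while the shared core works on a modality-agnostic latent state.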

I don't really understand transformers, diffusion transformers, etc., but I am optimistic that increases in compute and memory capacity over the next few years will allow more video data to be integrated with language data. That will result in fully grounded multimodal models that are even more robust and more general-purpose.

I keep waiting to hear about some manufacturing/design breakthrough with memristors or some kind of memory-centric computing that gives another 100x boost in model sizes and/or efficiency, because the major functionality gains do seem to have been unlocked by scaling hardware, which in turn allowed models that could take advantage of the new scale. For me, large multimodal video datasets with transcripts, plus more efficient hardware to compress and host them, are what will make AI more robust.

I do wish I understood transformers better, though, because it seems like they are somehow more general-purpose. Is there something about them that is not dependent on the serialization or tokenization, something that could be extracted to make other types of models more general? Maybe tokens with scalars attached, still fully contextualized, but computed as many parallel groups at each step.
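
From what I gather, the attention step itself is actually order-agnostic; the serial flavor comes from positional encodings and autoregressive decoding, not from the core mechanism. A tiny numpy sketch (toy sizes, no relation to any real model) showing that plain scaled dot-product attention is permutation-equivariant:

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        # Plain scaled dot-product self-attention over a set of n token vectors X (n, d).
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    rng = np.random.default_rng(0)
    n, d = 5, 8
    X = rng.normal(size=(n, d))
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    out = self_attention(X, Wq, Wk, Wv)

    # Shuffle the "token" order: without positional encodings the outputs are
    # shuffled the same way, i.e. attention itself never sees an ordering.
    perm = rng.permutation(n)
    assert np.allclose(self_attention(X[perm], Wq, Wk, Wv), out[perm])

Shuffling the input rows just shuffles the output rows the same way, so whatever serial structure a transformer has is injected explicitly rather than baked into the mechanism.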