
721 points ralusek | 7 comments
ryandrake ◴[] No.41870217[source]
I'm making some big assumptions about Adobe's product ideation process, but: This seems like the "right" way to approach developing AI products: Find a user need that can't easily be solved with traditional methods and algorithms, decide that AI is appropriate for that thing, and then build an AI system to solve it.

Rather than what many BigTech companies are currently doing: "Wall Street says we need to 'Use AI Somehow'. Let's invest in AI and Find Things To Do with AI. Later, we'll worry about somehow matching these things with user needs."

replies(15): >>41870304 #>>41870341 #>>41870369 #>>41870422 #>>41870672 #>>41870780 #>>41870851 #>>41870929 #>>41871322 #>>41871724 #>>41871915 #>>41871961 #>>41872523 #>>41872850 #>>41873162 #
jthacker ◴[] No.41870369[source]
This is certainly a great, immediately useful tool, but also a relatively small ROI: both the return and the investment are small. Big tech is aiming for a much bigger return on a clearly bigger investment. That's potentially going to look like a lot of useless stuff in the meantime. Also, if it weren't for big tech and its big investments, these tools/models wouldn't exist at this level of sophistication for others to use in applications like this one.
replies(2): >>41870490 #>>41870639 #
HarHarVeryFunny ◴[] No.41870490[source]
While the press lumps it all together as "AI", you have to differentiate LLMs (driven by big tech and big money) from the unrelated image/video generative models and approaches like diffusion, NeRF, Gaussian splatting, etc., which have their roots in academia.
replies(1): >>41870923 #
copperx ◴[] No.41870923[source]
LLMs don't have their roots in academia?
replies(1): >>41871017 #
1. withinboredom ◴[] No.41871017[source]
Not anymore.
replies(2): >>41871091 #>>41871250 #
2. HarHarVeryFunny ◴[] No.41871091[source]
Not at all - the Transformer was invented at Google by a group of researchers who have since left, primarily Jakob Uszkoreit and Noam Shazeer. Of course, as with anything, it builds on what had gone before, but it's really quite a novel architecture.
replies(1): >>41872382 #
3. stavros ◴[] No.41871250[source]
This makes no sense. A thing's roots don't change: either it started there or it didn't.
replies(1): >>41871790 #
4. HarHarVeryFunny ◴[] No.41871790[source]
It didn't.

At least, the Transformer didn't. The abstract idea of a language model goes way back, though, within the field of linguistics, and people were building simplistic "N-gram" models before ever using neural nets, then using other types of neural net such as LSTMs and CNNs(!) before Google invented the Transformer (primarily with the goal of fully utilizing the parallelism available from GPUs - which couldn't be done with a recurrent model like LSTM).
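
For concreteness, an N-gram model of the sort mentioned here is just conditional counts over a corpus. A minimal bigram sketch in Python (the toy corpus and names are mine, and real systems add smoothing and backoff):

    # Toy bigram language model: P(next | current) estimated from raw counts.
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat ran".split()

    counts = defaultdict(Counter)
    for cur, nxt in zip(corpus, corpus[1:]):
        counts[cur][nxt] += 1          # count how often `nxt` follows `cur`

    def p_next(cur, nxt):
        total = sum(counts[cur].values())
        return counts[cur][nxt] / total if total else 0.0

    print(p_next("the", "cat"))        # 2/3: "the" is followed by "cat" twice, "mat" once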

5. ansk ◴[] No.41872382[source]
The scientific impact of the Transformer paper is large, but in my opinion the novelty is vastly overstated. The primary novelty is adapting the (already existing) dot-product attention mechanism to be multi-headed. And frankly, the single-head -> multi-head evolution wasn't particularly novel -- it's the same trick the computer vision community applied to convolutions 5 years earlier, yielding the widely-adopted grouped convolution. The lasting contribution of the Transformer paper is really just ordering the existing architectural primitives (attention layers, feedforward layers, normalization, residuals) in a nice, reusable block. In my opinion, the most impactful contributions in the lineage of modern attention-based LLMs are the introduction of dot-product attention (Bahdanau et al., 2015) and the first attention-based sequence-to-sequence model (Graves, 2013). Both of these are from academic labs.
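
To make the single-head -> multi-head point concrete, here is a minimal NumPy sketch (shapes and names are my own, not from any of the papers): scaled dot-product attention with distinct query/key/value projections, plus a multi-head wrapper that just splits the projections into independent groups, much as grouped convolution splits channels into groups.

    # Minimal scaled dot-product attention and a multi-head wrapper (NumPy).
    # A sketch of the standard formulation, not any particular paper's code.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Q, K, V: (seq_len, d)
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot products
        return softmax(scores) @ V                # weighted sum of values

    def multi_head(X, Wq, Wk, Wv, Wo, n_heads):
        # X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)
        d_head = X.shape[-1] // n_heads
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        heads = []
        for h in range(n_heads):
            s = slice(h * d_head, (h + 1) * d_head)
            heads.append(attention(Q[:, s], K[:, s], V[:, s]))  # each head attends independently
        return np.concatenate(heads, axis=-1) @ Wo              # concatenate and mix

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))
    W = [rng.normal(size=(8, 8)) for _ in range(4)]
    print(multi_head(X, *W, n_heads=2).shape)     # (5, 8)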

As a side note, a similar phenomenon occurred with the Adam optimizer, where the ratio of public/scientific attribution to novelty is disproportionately large (the Adam optimizer is a very minor modification of the RMSProp + momentum optimization algorithm presented in the same Graves 2013 paper mentioned above).
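
For reference, the two update rules side by side as a sketch. The exact RMSProp + momentum formulation varies (Graves 2013, for example, also subtracts a running mean of the gradient); this is one common form, with variable names of my own choosing. The main additions in Adam are the bias-correction terms.

    # RMSProp + momentum vs. Adam, single-parameter sketch (NumPy).
    import numpy as np

    def rmsprop_momentum_step(theta, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        v = beta2 * v + (1 - beta2) * g**2           # running average of squared gradients
        m = beta1 * m + lr * g / (np.sqrt(v) + eps)  # momentum applied to the scaled gradient
        return theta - m, m, v

    def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * g              # first moment estimate
        v = beta2 * v + (1 - beta2) * g**2           # second moment estimate
        m_hat = m / (1 - beta1**t)                   # bias correction (the main difference)
        v_hat = v / (1 - beta2**t)
        return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v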

replies(1): >>41872923 #
6. HarHarVeryFunny ◴[] No.41872923{3}[source]
I think the most novel part of it, and where a lot of the power comes from, is the key-based attention, which then operationally gives rise to the emergence of induction heads (whereby a pair of adjacent layers coordinates to provide a powerful context lookup-and-copy mechanism).
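
Here's a toy, non-neural illustration of the lookup-and-copy pattern usually attributed to induction heads: find an earlier occurrence of the current token, then predict the token that followed it. The function and the exact-match rule are my own simplification, not code from any paper.

    # Toy "induction head" behaviour: scan back for the previous occurrence of
    # the current token and copy what came after it. A conceptual sketch of the
    # match-and-copy pattern, not an actual trained attention head.
    def induction_predict(tokens):
        preds = []
        for i, tok in enumerate(tokens):
            pred = None
            for j in range(i - 1, -1, -1):   # scan backwards for a match
                if tokens[j] == tok:
                    pred = tokens[j + 1]     # copy what followed it last time
                    break
            preds.append(pred)
        return preds

    seq = ["A", "B", "C", "A", "B", "C", "A"]
    print(induction_predict(seq))
    # [None, None, None, 'B', 'C', 'A', 'B'] -- the repeated pattern is copied forward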

The reusable/stackable block is of course a key part of the design, since the key insight was that language is as much hierarchical as sequential, and can therefore be processed in parallel (not in sequence) by a hierarchical stack of layers that each use the key-based lookup mechanism to access other tokens, whether by position or not.
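
A structural sketch of that stackable block (post-norm ordering as described in the original paper; the unprojected single-head attention, missing dropout, and all names here are my own simplification):

    # Skeleton of one post-norm Transformer block, stacked N times (NumPy).
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(-1, keepdims=True))
        return e / e.sum(-1, keepdims=True)

    def self_attention(x):
        # Unprojected single-head attention, just to keep the skeleton runnable.
        scores = x @ x.T / np.sqrt(x.shape[-1])
        return softmax(scores) @ x

    def layer_norm(x, eps=1e-5):
        return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

    def ffn(x, W1, W2):
        return np.maximum(0, x @ W1) @ W2            # position-wise feedforward (ReLU)

    def block(x, W1, W2):
        x = layer_norm(x + self_attention(x))        # attention + residual + norm
        x = layer_norm(x + ffn(x, W1, W2))           # feedforward + residual + norm
        return x

    rng = np.random.default_rng(0)
    x, W1, W2 = rng.normal(size=(5, 8)), rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
    for _ in range(6):                               # the same block, stacked hierarchically
        x = block(x, W1, W2)
    print(x.shape)                                   # (5, 8)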

In any case, if you look at the seq2seq architectures that preceded it, it's hard to claim that the Transformer is really based on or evolved from any of them (especially the prevailing recurrent approaches), notwithstanding that it obviously leveraged the concept of attention.

I find the developmental history of the Transformer interesting, and wish more had been documented about it. It seems from interviews with Uszkoreit that the idea of parallel language processing based on a hierarchical design using self-attention was his, but that he was personally unable to realize this idea in a way that beat other contemporary approaches. Noam Shazeer was the one who then took the idea and realized it in the form that would eventually become the Transformer, but it seems there was some degree of throwing the kitchen sink at it and then a later ablation process to minimize the design. What would be interesting to know is an honest assessment of how much of the final design was inspiration and how much experimentation. It's hard to imagine that Shazeer anticipated the emergence of induction heads when this model was trained at sufficient scale, so the architecture does seem to be at least partly an accidental discovery, rather than just the next-generation seq2seq model it seems to have been conceived as.

replies(1): >>41874338 #
7. ansk ◴[] No.41874338{4}[source]
Key-based attention is not attributable to the Transformer paper. The first paper I can find where keys, queries, and values are distinct matrices is https://arxiv.org/abs/1703.03906, described at the end of section 2. The authors of the Transformer paper are very clear in how they describe their contribution to the attention formulation, writing "Dot-product attention is identical to our algorithm, except for the scaling factor". I think it's fair to state that multi-head is the paper's only substantial contribution to the design of attention mechanisms.

I think you're overestimating the degree to which this type of research is motivated by big-picture, top-down thinking. In reality, it's a bunch of empirically driven, in-the-weeds experiments that guide a very local search in an intractably large search space. I can just about guarantee the process went something like this:

- The authors begin with an architecture similar to the current SOTA, which was a mix of recurrent layers and attention

- The authors realize that they can replace some of the recurrent layers with attention layers, and performance is equal or better. It's also way faster, so they try to replace as many recurrent layers as possible.

- They realize that if they remove all the recurrent layers, the model sucks. They're smart people and they quickly realize this is because the attention-only model is invariant to sequence order. They add positional encodings to compensate for this (a quick check of that order-invariance point follows this list).

- They keep iterating on the architecture design, incorporating best-practices from the computer vision community such as normalization and residual connections, resulting in the now-famous Transformer block.
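
As a quick check of the order-invariance point above: pure attention is permutation-equivariant, so without positional information the model cannot distinguish orderings of its input; adding the sinusoidal encodings from the Transformer paper breaks that symmetry. A small sketch, with names and toy shapes of my own:

    # Attention alone cannot tell sequence orderings apart; positional
    # encodings break the symmetry. Sketch only.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(-1, keepdims=True))
        return e / e.sum(-1, keepdims=True)

    def attn(x):
        return softmax(x @ x.T / np.sqrt(x.shape[-1])) @ x

    def sinusoidal_pe(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model)[None, :]
        angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
        return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

    rng = np.random.default_rng(0)
    x = rng.normal(size=(6, 8))
    perm = rng.permutation(6)

    print(np.allclose(attn(x)[perm], attn(x[perm])))            # True: permuting the input just permutes the output
    pe = sinusoidal_pe(6, 8)
    print(np.allclose(attn(x + pe)[perm], attn(x[perm] + pe)))  # False: with encodings, order now matters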

At no point is any stroke of genius required to get from the prior SOTA to the Transformer. It's the type of discovery that follows so naturally from an empirically-driven approach to research that it feels all but inevitable.