
Interview with gwern

(www.dwarkeshpatel.com)
308 points by synthmeat | 6 comments
YeGoblynQueenne ◴[] No.42135916[source]
This will come across as vituperative, and I guess it is a bit, but I've interacted with Gwern on this forum and the interaction that has stuck with me is in this thread, where Gwern mistakes a^nb^n (a context-free but not regular language) for a regular one, and calls my comment "not even wrong":

https://news.ycombinator.com/item?id=21559620

Again I'm sorry for the negativity, but already at the time Gwern was held up by a certain, large, section of the community as an important influencer in AI. For me that's just a great example of how the vast majority of AI influencers (who vie for influence on social media, rather than doing research) are basically clueless about AI and CS and only have second-hand knowledge, which I guess they're good at organising and popularising, but not more than that. It's easy to be a cheerleader for the mainstream view on AI. The hard part is finding, and following, unique directions.

With apologies again for the negative slant of the comment.

replies(10): >>42136055 #>>42136148 #>>42136538 #>>42136759 #>>42137041 #>>42137215 #>>42137274 #>>42137284 #>>42137350 #>>42137636 #
aubanel ◴[] No.42136055[source]
> For me that's just a great example of how basically the vast majority of AI influencers (who vie for influence on social media, rather than research) are basically clueless about AI and CS

This is a bit stark: there are many great, knowledgeable engineers and scientists who would not get your point about a^nb^n. It's impossible to know 100% of such a wide area as "AI and CS".

replies(2): >>42136162 #>>42136565 #
1. nocobot ◴[] No.42136162[source]
Is it really? This is the most common example of a context-free language and something most first-year CS students will be familiar with.

totally agree that you can be a great engineer and not be familiar with it, but seems weird for an expert in the field to confidently make wrong statements about this.

replies(2): >>42136390 #>>42139265 #
2. YeGoblynQueenne ◴[] No.42136390[source]
Thanks, that's what I meant. a^nb^n is a standard test of learnability.

That stuff is still absolutely relevant, btw. Some DL people like to dismiss it as irrelevant, but that's just because they lack the background to appreciate why it matters. Also: the arrogance of youth (hey, I've already been a postdoc for a year, I'm ancient). Here's a recent paper on Neural Networks and the Chomsky Hierarchy that tests RNNs and Transformers on formal languages (I think it doesn't test a^nb^n directly, but it tests similar a-b based context-free languages):

https://arxiv.org/abs/2207.02098

And btw that's a good paper. Probably one of the most satisfying DL papers I've read in recent years. You know when you read a paper and you get this feeling of satiation, like "aaah, that hit the spot"? That's the kind of paper.

replies(1): >>42137043 #
3. GistNoesis ◴[] No.42137043[source]
a^nb^n can definitely be expressed and recognized with a transformer.

A transformer (with relative, translation-invariant positional embeddings) has the full context, so it can see the whole sequence. It just has to count and compare.

To convince yourself, construct the weights manually.

First layer: zero out characters that are equal to the previous character.

Second layer: build one feature to detect and extract the positional embedding of the first a, a second feature for the last a, a third for the first b, and a fourth for the last b.

Third layer: on top of that, check whether (second feature - first feature) == (fourth feature - third feature).
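
In plain Python the check those layers would compute looks roughly like this (just a sketch of the logic, not actual attention weights; the function and the explicit ordering test are mine for illustration):

    def in_anbn(s: str) -> bool:
        # Position-based check sketched above: find first/last a and first/last b,
        # require the a-block to precede the b-block, and compare block widths.
        if not s or set(s) - {"a", "b"}:
            return False
        a_pos = [i for i, c in enumerate(s) if c == "a"]
        b_pos = [i for i, c in enumerate(s) if c == "b"]
        if not a_pos or not b_pos:
            return False
        first_a, last_a = a_pos[0], a_pos[-1]
        first_b, last_b = b_pos[0], b_pos[-1]
        if last_a > first_b:          # a's and b's must not interleave
            return False
        return (last_a - first_a) == (last_b - first_b)

    assert in_anbn("aaabbb") and not in_anbn("aabbb") and not in_anbn("abab")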

The paper doesn't distinguish between the expressive capability of the model and finding the optimum of the model, i.e. the training procedure.

If you train by only showing examples with varying n, there probably isn't an inductive bias to make it converge naturally towards the optimal solution you can construct by hand. But you can probably train multiple formal languages simultaneously, to make the counting feature emerge from the data.

You can't deduce much from negative results in research besides that they require more work.

replies(1): >>42137190 #
4. YeGoblynQueenne ◴[] No.42137190{3}[source]
>> The paper doesn't distinguish between the expressive capability of the model and finding the optimum of the model, i.e. the training procedure.

They do. That's the whole point of the paper: you can set a bunch of weights manually like you suggest, but can you learn them instead, and how? See the Introduction. They make it very clear that they are investigating whether certain concepts can be learned by gradient descent, specifically. They point out that earlier work doesn't do that, and that gradient descent is an obvious source of bias that should affect the ability of different architectures to learn different concepts. Like I say, good work.

>> But you can probably train multiple formal languages simultaneously, to make the counting feature emerge from the data.

You could always try it out yourself, you know. Like I say that's the beauty of grammars: you can generate tons of synthetic data and go to town.
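
E.g. a few lines of Python will give you as much a^nb^n training data as you want (a hypothetical sketch, not the paper's setup):

    import random

    def positive(n):
        # in the language: n a's followed by n b's
        return "a" * n + "b" * n

    def negative(n):
        # not in the language: wrong counts, or right counts in the wrong order
        if random.random() < 0.5:
            return "a" * n + "b" * (n + 1)
        s = list("a" * n + "b" * n)
        random.shuffle(s)
        s = "".join(s)
        return s if s != positive(n) else "ba" * n

    data = [(positive(n), 1) for n in range(1, 100)] + \
           [(negative(n), 0) for n in range(1, 100)]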

>> You can't deduce much from negative results in research besides that they require more work.

I disagree. I'm a falsificationist. The only time we learn anything useful is when stuff fails.

replies(1): >>42139528 #
5. aubanel ◴[] No.42139265[source]
In my country (France), I think most final-year CS students will not have heard of it (pls anyone correct me if I'm wrong).
6. GistNoesis ◴[] No.42139528{4}[source]
Gradient descent usually gets stuck in local minima; it depends on the shape of the energy landscape, and that's expected behavior.

The current wisdom is that optimizing for multiple tasks simultaneously makes the energy landscape smoother. One task allows you to discover features which can be used to solve other tasks.

Useful features that are used by many tasks can more easily emerge from the sea of useless features. If you don't have sufficiently many distinct tasks the signal doesn't get above the noise and is much harder to observe.

That's the whole point of "generalist" intelligence in the scaling hypothesis.

For problems where you can write a solution manually, you can also help the training procedure by regularising the problem: add the auxiliary task of predicting some custom feature. Alternatively you can "generatively pretrain" to obtain useful features, replacing a custom loss function with custom data.
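
In pseudo-PyTorch the idea is just an extra term in the loss (the model and head names here are made up for illustration):

    import torch.nn.functional as F

    def training_step(model, batch):
        tokens, labels, aux_targets = batch        # aux_targets: e.g. the hand-designed count feature
        features = model.encode(tokens)            # shared representation (hypothetical API)
        main_logits = model.classify(features)     # main task: is the string in the language?
        aux_pred = model.aux_head(features)        # auxiliary task: predict the custom feature
        return F.cross_entropy(main_logits, labels) \
             + 0.1 * F.mse_loss(aux_pred, aux_targets)   # auxiliary term regularises the features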

The paper is a useful characterisation of the energy landscape of various formal tasks in isolation, but it doesn't investigate the more general, and in practice simpler, multi-task setting that actually occurs.