There's a monumental amount of quality data out there that's not indexed, not searchable, abandoned and unused. We just need to value it enough to use it.
One data source I've been thinking about, and I don't know if they've hit it yet, is all the different local and state agencies and their private and public meetings, ordinances, discourse, etc.
If instead you focus on using the government to outlaw demand, only failure will follow. I mean, didn't the government outlawing the demand for illegal drugs fail miserably? I believe drugs are cheaper, more potent, and more available than ever.
Similarly, if there is demand for compute, then compute will occur. There is always a clearing price commensurate with the risk.
This is big because it would hold novel data the West doesn't access. What is the 'mood' of the average Chinese farmer on Taiwan?
Otherwise it's hard to see how adding more text of the same thing is going to create a revolution.
Video will be something new. But if, like in "Her", it watches every Twitch stream simultaneously for a month, and is talking to a billion people for a month, and still doesn't get it, what else is going to happen?
Definitely not when it comes to text. The internet is the largest resource. I'd like to see all books in the Vatican digitised, if they're not already - probably not though.
Humans receive an enormous amount of training data in forms not currently available to LLMs.
If you locked baby Einstein in a room with the collected works of humanity and left him there for a lifetime, I doubt he'd have even learnt to read on his own.
The greatest advantage you can have in life is a creative mind and I don't believe that is something that can be taught. It can be stomped out of you as a child, but it's not learnable.
The current performance of LLMs on non-English languages is disappointing. Feeding it more non-English material is not guaranteed to help.
> This is big because it would hold novel data the West doesn't access. What is the 'mood' of the average Chinese farmer on Taiwan?
The average Chinese farmer does not produce textual output of that kind. It is generally not advisable to put your thoughts in writing in oppressive regimes. It could be a life-ending mistake.
> Otherwise it's hard to see how adding more text of the same thing is going to create a revolution.
The LLM gang are like the people who think they can get slimmer by eating more.
> Video will be something new. But if, like in "Her", it watches every Twitch stream simultaneously for a month, and is talking to a billion people for a month, and still doesn't get it, what else is going to happen?
Since when does Twitch carry broadcasts that have any value to humanity? Is it used to hold scientific discussions? Or for shooting shit and pushing paid products and services?
Of course not, not any more than he’d learn to program a computer on his own.
Reading and writing are not natural skills. They are a technology that was invented by humans and they must be taught in some capacity. As we learn to read and write, parts of our brain related to language and other skills (which we do possess naturally) are co-opted to enable reading and writing, an unnatural skill.
Intelligence and language, however, are natural human abilities that we have evolved, likely over millions of years. Any parent will tell you this is obvious. It’s amazing how much children are able to infer and learn on their own.
There will be much value in automating the tedious and the routine.
Of course, that doesn't make for a great science fiction story. We first have to placate all these science fiction fantasies and in the process we will automate the tedious and the routine as a side effect of trying to figure out how many AGI can dance on the head of a pin.
Then human creativity will just be worth all the more.
And it’s going to get worse now that paper files are disappearing.
I wanted some info and data from a test we did 9 years ago. It was a pretty big deal: lots of people involved, many millions of dollars, multiple NASA centers contributing. Every single person on the test kept their own files, however they saw fit, for the portion of the test they were responsible for. And the only copy of the raw test data was deleted by one of them to save some space when upgrading. There is no record anywhere of what equipment was used for the test.
One of my coworkers has a 4 TB external HDD on which he keeps everything he has ever worked on. It’s not backed up anywhere else. It just failed and he thought he had lost everything. Luckily I was able to recover most of it. Wtf.
The chatbots do what infant humans do: mimic what they "see" until they get the pattern consistently matching what they saw, without any capacity to understand what they are doing.
Once humans have that part done, whole new layers of semantic learning kick in and create the critical analyses we perceive as "intelligence".
LLMs, as a consequence of their design, lack those deeper layers. They are not artificially intelligent at all. Rather, they're the latest iteration of what centuries ago gave us steam-powered songbirds.
Secondarily, I would expect this effect to be swamped by other factors (e.g. conquest).
https://blogs.worldbank.org/en/energy/what-carbon-tax-can-do...
All of the information accumulated by evolution gets passed through DNA. For humans, that's well under 1GB. Probably a very tiny fraction of that determines how the brain works at the algorithmic level. You should think of this information as the "software" of the brain, not pretrained LLM weights (350GB for GPT-3).
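For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch. The inputs are my assumptions, not exact figures: roughly 3.1 billion base pairs in the haploid human genome, 2 bits per base with no compression, and 175 billion GPT-3 parameters stored as 16-bit values.

```python
# Back-of-envelope comparison; every constant here is an approximation.
BASE_PAIRS = 3.1e9      # assumed haploid human genome length, in base pairs
BITS_PER_BASE = 2       # four possible bases (A, C, G, T) -> 2 bits each

# Raw, uncompressed size of the genome in bytes.
genome_bytes = BASE_PAIRS * BITS_PER_BASE / 8

GPT3_PARAMS = 175e9     # GPT-3 parameter count
BYTES_PER_PARAM = 2     # assumes weights stored as 16-bit floats

# Size of the model weights in bytes.
weights_bytes = GPT3_PARAMS * BYTES_PER_PARAM

print(f"genome:  {genome_bytes / 1e9:.3f} GB")
print(f"weights: {weights_bytes / 1e9:.0f} GB")
print(f"ratio:   {weights_bytes / genome_bytes:.0f}x")
```

Under those assumptions the whole genome comes to well under 1GB and the weights to 350GB, so the weights are several hundred times larger than everything DNA carries, let alone the fraction of it that concerns the brain.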
Also, if this path to AGI does not work out, it's not as though the nuclear reactors will be wasted. People will find something else to do with the energy.
How much of that is cognitively useful for learning English? On top of the textual content, audio gives you emphasis and mood. Not a lot of information in that -- a few bits per sentence.
1GB of text is way more useful for generating text than 1GB of video is.
> training on the right data improves prediction more than training on more data
Books are more useful than Facebook rants. But this is an argument for data scarcity rather than for data abundance.
We have billions of cameras, microphones and IMU/GPS sensors. In fact, there is one in almost every pocket and on almost every desk.
Survival requires intelligence to be energy- and resource-efficient.
Those who build the most powerful and useful models that run locally on the edge and are data-efficient have a higher chance of winning.
Whoever provides the cheapest, fastest, most useful models will keep on winning.
Illegal information depends on jurisdiction. Even before you get to governments restricting access to facts, unflattering opinions, or information they consider immoral, you have information considered fraudulent, defamatory and perhaps even treasonous. Secret information might also count here, since governments don't want that ending up in your training data even if they do trust your cloud storage. Illegal information can get you thrown in prison if you seek it out or publish it or sometimes even just read it.
Better yet, measure in bytes. And remember that kids look at video, not at individual pictures (even if these are videos of pictures).