Out of the loop apparently, could you elaborate? By "the current power" I take you mean the current US administration?
We made a sandwich, but it cost 10x more than a human would and was slower. It might slowly become faster and more efficient, but by the time the robot gets really good at it, the skill simply isn't transferable, unless the model is genuinely able to make the leap across domains the way humans naturally do.
I'm afraid this is where the barrier between general intelligence and human intelligence lies. With enough of these geospatial motor-skill databases, we might get something that mimics humans very well but still runs into problems at the edge, and this last-mile problem really is a hindrance in so many domains where we come close but never complete.
I wonder if this will change with some sort of shift in computing, as well as in how we interface with digital systems (without mouse or keyboard); that might be able to close the 'last mile' gap.
And since scraping of publicly available data is not illegal (in the US, according to the aforementioned "lawyer"), it seems like it's okay?
Not legal advice.
[0] https://www.skadden.com/insights/publications/2024/05/distri...
Anyone who has a shred of integrity. I'm not a fan of overreaching copyright laws, but they've been strictly enforced for years now. Decades, even. They've ruined many lives, like how they killed Aaron Swartz.
But now, suddenly, violating copyright is totally okay and carries no consequences whatsoever because the billionaires decided that's how they can get richer now?
If you want to even try to pretend you don't live in a plutocracy and that the rule of law matters at all these developments should concern you.
If you haven't, highly recommended.
https://www.heise.de/en/news/After-criticism-of-AI-training-...
The "Big Beautiful Bill" contains a clause that prohibits state "AI" legislation.
Trump has a "Crypto and AI czar" who is very active in promoting "AI" on his YouTube propaganda outlet. The same czar also promoted, pre-election of course, accelerated peace with Russia and then stopped talking about the subject altogether.
> the core insight: predict in representation space, not pixels
We've been doing this since 2014? Not only that, others have been doing it at a similar scale, e.g. Nvidia's world foundation models (although those are generative).
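For readers who want the "representation space, not pixels" point made concrete, here is a toy numpy sketch of a JEPA-style objective; all names, shapes, and weights are illustrative stand-ins, not anything from the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    # toy shared encoder: project observations into an 8-dim representation
    return np.tanh(x @ W)

def predictor(z, P):
    # toy predictor: guess the future frame's embedding from the context embedding
    return z @ P

# four "context frames" and four "future frames", each a 16-dim toy observation
x_ctx = rng.normal(size=(4, 16))
x_tgt = rng.normal(size=(4, 16))

W = rng.normal(scale=0.1, size=(16, 8))  # encoder weights
P = rng.normal(scale=0.1, size=(8, 8))   # predictor weights

z_ctx = encoder(x_ctx, W)
z_tgt = encoder(x_tgt, W)  # real JEPA training uses an EMA / stop-gradient target here

# the training signal lives in representation space...
latent_loss = np.mean((predictor(z_ctx, P) - z_tgt) ** 2)

# ...whereas a generative/pixel model would regress the raw observation instead
pixel_loss = np.mean((x_ctx - x_tgt) ** 2)
```

The bet behind the latent loss is that the embedding can discard unpredictable pixel detail, while a pixel loss forces the model to account for every speck of noise.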
> zero-shot generalization (aka the money shot)
This is easily beaten by flow-matching imitation learning models like what Pi has.
> accidentally solved robotics
They're doing 65% success on very simple tasks.
The research is good. This article however misses a lot of other work in the literature. I would recommend you don't read it as an authoritative source.
Hello there! As a fellow gen-z douchebag, the article looks authentic, albeit a bit slim on Discord screencaps. Will be fun(?) to be proven wrong though.
My first thought upon reading this was that an LLM had been instructed to add a pithy meme joke to each paragraph. They don't make sense in context, and while some terminally online people do speak in memes, those people aren't quoting doge in 2025.
There's also a sense of incoherence in the whole piece. For instance, this section:
"- after: 22 million videos + 1 million images (now we're talking)
they basically hoovered up everything: something-something v2, kinetics, howto100m, and a billion youtube videos"
Was it a billion vids or 22m? It turns out the latter sentence is just rephrasing the list of sources in a cool casual way, and the last one is called YT-Temporal-1B. That's a billion frames of video, not a billion videos.
His death was a tragedy but it wasn't done to him.
The research is very real but the blog post appears to be very fake.
This writing style is prominent on Twitter and in niche Discords. It's funny how easily I've come to cut right through it, but if you haven't seen much of it, it's really hard to parse. That's by design, too. The vibe of this style is to project an air of confidence so strong that the author doesn't care whether you get it or not. It's a sort of humblebrag: the writing is supposed to flex the author's understanding of the subject while affecting indifference to whether it lands.
As others have already covered, there's also some heavy stretching of the truth and rewriting of history going on in this post. That's also common of the extreme bravado in this style of semi-impenetrable writing: The vagueness and ambiguities allow the author to make grandiose claims but then wiggle out of them later if someone is astute enough to catch on.
For example: The blog post is written as “We…” but is the author part of the team? Or is he using “we” meaning society in general?
I can’t imagine why you’d let the FBI off the hook
I think that's what was done to Aaron Swartz.
I think it's still pretty impressive in its recoveries, even though there's an unnaturally large number of them necessary. About 8 seconds into the video on the homepage, it almost misses and ends up slipping off the second step. I've eaten shit at missing a couple inch curb, though I don't think "graceful" has ever been used as a descriptor for me. So the fact that it just recovers and keeps going without issue is impressive to me.
Swartz was ill. It is a tragedy he did not survive the experience, and indeed, trial is very stressful. But he was no more hounded than any defendant who comes under federal scrutiny and has to defend themselves in a court of law via the trial system. Kevin Mitnick spent a year in prison (first incarceration) and survived it. Swartz was offered six months and committed suicide.
I don't know how much we should change of the system to protect the Aaron Swartzs of the world; that's the mother of all Chesterton's Fences.
For example, so that you don't crush a human when doing massage (but still need to press hard), or apply the right amount of force (and finesse?) to skin a fish fillet without cutting the skin itself.
Practically, in the near term, it's hard to sample failure examples from YouTube videos, such as food accidentally spilling out of a pot. Studying simple tasks only along the happy path makes it hard for the robot to figure out how to keep trying something until it succeeds, which comes up even in relatively simple jobs like shuffling garbage.
With that said, I suppose a robot can be made to practice in real life after learning something from vision.
You may be surprised to find out how incorrect this is.
I can think of two popular conservative sites likely to quote Doge people offhand that do this. I read all news in order not to be an insufferable ideologue. So again, off the top of my head: NotTheBee (I think affiliated with BabylonBee, the conservative The Onion) and Twitchy. Among YouTubers, I think Asmongold, and I'm sure others like Steven Crowder, who is himself in a famous meme.
That said… yea, you are probably right.
I hope I'm wrong, but this looks like an effort to normalize such a writing style. As that happens, intelligent discourse and rhetoric become harder.
For a single example: in any factory, watch how humans are added as ad-hoc machines wherever a problem occurs. Machine N outputting faster than machine N+1 can accept? Have a human stack, and destack, the product between them. No matter the size, shape, or (within reason) the weight of the product. But most importantly: the process can begin within seconds of the problem occurring. No need for a programmer, developer, or maintenance worker to get involved. Just a clear order from the shift manager.
A general-purpose robot with physical interfaces similar to a human's would be very valuable in such environments, if it had the software to be as easy to instruct as a human.
I'm pretty sure that's just a matter of reaction speed and of maintaining constant focus on its movement, a level of vigilance you'd usually reserve for some sports and for situations pre-identified as dangerous, like concentrating on balance, and on not getting into a position that overstresses your joints, when you know it's icy.
This is a type of information arbitrage where someone samples something intellectual without fully understanding it, then writes about it for a less technical audience. Their goal is to appear to be the expert on the topic, which translates into clout, social media follows, and eventually they hope job opportunities.
The primary goal of the writing isn’t to get you to understand the topic clearly, because that would diminish the sense that the author is more knowledgeable than you. The goal is to sound guru-like while making the topic feel impenetrably complex for you, while appearing playfully casual for the author.
so you just kinda let it run for a while and it bumps and squirms around until it stands up or whatever.
seems also the future for real ai?
Physics informed training is a real methodology (simple introduction to the subject: https://www.youtube.com/@Eigensteve/videos ).
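For anyone new to the idea, the core move in physics-informed training is to penalize a candidate function by how badly it violates a known governing equation. A minimal sketch (my own toy example, not from the linked lectures), using the ODE dy/dt = -y with y(0) = 1:

```python
import numpy as np

# grid of time points on [0, 1]
t = np.linspace(0.0, 1.0, 101)

def physics_residual(y, t):
    # finite-difference estimate of dy/dt + y, which vanishes for the true solution
    dydt = np.gradient(y, t)
    return dydt + y

y_good = np.exp(-t)  # the exact solution, so it satisfies the physics
y_bad = 1.0 - t      # matches y(0) = 1 but ignores the dynamics entirely

loss_good = np.mean(physics_residual(y_good, t) ** 2)
loss_bad = np.mean(physics_residual(y_bad, t) ** 2)
# a physics-informed network adds a term like loss_good to its data loss
# and minimizes both together, so unphysical fits like y_bad are penalized
```

The residual term is what distinguishes this from plain curve fitting: it supplies supervision everywhere on the grid, not just where you have data.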
However, the slop article is 80% nonsense. =3
In my country, far righters are displaying the country’s flag everywhere. Now you can’t display a French flag without being thought as a far right person. That’s honestly insufferable.
I know it’s less important with doge but still : before being a crypto it was just a picture of an overly innocent and enthusiastic dog. And even when it became a little crypto, it was totally assumed that it was a meme coin and wasn’t meant for speculation, the idea was that 1DOGE = 1DOGE only and people gifted them to other people who made nice contributions on the internet.
Musk broke all of this when he started using it for gigantic pumps and dumps, exploiting his own visibility on Twitter.
We don’t have to let fascism steal all the popular symbols / memes, because they will steal them anyway.
But yeah, I think a better way to put it is that sampling the happy path does help, but happy paths alone are far from sufficient for completing even some of the simplest human tasks once failure enters the picture.
What about using Flux Kontext (or Controlnets) to turn the messy kitchen into a clean kitchen?
I'm not sure that's necessarily true for a lot of tasks.
A good way to measure this in your head is this:
"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"
When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.
It therefore follows that robots should be able to learn with just RGB images too! Counterexamples would be things like grabbing an egg without crushing, perhaps. Though I suspect that could also be done with just vision.
Saying that we should not work on cures for pneumonia because it's a Chesterton Fence is obviously, blatantly, illogical. Saying that we should change the system so that government officials working for moneyed interests can't hound someone to death is similarly illogical.
(/joke)
Simple concept: pick up a glass and pour its contents into a vertical hole roughly the size of your mouth. Think of all the failure modes that can be triggered in this trivial example you perform multiple times a day. Doing the same from a single camera feed, with no other indicators, would take you hours to master, and you are already a superintelligent being.
I don't see how that follows. Humans have trained by experimenting with actually manipulating things, not just by vision. It's not clear at all that someone who had gained intuition about the world exclusively by looking at it would have any success with mechanical arms.
I dabbled a bit in geolocation with LLMs recently. It is still surprising to me how good they are at finding the general area where a picture was taken. Give one a photo of a random street corner on this earth and it will likely tell you not only the correct city or town but, most often, even the correct quarter.
On the other hand, if you ask it for a birds eye view of a green, a brown and a white house on the north side of a one-way street (running west to east) east of an intersection running north to south, it may or may not get it right. If you want it to add an arrow going in the direction of the one-way street, it certainly has no clue at all and the result is 50/50.
I have done this 3-second gesture, and variations of it, my whole life basically, and never noticed I was throwing the glass from one hand to the other without any visual feedback.
And where does this intuition come from? It was built by feeling other sensations in addition to vision. You learned how gravity pulls things down when you were a kid; how hot and cold feel, how hard and soft feel, how things smell. Your mental model of the world is substantially informed by non-visual cues.
> It therefore follows that robots should be able to learn with just RGB images too!
That does not follow at all! It's not how you learned either.
Neither have you learned to think by consuming the entirety of all text produced on the internet. LLMs therefore don't think, they are just pretty good at faking the appearance of thinking.
The same process will be repeated many times while trying to move the glass to its "face", and then when any variable changes (plastic vs. glass, size, shape, location) all bets are off, purely because there just plainly isn't enough information.
If you’re in the US, you likely work with them and they have learned to studiously avoid talking about politics except in vagaries to avoid conflict.
> very scientific. much engineering.
Emphasis on attempt, because you're supposed to use words with grammatically incorrect modifiers, and the first one doesn't. (Even the second one doesn't seem entirely incorrect to me, though I'm not a native speaker.) "many scientific, so engineering", for example, would have worked.
I assume they, or most likely their LLM, tried too hard to follow the most popular sequence (very, much, wow) and failed at it.
>the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.
This reminds me that several years ago Tesla finally had to start explicitly extracting a 3D model from the net. Similarly, I expect this will get pipelined: one model extracts/builds the 3D scene, and another is the actual "robot" working in that 3D space. Each can be trained much better and more efficiently on its own, with much better transfer and generalization, than a large monolithic model working from 2D video. In the pipelined approach, it is also very easy to generate synthetic 3D input data that better covers the interesting scenario space for the "robot" model.
And, for example, you can't just feed the large monolithic model a lidar point cloud instead of videos without significant retraining. Whereas in a pipelined approach, you just swap the input model of the 3D-generating stage.
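The pipelining argument can be sketched in a few lines. Everything below is an illustrative stand-in (not anyone's actual architecture): both sensor front-ends emit the same 3D representation, so the downstream policy never changes when the sensor does:

```python
import numpy as np

def video_to_points(frames):
    # stand-in for a learned video -> 3D reconstruction model
    return np.asarray(frames, dtype=float).reshape(-1, 3)

def lidar_to_points(point_cloud):
    # lidar is already 3D, so this front-end is nearly the identity
    return np.asarray(point_cloud, dtype=float)

def policy(points_3d):
    # stand-in for the downstream controller: e.g. move toward the scene centroid
    return points_3d.mean(axis=0)

# the same scene arriving through two different sensors
video_scene = video_to_points([[0, 0, 0], [2, 2, 2]])
lidar_scene = lidar_to_points([[0, 0, 0], [2, 2, 2]])

# swapping the sensor swaps only the front-end; the policy is untouched
assert np.allclose(policy(video_scene), policy(lidar_scene))
```

A monolithic video-to-action model has no such seam, which is exactly why changing the input modality forces retraining the whole thing.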
Was this actually written by a human being? If so, the author(s) suffer from severe communication problems. It doesn't seem grounded in reality, at least not in my personal experience with robotics. But here's my real-world take:
Robotics is going to be partially solved when ROS/ROS2 becomes effectively exterminated and completely replaced by a sane robotics framework.
I seriously urge the authors to use ROS/ROS2. Show us: implement your solution with ROS, push it to a repository, and allow others to verify what you solved, maybe? Suffer a bit with the framework, and then write a real, hands-on post about real robotics, instead of wandering through fancy, incomprehensible stuff that probably no one will ever do.
Then we can maybe start talking about robotics.
And it's a point of semantics, but no; we generally don't say people who died by suicide died by the things going on in their life when they ended it. Everybody has stressors. The suicidal also have mental illness. Mr. Swartz had self-documented his past suicidal ideation.
But besides that, you're totally right. It's too "loose", since to realize that idea the process would have to be way different (and properly explained).
If I had to guess, it seems likely that there will be a serious cultural disconnect as 20-something deep learning researchers increasingly move into robotics, not unlike the cultural disconnect that happened in natural language processing in the 2010s and early 20s. Probably lots of interesting developments, and also lots of youngsters excitedly reinventing things that were solved decades ago.
1. First create a model that can evaluate how well a task is going; the YT approach can be used here.
2. Then build a real-world robot, and train it by letting it do tasks, and use the first model to supervise it; here the robot can learn to rely on extra senses such as touch/pressure.
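The two-stage loop above can be sketched as follows. Everything here is a toy stand-in (the evaluator just encodes distance to a goal, where a real one would be the video-pretrained progress model): the robot improves purely by proposing actions and keeping whatever the evaluator scores higher:

```python
import numpy as np

rng = np.random.default_rng(0)

TARGET = np.array([0.5, -0.2])  # unknown to the robot; the evaluator encodes it

def evaluator(state):
    # stage 1 stand-in: a pretrained "how well is the task going" score
    # (here simply negative distance to the goal)
    return -np.linalg.norm(state - TARGET)

# stage 2: the robot practices by trial and error against the evaluator
state = np.zeros(2)
for _ in range(500):
    candidate = state + rng.normal(scale=0.05, size=2)  # a small random tweak
    if evaluator(candidate) > evaluator(state):
        state = candidate  # keep only improvements
```

This is just hill climbing, but it captures the division of labor: the evaluator needs no robot data to train, and the robot needs no labels, only the evaluator's score, so extra senses like touch can feed into the policy without ever appearing in the YouTube-trained stage.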
There are an infinite number of scenes that can be matched to one 2D picture. And what is a scene, really? The last time I checked, raw RGB was not a good input for computer vision; CNNs instead build a compositional scene from increasing levels of gradients. None of that is particularly translatable to how an LM works with text.
If you were to just do the exact same robotic "throw" action with a glass of unexpected weight you'd maybe not throw hard enough and miss, or throw too hard and possibly break it.
>> the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.
>> in practice, this means you have to manually fiddle with camera positions until you find the sweet spot. very scientific. much engineering.
>> long-horizon drift
>> try to plan more than a few steps ahead and the model starts hallucinating.
That is to say, not quite ready for the real world, V-JEPA 2 is.
But for those who don't get the jargon there's a scholarly article linked at the end of the post that is rather more sober and down-to-earth:
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
https://arxiv.org/abs/2506.09985
In other words, some interesting results, some new SOTA, some incremental work. But lots of work for a big team of a couple dozen researchers so there's good stuff in there almost inevitably.
https://www.youtube.com/watch?v=4xmckWVPRaI
Capitalia tantum.
Someone's getting peckish :P
Cringely, they are. Nobody who isn't desperate to appear cool would write in that terminally grating register, including when using an LLM to do the writing.
> Pure vision will never be enough because it does not contain information
Say it louder for those in the back! But actually there's more to this that makes the problem even harder. The lack of sensors is just the beginning. There are well-known results in physics to the effect that:
You cannot create causal models through observation alone.
This is a real pain point for these vision world models, and most people I talk to (including many at the recent CVPR) just brush it off with "we just care if it works." Guess what?! Everyone pointing this out also cares that it works! We need to stop these thought-terminating clichés. We're fucking scientists. Okay, so why isn't observation enough? It's because you can't differentiate alternative but equally valid hypotheses. You often have to intervene! We're all familiar with this part: you control variables and modify one, or a limited set, at a time. Experimental physics is no easy task, even for things that sound rather mundane. This is in fact why children and animals play (okay, I'm conjecturing here).
We need to mention chaos here, because it's the easiest way to understand this. Many famous problems fall into this category: the double pendulum, the 3-body problem, or just gas molecules moving around. Let's take the last one. Suppose you are observing some gas molecules moving inside a box. You measure their positions at t0 and at T. Can you predict their trajectories between those time points? Surprisingly, the answer is no. You can only do this statistically; there are probable paths, but not deterministic ones (this same logic is what leads to multiverse theory, btw). But now suppose I was watching the molecules too, and recorded continuously between t0 and T. Can I predict the trajectories? Well, I don't need to; I just write them down.
Now I hear you saying, "Godelski, you observed!" But the catch with this set of problems is that if you don't observe the initial state, you can't predict forwards, and if your observation intervals aren't precise enough, you hit the same problem. If you turn around while I start a double pendulum, you can have as much time as you want once you turn back around; you won't be able to model its trajectory.
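The sensitivity point is easy to demonstrate numerically. A short sketch using the logistic map at r = 4, a standard toy for chaotic dynamics (simpler to simulate than a double pendulum, but the lesson is the same): an observation error of one part in a billion destroys predictability within a few dozen steps:

```python
def trajectory(x0, steps=50, r=4.0):
    # iterate the logistic map x -> r * x * (1 - x), a textbook chaotic system
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = trajectory(0.2)
b = trajectory(0.2 + 1e-9)  # "measurement error" of one part in a billion

# early on the trajectories are indistinguishable...
early_gap = abs(a[3] - b[3])

# ...but by step 40+ they are completely decorrelated
late_gap = max(abs(x - y) for x, y in zip(a[40:], b[40:]))
```

Errors grow roughly exponentially (the map's Lyapunov exponent is ln 2), so every extra digit of measurement precision buys only a constant number of additional predictable steps.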
But it gets worse still. There are confounding variables. There is coupling. There are hypotheses that are difficult to differentiate by causal ordering. And so much more. If you ever wonder why physicists do so much math, it's because doing the math is far easier than running the whole battery of tests and then reverse-engineering the equations from those observations. In physics we care about counterfactual statements: in F = ma we can propose new masses and new accelerations and rederive the results. That's what it is all about. Your brain does an amazing job at this too! You need counterfactual modeling to operate in real-world environments. You have to be able to ask and answer, "what happens if that kid runs into the street?"
I highly suggest people read The Relativity of Wrong [0]. It's a short essay by Isaac Asimov that can serve as a decent intro, though far from complete. I'm suggesting it because I don't want people to confuse "need a counterfactual model" with "need the right answer." If you don't get into metaphysics, these results will be baffling.[1] It is also needed to resolve any confusion you might have about the aforementioned distinction.
Tldr:
if you could do it from observation alone, physics would have been solved a thousand years ago
There's a lot of complexity and depth here that is easy to miss amid the excitement, but it still matters. I'm just touching the surface too, and we're only talking about mechanics: no quantum needed, just information loss.
[0] https://hermiene.net/essays-trans/relativity_of_wrong.html
[1] maybe this is why there are so few physicists working on the world modeling side of ML. At least, using that phrase...
> because you as a human have really good intuition about the world.
This is the line that causes your logic to fail. You introduced knowledge not obtained through observation. In fact, the knowledge you introduced is the whole chimichanga! It is an easy mistake to make, so don't feel embarrassed.
The claim is that one can learn a world model[0] through vision. The parent countered by saying "vision is not enough." Then you countered by saying "vision is enough if you already have a world model."
[0] I'll be more precise here. You can learn *A* world model, but it isn't the one we really care about and "a world" doesn't require being a self consistent world. We could say the same thing about "a physics", but let's be real, when we say "physics" we know which one is being discussed...
> LLMs already have that generalizability
This is not a proven statement. In fact, it's pretty clear that they don't. They have some generalization, but not enough for what you're inferring. The best way to see this is to carefully talk to an LLM about anything in which you have deep domain expertise. Be careful not to give it answers (information leakage can sneak in subtly) and look specifically for the small, subtle details (that's why it needs to be a topic you know well). "The smell" will be right, but the information won't. Also, LLMs these days aren't trained on just language.
> if you are fluent in the jargon surrounding state of the art LLMs and deep learning
It is definitely not following that jargon. Maybe it follows tech-influencer blog-post jargon, but I can definitively say it doesn't follow the jargon used in research, and they are summarizing a research paper. Consequently, they misinterpret things and use odd phrases like "actionable physics," which is redundant: "a" physics model is necessarily actionable, since it is required to be a counterfactual model. While I can understand rephrasing for a more general audience, that's a completely different thing from "being fluent in SOTA work." It's literally the opposite... Also, it definitely doesn't help that they remove all capitalization except in nouns.
> Doesn't seem to be grounded at least with reality and my personal experience with robotics.
It also doesn't match my personal experience with physics or ML, and I have degrees in both. You cannot develop accurate world models through observation alone, full stop.
You cannot verify accurate world models through benchmarks alone, full stop.
These have been pain points in physics for centuries and have been the major pain point even before the quantum revolution. I mean if it were possible, we'd have solved physics long ago. You can find plenty of people going back thousands of years boldly claiming "there is nothing new to be learned in physics," yet it was never true and still isn't true even if we exclude quantum and relativity.
Side note: the paper itself is "fine," but I wish we didn't put so much hype into academic writing. Papers should be aimed at other academics and should not be advertisements (use the paper to write advertisements like IFLS or Quanta Magazine pieces, but don't degrade the already difficult researcher-to-researcher communication). So the experiments are fine and the work represents progress, but it is oversold, and the conclusions do not necessarily follow.
Btw, the paper makes these mistakes too. It makes a very bold assumption that counterfactual models (aka a "world model") are learned. This cannot be demonstrated through benchmarking, it must be proven through interpretability.
Unfortunately, the tail is long and heavy... you don't need black-swan events to disrupt these models, and boy does this annoying fact make it easy to "hack" them. Frankly, I don't think we want robots operating in the wild (public spaces, as opposed to controlled spaces like a manufacturing floor) if I can make one think an iPhone is an apple with just a sticky note. Sure, you can solve that precise example, but it's not hard to come up with others. It's a cat-and-mouse game, and remember: Jerry always wins.
And when I took real classes in a real Cessna, this experience was transferable (aka the flying model I had in my brain was very similar to the one I experienced with my full body in the cockpit).
This looks like a good brief overview (I only skimmed it but wanted to give you more than "lol, google it") http://smithamilli.com/blog/causal-ladder/
Things that randomly change shape or appearance are also very difficult to interact with safely. The force sensing platform from Universal Robots is safer for users, but it has limitations like all platforms. =3
[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration - John Hooker - CMU (2023) [video]:
https://www.youtube.com/live/TknN8fCQvRk
[2] "We Really Don't Know How to Compute!" - Gerald Sussman - MIT (2011) [video]:
https://youtube.com/watch?v=HB5TrK7A4pI
[3] Google OR-Tools:
https://developers.google.com/optimization
[4] MiniZinc:
Reality: Most value is in shrinking things, excluding humans, automating management, carefully designed process, and specialist hardware that does a subset of things very well. Relying on human(oid)s is a sure-fire way to suck.
What should the government (executive or judicial) have done differently to balance the needs of the accused vs. the needs of the enforcement and adjudication of the law here?
Perhaps we could craft a way to hold people with mental health issues to the same standards we are all held to while simultaneously being more sensitive to their needs. But in general, his story is the unfortunate tragedy of a sick person who took his own life under a stress that doesn't kill most other people, and we adjust the way we prosecute crime at our own peril. It is, as I said elsewhere, the mother of all Chesterton's Fences. Which is not to say it cannot or should not be improved! Only that it be done with great care.
And to be completely clear: Swartz ripped content via back-dooring a secured network physically, in a closet, and (it is alleged) planned to dump that content in public. We'll never really know since he (or his illness) denied himself his day in court, and that's a tragedy; he may have successfully defended himself, or could have been a living example of persevering anyway like Mitnick instead of a martyr. Companies using their authorized accounts to scrape Google are likely at most guilty of a TOS violation and Google may choose to cut their accounts, but it's very hard to make a case that the Google API saying, over and over again, "Yes you may view that video" constitutes either unauthorized access or exceeding the bounds of access under 9-48.000.
It's hard to comment on whether Swartz violated the CFAA. Since he wasn't tried, we'll never really know. He exited life before justice could happen one way or the other.
I agree that a system of laws has benefit to society. However the system we've worked out for making such laws is clearly being warped and twisted to serve one small section of society at the expense of everyone else.
A clear case being the comment that started this conversation: Swartz was hounded to death for doing the exact same thing that AI companies are now doing while facing zero punishment. AI executives are not being dragged from their offices by burly policemen and thrown into cells, yet they have done the exact same thing that earned Swartz that treatment. It's not unreasonable to question the societal benefit of this system.
And we totally should say that people died of depression, or financial stress, or legal persecution, or whatever. Most people have suicidal ideation at some point in their lives, that's not unusual. Being hassled to the point where you go through with it is definitely violence. Classing this as "mental illness" and therefore a personality defect is a form of victim blaming.
It's not, and I don't think you're seriously arguing this point so I'm going to ignore it.
It is, I think, a reasonable observation that had Swartz formed an LLC to pursue advanced analysis of academic papers for, I don't know, trends in the language used in research, and slurped a bunch of JSTOR for that purpose, the trial would have taken longer and involved more lawyers. That's probably an observation that should give us pause. Or not, because nobody argued that's what he did or that was his intent, including him. So I also think the premise of comparison to the current circumstances is flawed; I don't think the CFAA can be applied in a context where people have access rights and go through Google's front door to scan videos for the purpose of training a machine learning algorithm. It might be a TOS violation. It's not hiding a server in a closet with unauthorized physical access, which is what Swartz was accused of.
Intent matters, and, sadly, we never got to the trial where intent could have been proven out.
> Being hassled to the point where you go through with it is definitely violence.
The government does have the monopoly on violence. But I think what happened to Swartz is a far cry from that, as he never got to sentencing, much less trial. There was some light compulsion (requirement to appear in court), of course. But everyone who's ever wanted to contest a parking ticket has to experience that. Sadly, this train of thought goes into a station of "Swartz should have been under professional care if his condition was this much a danger to him," and I don't know how the government should change its behavior if he wasn't. Prosecutors are not prognosticators of the mental health of defendants, and I've never read anywhere that Swartz wanted to be committed for mental illness.
Our system is much harder for defendants grappling with mental illness; I'll acknowledge and argue for change regarding that. I don't know that such change would conclude with "Swartz should never have been accused of committing a crime that a lot of evidence suggests he committed," however.
Your point that Swartz would have had a different result had he formed an LLC and hired a bunch of lawyers is definitely the key one here. A legal system that only works for the rich and powerful is not something we should defend, support, or put up with.
His purpose in copying research papers and making them available for free is massively more in the public interest than anything the AI companies are doing. They are, after all, seeking to make a profit at the end of this. And they knowingly and deliberately broke copyright law because it was "too hard" to make any kind of licensing deal with the publishers. You can argue about fair use and transformative purposes (as their lawyers have done), but you can also argue from Swartz's point of view that this information was (to a large extent) publicly funded and therefore belonged to the public, and trying to get the journals to acknowledge that is "too hard". And had he been able to afford lawyers, that's a possible line they could have taken. But he didn't get the chance. As you say, we never got to the trial so we will never know.
It's definitely not a stretch to say that his crime and the AI companies' crimes (which they admit to - they admit to downloading source texts from pirate sites) are comparable, even equivalent. Yet their treatment is not.
My understanding of his treatment is that it was a lot more than "light compulsion": he underwent a sustained campaign of enforcement activity and prosecution at the hands of a specific prosecutor. But given that the AI companies have faced nothing criminal at all, just a civil case brought by the authors they admit to ripping off, I don't think I need to push this point. They are clearly being treated differently from him, despite the similar actions.
That's the thing about copyright; it's a whole category of law more based in utility than morality. One of the reasons AI is such a fight right now is that nobody was opposing it as an academic project when it was generating, for example, tools that could go from an image to describing the image, or from an image to recognizing the likely artistic style and helping somebody find the original artist. But with just a few tweaks those tools became devices for generating novel images, and now people are upset. Intent matters.
And again, you are drawing equivalence between harvesting data from openly accessible sources online and hiding a server in a closet with unauthorized physical access to a network. Swartz's prosecution wasn't accusing him of copyright violation; it was accusing him of compromising a network. A far more serious charge; if the researchers in the story here had collected those YouTube videos by wiretapping the fiber optics between two of Google's data centers I suspect they would have concerns.