If the intern "had no experience with the AI lab", is it the right thing to do to fire them, instead of admitting that there is a security/access fault internally? Can other employees (intentionally, or unintentionally) cause that same amount of "damage"?
https://x.com/le1du/status/1847144170705785239
Rumor says an intern at ByteDance was jailed for sabotaging their GPU cluster. Over 8,000 H100 GPUs ran corrupted code for a month, all because he was frustrated with resources being diverted from his research to a GenAI project.
I was told the intern used a bug in Hugging Face's load-checkpoint function to inject bad code. The code randomly changed other tasks' parameters and put them to sleep, targeting only training tasks using more than 256 cards.
You could track down the direct Chinese rumor, but you'd have to leave the cyber basement. Big no-no for HN, which can't even eat Americanized Chinese digital food like TikTok (Chinese version: https://portal.sina.com.hk/others/sina/2024/10/20/1013680/%E...).

Interesting that my experience has been the exact opposite.
Whenever I’ve participated in COE discussions (incident analysis), questions have been focused on highlighting who made the mistake or who didn’t take the right precautions.
One thing I suspect investors in e.g. OpenAI are failing to price in is the political and regulatory headwinds OpenAI will face if their fantastical revenue projections actually materialize. A world where OpenAI is making $100B in annual revenue will likely be a world where technological unemployment looms quite clearly. Polls already show strong support for regulating AI.
As a sibling said, you were likely in a bad org, or one that was using COEs punitively.
You're not firing the person because they broke stuff; you are firing them because they tried to break stuff. If the attempt had been a failure and caused no harm, you would still fire them. It's not about the damage they caused, it's that they wanted to cause damage.
I do not see any mention of other legal action, and the article is shallow.
It might've been that someone in the chain of command called it "malicious" to cover up his own mistakes. I think that is the parent poster's point in writing out the Amazon story.
I'm trying to think of whether it'd be worth starting some kind of semi-Luddite community where we can use digital technology, photos, radios, spreadsheets and all, but the line is around 2014, when computers still did the same thing every time. That's my biggest gripe with AI, the nondeterminism, the non-repeatability making it all undebuggable, impossible to interrogate and reason about. A computer in 2014 is complex but not incomprehensible. The mass matrix multiplication of 2024 computation is totally opaque and frankly I think there's room for a society without such black box oracles.
> TikTok owner, ByteDance, says it has sacked an intern for "maliciously interfering" with the training of one of its artificial intelligence (AI) models.
Are there other links with additional info?
> ByteDance also denied reports that the incident caused more than $10m of damage
It makes clear what ByteDance's official position is, while pretty clearly hinting that it might not be true.
https://juejin.cn/post/7426926600422637594
> He exploited a vulnerability in huggingface's load ckpt function to inject code, dynamically modifying other people's optimizers to randomly sleep for a short period of time, and modifying the direction of parameter shaving. He also added a condition so that only tasks with more than 256 cards would be affected.
Okay, yeah, that's malicious and totally a crime. "Modifying the direction of parameter shaving" means he subtly corrupted his co-workers' work. That's wild!
That absolutely kills open source, and it's disguised as a "safety" bill where safety means absolutely nothing (how are you "shutting down" an LLM?). There's a reason Anthropic was championing it even though it evidently regulates AI.
Summary:
10/18:
Translation of the provided text:
Title: Urgent Warning
The “reputation washing” behavior of Tian Keyu has been extremely harmful
For the past two months, Tian Keyu has maliciously attacked the cluster code, causing significant harm to nearly 30 employees of various levels, wasting nearly a quarter’s worth of work by his colleagues. All records and audits clearly confirm these undeniable facts:
1. Modified the PyTorch source code of the cluster, including random seeds, optimizers, and data loaders.
2. Randomly killed multi-machine experiment processes, causing significant experiment delays.
3. Opened login backdoors through checkpoints, automatically initiating random process terminations.
4. Participated in daily troubleshooting meetings for cluster faults, continuing to modify attack codes based on colleagues’ troubleshooting ideas.
5. Altered colleagues’ model weights, rendering experimental results unreproducible.
It's unimaginable that Tian Keyu could continue his attacks with such malice: watching colleagues' experiments inexplicably interrupted or failing, listening to their debugging strategies and specifically modifying his attack code in response, and witnessing colleagues work overnight with no progress. After being dismissed by the company, he received no penalties from his school or advisors, and even began to whitewash his actions on various social media platforms. Is this the school's and advisors' tolerance of Tian Keyu's behavior? We expect this disclosure of evidence to attract the attention of the relevant parties and for definitive penalties to be imposed on Tian Keyu, reflecting the social responsibility of higher-education institutions to educate and nurture.
We cannot allow someone who has committed such serious offenses to continue evading justice, even beginning to distort facts and whitewash his wrongdoing! Therefore, we decide to stand on behalf of all justice advocates and reveal the evidence of Tian Keyu’s malicious cluster attack!
Tian Keyu, if you deny any part of these malicious attack behaviors, or think the content here smears you, please present credible evidence! We are willing to disclose more evidence as the situation develops, along with your shameless ongoing attempts to whitewash. We guarantee the authenticity and accuracy of all evidence and are legally responsible for the content of the evidence. If necessary, we are willing to disclose our identities and confront Tian Keyu face-to-face.
Thanks to those justice advocates, you do not need to apologize; you are heroes who dare to speak out.
Link to the inquiry recording of Tian Keyu: https://www.youtube.com/watch?v=nEYbYW--qN8
Personal homepage of Tian Keyu: https://scholar.google.com/citations?user=6FdkbygAAAAJ&hl=en
GitHub homepage of Tian Keyu: https://github.com/keyu-tian
10/19:
Clarification Regarding the “Intern Sabotaging Large Model Training” Incident
Recently, some media reported that “ByteDance’s large model training was attacked by an intern.” After internal verification by the company, it was confirmed that an intern from the commercial technology team committed a serious disciplinary violation and has been dismissed. However, the related reports also contain some exaggerations and inaccuracies, which are clarified as follows:
1. The intern involved maliciously interfered with the model training tasks of the commercial technology team’s research project, but this did not affect the official commercial projects or online operations, nor did it involve ByteDance’s large model or other businesses.
2. Rumors on the internet about “involving over 8,000 cards and losses of millions of dollars” are greatly exaggerated.
3. Upon verification, it was confirmed that the individual in question had been interning in the commercial technology team, and had no experience interning at AI Lab. Their social media bio and some media reports are incorrect.
The intern was dismissed by the company in August. The company has also reported their behavior to the industry alliance and the school they attend, leaving further actions to be handled by the school.
Zvi says this claim is false: https://thezvi.substack.com/p/guide-to-sb-1047?open=false#%C...
>how are you "shutting down" an LLM?
Pull the plug on the server? Seems like it's just about having a protocol in place to make that easy in case of an emergency. Doesn't seem that onerous.
If you look at what he did it was definitely 100% actively malicious. For instance, his attack only executes when running on >256 GPUs. He inserted random sleeps to slow down training time and was knowledgeable enough to understand how to break various aspects of the loss function.
He then sat in meetings and adjusted his attacks when people were getting close to solving the problem.
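For readers wondering what "only executes when running on >256 GPUs" could look like in practice, here is a minimal, hedged sketch. All names are hypothetical and this is not the actual code, just the mechanism the reports describe: a wrapper that injects intermittent sleeps into an optimizer step only above a card-count threshold, so small-scale debugging runs never reproduce the fault.

```python
import random
import time

# Hypothetical illustration of the rumored sabotage mechanism: slow down
# only large jobs, so the fault vanishes whenever someone tries to debug
# it on a small allocation.
CARD_THRESHOLD = 256

def wrap_step(step_fn, world_size):
    """Return a step function that is intermittently delayed on big jobs."""
    def sabotaged_step(*args, **kwargs):
        if world_size > CARD_THRESHOLD:
            time.sleep(random.uniform(0.0, 0.05))  # intermittent slowdown
        return step_fn(*args, **kwargs)
    return sabotaged_step

# Stand-in for optimizer.step on a 512-card job: the step still runs and
# still produces correct results, it is just mysteriously slower.
calls = []
step = wrap_step(lambda: calls.append(1), world_size=512)
step()
print(len(calls))  # 1
```

The nastiness is that nothing is broken outright: training still converges, just slowly and erratically, which is exactly the kind of symptom that sends a team chasing hardware and networking ghosts for weeks.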
If you don’t know what happened and can’t ask more details about it, how can you possibly reduce the likelihood (or impact) of it in the future?
Finding out in detail who did it does not require you to punish that person and having a track record of not punishing them helps you find out the details in future incidents.
And sorry, people are not "gullible" for disbelieving the media. I have worked at most big tech companies and the media misreports so badly on easily verifiable things in my area of expertise, that I no longer trust them on much. https://en.m.wikipedia.org/wiki/Michael_Crichton#Gell-Mann_a...
Also, what kind of outfit is ByteDance if an intern can modify (and attack) runs that are on the scale of 256 GPUs or more? We are talking at least ~USD 8,000,000 in terms of the hardware cost to support that kind of job and you give access to any schmuck? Do you not have source control or some sort of logging in place?
2014 is when I became aware of gradient descent and how entropy was used to search more effectively, leading to different runs of the same program arriving at different results. Deep Dream came soon after, and it's been downhill from there.
If I were to write some regulations for what was allowed in my computing community, I would make an exception for using PRNGs for scientific simulation and cryptographic purposes, but I would definitely draw the line at using heuristics to find optimal solutions. Slide rules got us to the moon, and that's good enough for me.
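The repeatability being mourned in this subthread is easy to state in code: even a program that uses randomness is fully replayable if the PRNG is seeded, which is what made pre-deep-learning software debuggable. A toy sketch:

```python
import random

# A tiny model of 2014-style determinism: seed the PRNG and every run of
# the "program" produces the identical trajectory, so any behavior can be
# replayed and inspected step by step.
def simulation(seed):
    rng = random.Random(seed)  # isolated, seeded generator
    return [rng.randint(0, 9) for _ in range(5)]

run1 = simulation(42)
run2 = simulation(42)
print(run1 == run2)  # True -- same seed, same trajectory, every time
```

The contrast with large-scale GPU training is that even with fixed seeds, non-associative floating-point reductions across thousands of devices can make bitwise-identical reruns genuinely hard to achieve.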
The AI advocates actively advertised AI as a tool for replacing creatives, including plagiarizing their work, and copying the appearance and voices of individuals. It's not really surprising that everyone in the creative industries is going to use what little power they have to avoid this doomsday scenario.
"No blame, but no mercy" is one of these adages; while you shouldn't blame individuals for something that is an organization-wide problem, you also shouldn't hold back in preventing it from happening again.
Rumor has it that his motivation was simply to sabotage colleagues' work because managers decided to give GPU-resource priority to those working on DiT models, while he worked on autoregressive image generation. I don't know exactly what his idea was; maybe he thought that by destroying internal competitors' work he could get his GPU quota back?
> Also, what kind of outfit is ByteDance if an intern can modify (and attack) runs that are on the scale of 256 GPUs or more?
Very high. These research labs are basically run on interns (not by interns, but a lot of ideas come from interns, a lot of experiments executed by interns), and I actually mean it.
> Do you not have source control or some sort of logging in place?
Again, rumors said that he gained access to prod jobs by inserting RCE exploits (unsafe pickle, yay, in 2024!) into foundation model checkpoints.
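For anyone unfamiliar with why "unsafe pickle" implies remote code execution: unpickling can invoke arbitrary callables via `__reduce__`, and `torch.load` on a classic `.ckpt`/`.pt` file is pickle underneath. A minimal, benign demonstration using only the standard library (the `eval` payload here is a harmless stand-in for anything an attacker wants to run):

```python
import pickle

# Minimal illustration of the pickle hazard: during unpickling, pickle
# calls whatever callable __reduce__ returns. A checkpoint serialized
# with pickle can therefore carry executable code, not just tensors.
class Payload:
    def __reduce__(self):
        # A real attack would call os.system or patch the training loop;
        # eval on a constant just proves code runs at load time.
        return (eval, ("40 + 2",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # executes eval("40 + 2") during load
print(result)  # 42 -- the "checkpoint" is whatever the code returned
```

This is exactly why loading a colleague's checkpoint is a trust decision, not a data operation.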
Also, as you hinted, you can't exactly lump these top-conference-publishing, PhD-student equivalents in with typical "interns". Many are extremely capable. ByteDance wants to leverage their capabilities, and likely wants to leverage them fast.
I do understand that interns (who are MSc and PhD students) are incredibly valuable, as they drive progress in my own world too: academia. But my point was not so much about access to the resources as the fact that apparently they were able to manipulate data, code, and jobs from a different group. Looking forward to future details. Maybe we have a mastermind cracker on our hands? But my bet is rather on awful security and infrastructure practices on the part of ByteDance, for a cluster that allegedly is in the range of ~USD 250,000,000.
Most major Western news media are sourcing at least some China stories from WeChat and Sina Weibo before it gets scrubbed by censors.
> my bet is rather on awful security and infrastructure practices
For sure. As far as I know ByteDance does not have an established culture of always building secure systems.
You don't need to be a mastermind cracker. I've used/built several systems for research computing and the defaults are always... less than ideal. Without a beefier budget and a lot of luck (cause you need the right people) it's hard to have a secure system while maintaining a friendly, open atmosphere. Which, as you know, is critical to a research lab.
Also,
> from a different group
Sounds like it was more like a different sub-team of the same group.
From what I heard I'd also argue that this could be told as a weak supply chain attack story. Like, if someone you know from your school re-trained a CLIP with private data, would you really think twice and say "safetensors or I'm not going to use it"?
Btw one of the rumors has that it is even difficult to hire engineers to do training/optimization infra at one of those ML shops -- all they want to hire are pure researcher types. We can imagine how hard it will be to ask for resources to tighten up security (without one of these incidents).
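On the "safetensors or I'm not going to use it" point upthread: when a data-only format isn't an option, the standard-library mitigation (documented in the `pickle` module's security notes) is an `Unpickler` that refuses to resolve globals, so a pickled blob can only carry plain data. A sketch of that pattern, not ByteDance's setup:

```python
import io
import pickle

# Defensive sketch: a restricted unpickler that refuses to resolve any
# global, so a "checkpoint" can only contain plain data (numbers, lists,
# dicts, strings) -- no callables, hence no code execution at load time.
class DataOnlyUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        raise pickle.UnpicklingError(
            f"refusing to load global {module}.{name}")

def safe_loads(blob):
    return DataOnlyUnpickler(io.BytesIO(blob)).load()

# Plain data round-trips fine:
ok = safe_loads(pickle.dumps({"lr": 0.001, "steps": [1, 2, 3]}))
print(ok)  # {'lr': 0.001, 'steps': [1, 2, 3]}

# A payload that smuggles in a callable is rejected:
class Payload:
    def __reduce__(self):
        return (eval, ("1+1",))

try:
    safe_loads(pickle.dumps(Payload()))
except pickle.UnpicklingError as e:
    print("blocked:", e)
```

This blocks code execution but also blocks legitimate class instances, which is why data-only tensor formats like safetensors are the cleaner long-term answer for model weights.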
In some cases, like interns, we probably just took their commit access away or blocked their direct push access. Nowadays interns can't touch critical systems and can't push code directly to prod packages.
What's the explanation? That they are explicitly allowed for some strategical reason? Something else?
Edit: @dang: Sorry in advance. I do feel like we got some pretty good discussion around this explosive topic, at least in its first hour.
Folks, keep up the good behavior — it makes me look good.
China, and all other (supposedly) top-down-economies, survive only because their control is not airtight. If they were to actually have complete control, things would fall apart rapidly. “No one knows how Paris is fed” and all that.
"the kind of control you're attempting simply is... it's not possible. If there is one thing the history of evolution has taught us it's that life will not be contained."
Humans are clever and typically find workarounds given enough time/hope. Sure you could argue that this is some kind of authoritarian 4D chess/matrix scenario to let off steam for an unruly populace, or it's just the natural course of things.
https://www.bloomberg.com/opinion/articles/2023-08-14/china-...
Remember this: freedom is a pure idea. It occurs spontaneously and without instruction. Random acts of insurrection are occurring constantly throughout the galaxy. There are whole armies, battalions that have no idea that they’ve already enlisted in the cause.
Remember that the frontier of the Rebellion is everywhere. And even the smallest act of insurrection pushes our lines forward.
And remember this: the Imperial need for control is so desperate because it is so unnatural. Tyranny requires constant effort. It breaks, it leaks. Authority is brittle. Oppression is the mask of fear.
Remember that. And know this, the day will come when all these skirmishes and battles, these moments of defiance will have flooded the banks of the Empire's authority and then there will be one too many. One single thing will break the siege.
Remember this: try.
It's mainly just that there's more politically motivated manipulation... versus in the west where those tools would be used on things like copyright infringement, pornography, and misinformation etc.
If something is totally forbidden, that holds.
However, the government doesn’t want people to feel oppressed beyond the explicitly forbidden.
What happens instead is, if it’s unfavorable but not forbidden, it will be mysteriously downvoted and removed, but if it keeps bubbling up, the government says “okay clearly this is important to people” and leaves it up.
This happened with some news cases of forced marriage in some rural mountain regions, and the revelation that a popular WeChat person (like YouTuber) was involved with one of the families.
All communication software (QQ/WeChat are the two most used) has some sort of backend scanner that detects topics on the current "in-season" blacklist and bans accounts accordingly. No one knows what is on the list, so people can get banned for seemingly random reasons, but in general, bashing current policies or calling out the names of the standing members of the Politburo is the quickest way to get banned -- and in many instances it also gets the WeChat group banned.
On the other side, surprisingly, there is plenty of apparently inappropriate content floating around on social media without getting banned. This also throws people off.
What I gathered is:
- Don't shit on current party leaders. Actually, don't discuss current politics at all. The AIs don't always recognize content correctly, so you could be banned for supporting one side or opposing it just the same.
- Don't ever try to call on other people to join an unofficial cause, whatever it is. Even if it's purely patriotic, just don't do it. Do it and you might go to prison very quickly -- at the very least, someone is going to call you and tell you to STFU. Grassroots movements are the No. 1 enemy of the government, and they don't like them. You have to go through official channels for those.
This leads to the following conclusion:
Essentially, the government wants as much control as possible. You want to be patriotic? Sure, but it has to be controlled patriotism. You can attend a party gathering to show your patriotism, but creating your own unofficial gathering is a big no. They probably won't put you in prison if the cause is legit, but police are going to bug you from time to time.
IMO this is how the CCP succeeds. It has successfully switched from an ideological party to an "all-people" party. It doesn't really care about ideology, but it wants to assimilate everyone who could potentially be out of control. If you are a successful businessman, it will invite you to participate in political life. If you are an activist who can call up thousands of people, it wants you in. It is, essentially, a cauldron of elitists. It has nothing to do with "Communism". It is, essentially, GOP + DEM in the US.
There's also an analogy in how the image is of communist central planning, but post-Deng it's maybe even more of a freewheeling capitalist economy in some regions than the US (especially in Shenzhen -- see Bunnie Huang's write-ups of the ecosystem/economies there).
"Item number 12. We feel like this URL is hurtful to the Chinese people"
I don’t see any implication of this news which would undermine their society, or cause disruption, or make people riot. If anything it is a tepid warm “do your job correctly and don’t be too clever by half or else…” story.
Why would they flex their muscles for this one?
1. There is no such thing as a single entity of government, CCP is not a person, each individual member of the party and government has his/her own agenda. Each level of government has its own goals. But ultimately it's about gaining control and privileges.
2. It is impossible to control 1.3-1.4 billion people all the time, so you make compromises.
3. The main point is: the tight control is both for and rooted in hierarchical power. To put it plainly, anything goes as long as it doesn't undercut the CCP's control. OSHA? WTF is that, lol. Law? "If you talk to me about law, I laugh in your face," says the head of a municipal "Rule of Law Office". "Don't talk to me about law this and law that," says the court. But the moment you order a picture of Winnie the Pooh carrying wheat (Xi once said he carried 100 kg of wheat on one shoulder) on Alibaba, your account gets banned.
Off-topic thoughts: because the CCP has total control, there is no separation of powers to speak of, so when they are right, they are very right; but when they are wrong, it is catastrophically wrong and there is no changing course. It's why you see 30-50 million people starve to death and an economic miracle within the same half-century.
It doesn't really concern the everyman on the street.
The few high-profile cases where it was used to punish individuals who ran afoul of some politically powerful person, or caused some huge outrage, are red herrings -- if the system didn't exist, they'd have found some other way to punish them.
https://www.congress.gov/bill/118th-congress/house-bill/1157
Oh. Had to look it up.
(6) to expose misinformation and disinformation of the Chinese Communist Party’s or the Government of the People’s Republic of China’s propaganda, including through programs carried out by the Global Engagement Center; and
(7) to counter efforts by the Chinese Communist Party or the Government of the People’s Republic of China to legitimize or promote authoritarian ideology and governance models.
——-
Feels like the defense sector is determined to make China a perpetual enemy.
Did the intern post a manifesto or something? What was the point of doing this?
(via https://news.ycombinator.com/item?id=41906970, but we merged that thread hither)
But in this case I'm guessing their incident analysis teams also get an unrelated person added to them, in order to have an outside perspective? Seems confusing to overload the term like that, if that's the case.
I worked at AWS for 13 years, was briefly in the reliability org that owns the COE (post-incident analysis) tooling, and spent a lot of time on "ops" for about 5 years.
I've never heard of an individual being terminated or meaningfully punished for making an earnest mistake, regardless of impact. I do know of people who were rapidly term'd for malicious, or similar, actions, like sharing internal information or (attempting to) subvert security controls.
On the whole I did see Amazon “do the right thing” around improving process and tools; people are a fallible _part_ of a system, accountability requires authority, incremental improvements today over a hypothetical tomorrow.
> an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
As of a while back that entire state management subsystem, which dates from the very beginning of AWS, has been replaced.
Source: me. I was oncall for (some of) the incident management of that event.
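In its public postmortem, AWS said it modified the tool involved to remove capacity more slowly and added safeguards against removing capacity below a subsystem's minimum. A hedged sketch of that kind of guardrail (hypothetical tool and names, not Amazon's actual code):

```python
# Hypothetical illustration of a post-incident safeguard: a capacity
# removal command that refuses to take out more than a small fraction of
# the fleet in one step, regardless of what the operator typed.
MAX_REMOVAL_FRACTION = 0.05  # never remove more than 5% at once

def remove_servers(fleet, requested):
    limit = max(1, int(len(fleet) * MAX_REMOVAL_FRACTION))
    if len(requested) > limit:
        raise ValueError(
            f"refusing to remove {len(requested)} servers; cap is {limit}")
    removed = set(requested)
    return [s for s in fleet if s not in removed]

fleet = [f"srv-{i}" for i in range(100)]
print(len(remove_servers(fleet, ["srv-0", "srv-1"])))  # 98
```

The design point is that the guardrail lives in the tool, not in the runbook: a fat-fingered argument becomes a refused command instead of an outage.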
You can usually guess from context, but what you say is "we need a bar raiser for this hiring loop", or "get a bar raiser for this COE", or "get a bar raiser for the UI"; there are qualified bar raisers for each setting.
I wonder if we applied this culture talk to Western companies how funny it would sound.
The reason Facebook is firing so many people is that individualism "is far more important for them than 'teaching lessons' to anyone, particularly employees who are probably considered expendable."
But my understanding of this case is that the actions do not appear to be simple, easy-to-make mistakes. As I understand it, the claim is that the intern was modifying the weights of other people's training checkpoints in an effort to make his own work look better. Mucking about in a checkpoint is not a very common thing to do, so it should raise suspicion in the first place. On top of this, it appears he was exploiting weaknesses and injecting code to mess with people's optimizers, and doing things that have no reasonable explanation.
So as far as I can tell, not only was he touching files he shouldn't have been touching (and yes, shouldn't have had access to), he was taking steps to bypass the blocks that were in place and was messing with them in ways that are very difficult to explain away with "I thought this might be a good idea" (things that explicitly look like a bad idea). If that is in fact what happened, I think it is not a reach to call it intentional sabotage. Because if it wasn't, then the actions represent such a level of incompetence that he is a huge liability to anyone within reach.
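One cheap mitigation for the checkpoint-weight tampering described above is integrity checking: record a digest when a checkpoint is written and verify it before resuming, so silent modification of saved weights is at least detectable. A minimal sketch (illustrative only; file names are made up):

```python
import hashlib
import os
import tempfile

# Record a SHA-256 digest at checkpoint-save time and compare it before
# resuming training; any byte-level tampering changes the digest.
def digest(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "model.ckpt")
    with open(ckpt, "wb") as f:
        f.write(b"weights-v1")       # stand-in for saved weights
    recorded = digest(ckpt)          # stored alongside the checkpoint

    # ... later, someone silently rewrites the saved weights ...
    with open(ckpt, "wb") as f:
        f.write(b"weights-v2")

    tampered = digest(ckpt) != recorded

print(tampered)  # True -- the modification is detectable before resume
```

This doesn't stop an insider who can also rewrite the recorded digest, but it forces tampering to leave a wider trail, which is most of what an audit needs.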
[0] https://www.cia.gov/static/5c875f3ec660e092cf893f60b4a288df/...
This wasn’t an accident, though. The intern had malicious intent and was intentionally trying to undermine other people’s work.
This isn’t a case where blameless post-mortems apply. When someone is deliberately sabotaging other people’s work, they must be evicted from the company.
Old Age and Treachery Beats Youth and Enthusiasm, Every Time.
Looks like this guy tried the "treachery" part before he had the "old age" part down.

1 in 100,000 supposes that there are... hmm... about eighty thousand non-evil people in the world, and (odds are) exactly none of them are Marshallese and about 2 are Samoan, to give a sense of how silly this is.
Yes, the intern was actively behaving maliciously, but why? What did he stand to gain from breaking another team's training code? I don't buy that he went through all that effort and espionage simply to make his own work look better. An intern is only employed for 3 months, surely sabotaging another team's multi-year project is not the most efficient way to make your toy 3-month project look better in comparison.
And that wasn’t even a mistake the SDEs made — they were punished for the economists being reckless and subsequently bullied out of the company, despite the SDEs trying to raise the alarm the whole time.
It didn't override safeguards, but they sure wanted you to think that something unusual was done as part of the incident. What they executed was a standard operational command. The problem was, the components that that command interacted with had been creaking at the edges for years by that point. It was literally a case of "when", and not "if". All that happened was the command tipped it over the edge in combination with everything else happening as part of normal operational state.
Engineering leadership had repeatedly raised the risk with further up the chain and no one was willing to put headcount to actually mitigating the problem. If blame was to be applied anywhere, it wasn't on the engineer following the run book that gave them a standard operational command to execute with standard values. They did exactly what they were supposed to.
Some credit where it's due, my understanding from folks I knew still in that space, is that S3 leadership started turning things around after that incident and started taking these risks and operational state seriously.
Naturally, older people had more time to do that than younger people. This is why most young people get their shins blasted while older people just get a slap on the wrist, if they're found out.
I think maybe 1 in 100k is actually anything special, but odds are you aren't special, you just noticed that 20% of the population is as gifted/motivated/constructive as you are (statistically speaking, assuming a bell curve).
And of those, yes, some small percentage will still feel "special" and affronted that other people have the same ideas/goals/desires as them.
It's a rat race and it's not your fault.
https://www.cbsnews.com/news/xu-yao-death-sentence-poisoning...
I’ve heard mixed things about CDO, positive things about AWS, but where I worked in Devices and FinTech were both wild… to the point FinTech (circa 2020) didn’t even use the PRFAQ/6-pager methodology. Much to the surprise of people in CDO I asked for advice.
(There are plenty of people bandwagonning on Musk hate, and definitely some for his political bent, but there are also plenty of totally valid and non-political reasons to have disdain for him)