192 points beedeebeedee | 28 comments
peterkos ◴[] No.41900587[source]
I'm reminded of a time that an intern took down us-east-1 on AWS by modifying a configuration file they shouldn't have had access to. Amazon (somehow) did the correct thing and didn't fire them; instead, they used the experience to fix the security hole, since the intern shouldn't have had that access in the first place.

If the intern "had no experience with the AI lab", is it the right thing to do to fire them, instead of admitting that there is a security/access fault internally? Can other employees (intentionally, or unintentionally) cause that same amount of "damage"?

replies(12): >>41900622 #>>41900627 #>>41900641 #>>41900805 #>>41900919 #>>41901069 #>>41901814 #>>41903916 #>>41909887 #>>41910021 #>>41910134 #>>41910235 #
1. grogenaut ◴[] No.41900641[source]
From what I've seen at Amazon, it's pretty consistent that they don't blame the messenger, which is how they think of the person who messed up. Usually that person is just the last in a long series of decisions that could have prevented the issue, so why blame them? That is, unless the person is a) acting with malice, or b) has repeatedly shown a pattern of willful ignorance. IIRC, when one person took down S3 with a manual command overriding the safeguards, the action was not to fire them but to figure out why it was still a manual process without sign-off. Say what you will about Amazon culture, but the ability to make mistakes or call them out is pretty consistently protected.
replies(4): >>41900811 #>>41901212 #>>41911207 #>>41914419 #
2. tgavVs ◴[] No.41900811[source]
> From what I've seen in Amazon it's pretty consistent that they do not blame the messenger which is what they consider the person who messed up

Interesting that my experience has been the exact opposite.

Whenever I’ve participated in COE discussions (incident analysis), questions have been focused on highlighting who made the mistake or who didn’t take the right precautions.

replies(5): >>41900843 #>>41900913 #>>41901176 #>>41901751 #>>41902023 #
3. dockerd ◴[] No.41900843[source]
That was never the idea of COE. You were probably in a bad org/team.
replies(1): >>41909859 #
4. grogenaut ◴[] No.41900913[source]
I've bar-raised a ton of them. You do end up figuring out which actions by which operator caused what issues or didn't work well, but that's to diagnose what controls/processes/tools/metrics were missing. I always removed the actual people's names as part of the bar raising, well before publishing, usually before any manager saw it. Instead I used Oncall 1, or Oncall for X team, Manager for X team. And that's mainly for the timeline.

As a sibling said, you were likely in a bad org, or one that was using COEs punitively.

replies(3): >>41901015 #>>41901855 #>>41909919 #
5. mlyle ◴[] No.41901015{3}[source]
In the article's case, though, there's evidence of actual malice: sabotaging only large jobs, over a month's time.
replies(1): >>41901174 #
6. fragmede ◴[] No.41901174{4}[source]
All I got from the linked article was

> TikTok owner, ByteDance, says it has sacked an intern for "maliciously interfering" with the training of one of its artificial intelligence (AI) models.

Are there other links with additional info?

replies(1): >>41901326 #
7. geon ◴[] No.41901176[source]
Isn't that a necessary step in figuring out the issue and how to prevent it?
8. evanextreme ◴[] No.41901212[source]
At least in my experience, this is also how Azure continues to function. It certainly reduces stress in the working environment.
9. mlyle ◴[] No.41901326{5}[source]
A lot of the original social media sources have been pulled, but this is what was alleged on social media:

https://juejin.cn/post/7426926600422637594

https://github.com/JusticeFighterDance/JusticeFighter110

https://x.com/0xKyon/status/1847529300163252474

replies(1): >>41901343 #
10. fragmede ◴[] No.41901343{6}[source]
Thanks. Google translate off the first link:

> He exploited the vulnerability of huggingface's load ckpt function to inject code, dynamically modifying other people's optimizer to randomly sleep for a short period of time, and modifying the direction of parameter shaving. He also added a condition that only tasks with more than 256 cards would trigger this condition.

Okay, yeah, that's malicious and totally a crime. "Modifying the direction of parameter shaving" means he subtly corrupted his co-workers' work. That's wild!
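
For context on the injection vector described above, here's a minimal, hypothetical sketch (not the actual code involved; the filename and Payload class are made up) of why pickle-based checkpoint loading such as torch.load can execute arbitrary code, and how a restricted load refuses to:

    import torch

    class Payload:
        # Stand-in for injected code: pickle calls the returned callable on load.
        def __reduce__(self):
            return (print, ("side effect executed while loading the checkpoint",))

    ckpt = {"model_state": {"w": torch.zeros(2, 2)}, "extra": Payload()}
    torch.save(ckpt, "ckpt.pt")

    # Unsafe: full pickle deserialization runs the payload during loading.
    torch.load("ckpt.pt", weights_only=False)

    # Safer: weights_only=True only allows tensors and plain containers,
    # so loading the Payload object fails instead of executing it.
    try:
        torch.load("ckpt.pt", weights_only=True)
    except Exception as e:
        print("rejected by weights_only load:", type(e).__name__)

The "more than 256 cards" trigger and the gradient tampering described in the post would just be extra logic inside whatever the injected code does; the point here is only that loading an untrusted checkpoint can run it at all.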

replies(2): >>41901370 #>>41911851 #
11. mlyle ◴[] No.41901370{7}[source]
Some of the sources say that he sat in the incident meetings during troubleshooting and adjusted his attacks to avoid detection, too.
replies(2): >>41904131 #>>41909548 #
12. sokoloff ◴[] No.41901751[source]
I’ve run the equivalent process at my company and I absolutely want us to figure out who took the triggering actions, what data/signals they were looking at, what exactly they did, etc.

If you don’t know what happened and can’t ask more details about it, how can you possibly reduce the likelihood (or impact) of it in the future?

Finding out in detail who did it does not require you to punish that person and having a track record of not punishing them helps you find out the details in future incidents.

replies(1): >>41901947 #
13. aitchnyu ◴[] No.41901855{3}[source]
Whats bar raising in this context?
replies(2): >>41902095 #>>41909849 #
14. ◴[] No.41901947{3}[source]
15. Cthulhu_ ◴[] No.41902023[source]
But when that person was identified, were they personally held responsible, bollocked, and reprimanded or were they involved in preventing the issue from happening again?

"No blame, but no mercy" is one of these adages; while you shouldn't blame individuals for something that is an organization-wide problem, you also shouldn't hold back in preventing it from happening again.

replies(1): >>41907147 #
16. bspammer ◴[] No.41902095{4}[source]
https://www.aboutamazon.co.uk/news/working-at-amazon/what-is...
17. justinclift ◴[] No.41904131{8}[source]
Wonder what the underlying motive was? Seems like a super weird thing to do.
replies(1): >>41910140 #
18. grogenaut ◴[] No.41907147{3}[source]
Usually helping prevent the issue, and training. Almost everyone I've ever seen cause an outage is so "oh shit oh shit oh shit" that a reprimand is worthless. I've spent more time a) talking them through what they could have done better and encouraging them to escalate quicker, and b) assuaging their fears that it was all their fault and they'll be blamed/fired: "I just want you to know we don't consider this your fault. It was not your fault. Many, many people made poor risk tradeoffs for us to get to the point where you making X trivial change caused the internet to go down."

In some cases, like interns, we probably just took their commit access away or blocked their direct push access. Nowadays interns can't touch critical systems and can't push code directly to prod packages.

19. NetOpWibby ◴[] No.41909548{8}[source]
LMAO that's just diabolical. Wonder what motivated them.
20. kelnos ◴[] No.41909849{4}[source]
Usually I hear it in the context of a person outside the team added to an interview panel, to help ensure that the hiring team is adhering to company-wide hiring standards, not the team's own standards, where they may differ.

But in this case I'm guessing their incident analysis teams also get an unrelated person added to them, in order to have an outside perspective? Seems confusing to overload the term like that, if that's the case.

replies(1): >>41910088 #
21. kelnos ◴[] No.41909859{3}[source]
Or maybe you were in an unusually good team?

I always chuckle a little when the response to "I had a bad experience" is "I didn't, so you must be an outlier".

replies(1): >>41909958 #
22. donavanm ◴[] No.41909919{3}[source]
As I recall, the COE tool's "automated reviewer" checks cover this. It should flag any content that looks like a person's name (or a customer name) before the author submits it.
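
As a rough illustration (purely hypothetical; the internal tool isn't public), a pre-submit check of that kind only needs to scan a draft for known person or customer names and tell the author to swap them for role labels:

    import re

    # Hypothetical name list; a real tool would pull from a directory or CRM.
    KNOWN_NAMES = {"alice", "bob"}

    def flag_possible_names(text: str) -> list[str]:
        # Flag capitalized tokens that match known person/customer names.
        return [w for w in re.findall(r"[A-Z][a-z]+", text) if w.lower() in KNOWN_NAMES]

    draft = "Alice restarted the fleet before Bob paged the Oncall for the storage team."
    print(flag_possible_names(draft))  # ['Alice', 'Bob'] -> replace with role labels
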
23. donavanm ◴[] No.41909958{4}[source]
No. The majority of teams and individuals are using it as intended: to understand and prevent future issues from process and tool defects. The complaints I've heard are usually correlated with other indicators of a "bad"/punitive team culture, a lower-level IC not understanding the process or intent, or shades of opinion like "it's a lot of work and I don't see the benefit, ergo it's malicious or naive."

I worked at AWS for 13 years, was briefly in the reliability org that owns the COE (post-incident analysis) tooling, and spent a lot of time on "ops" for about 5 years.

24. grogenaut ◴[] No.41910088{5}[source]
They are the same role, different specialties. Like saying SDE for ML, or for Distributed Systems, or for Clients.

You can usually guess from context, but what you say is "we need a bar raiser for this hiring loop" or "get a bar raiser for this COE" or "get a bar raiser for the UI"; there are qualified bar raisers for each setting.

25. tyingq ◴[] No.41910140{9}[source]
Could be just so his work looked better in comparison. Or something more sinister, like being paid to slow progress.
26. Twirrim ◴[] No.41911207[source]
> when one person took down S3 with a manual command overriding the safeguards

It didn't override safeguards, but they sure wanted you to think that something unusual was done as part of the incident. What they executed was a standard operational command. The problem was, the components that that command interacted with had been creaking at the edges for years by that point. It was literally a case of "when", and not "if". All that happened was the command tipped it over the edge in combination with everything else happening as part of normal operational state.

Engineering leadership had repeatedly raised the risk further up the chain, and no one was willing to put headcount toward actually mitigating the problem. If blame was to be applied anywhere, it wasn't on the engineer following the runbook that gave them a standard operational command to execute with standard values. They did exactly what they were supposed to.

Some credit where it's due: my understanding from folks I knew still in that space is that S3 leadership started turning things around after that incident and started taking these risks and the operational state seriously.

27. yorwba ◴[] No.41911851{7}[source]
"parameter shaving" (参数剃度) is, by the way, a typo for "parameter gradient" (参数梯度), 梯度 being the gradient and 剃度 being a tonsure.
28. DrillShopper ◴[] No.41914419[source]
It's a shame that they're so bad at (physically) delivering their products these days.