(www.sportingnews.com)

softwaredoug ◴[16 Nov 24 17:53 UTC] No.42157774[source]▶

The way to deal with this is to constantly do live events, and actually build organizational muscle. Not these massive one off events in an area the tech team has no experience in.

replies(9): >>42158542 #>>42158774 #>>42158782 #>>42158854 #>>42158930 #>>42159942 #>>42160430 #>>42160978 #>>42168444 #

mbrumlow ◴[16 Nov 24 19:56 UTC] No.42158774[source]▶

>>42157774 #

I have this argument a lot in tech.

We should always be doing (the thing we want to do)

Some examples that always get me in trouble (or at least into big heated conversations):

1. Always be building: It does not matter if the code was not changed, or there have been no PRs or whatever, build it. Something in your org or infra has likely changed. My argument is "I would rather have a build failure on software that is already released than on software I need to release".

2. Always be releasing: As before, it does not matter if nothing changed, push out a release. Stress the system and make it go through the motions. I can't tell you how many times I have seen things fail to deploy simply because nobody had attempted a deploy in a long time.

There are more; I just don't have time to go into them. The point is "if you did it, and will ever need to do it again in the future, then you need to do it continuously".
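A minimal sketch of the "always be building" rule, assuming a scheduler that asks a hypothetical `build_is_due` helper whether to kick off a build. The one-day `max_age` default is an arbitrary illustration, not a value from the comment:

```python
from datetime import datetime, timedelta


def build_is_due(
    last_build: datetime,
    now: datetime,
    source_changed: bool,
    max_age: timedelta = timedelta(days=1),  # hypothetical threshold
) -> bool:
    """Return True when a build should run.

    A source change always triggers a build. With no change, a build
    still runs once the last one is older than max_age, so that drift in
    the surrounding org or infra shows up on already-released software.
    """
    if source_changed:
        return True
    return now - last_build >= max_age


# No commits for two days: build anyway.
print(build_is_due(datetime(2024, 11, 1), datetime(2024, 11, 3), False))
```

The same check could gate a no-op release, subject to the side effects of shipping that the replies raise.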

replies(6): >>42158807 #>>42158896 #>>42159793 #>>42159969 #>>42161140 #>>42161623 #

skybrian ◴[16 Nov 24 22:08 UTC] No.42159793[source]▶

>>42158774 #

Doing dry runs regularly makes sense, but whether actually shipping it makes sense seems context-dependent. It depends on how much you can minimize the side effects of shipping a release.

Consider publishing a new version of a library: you'd be bumping the version number all the time and invalidating caches, causing downstream rebuilds, for little reason. Or if clients are lazy about updating, any two clients would be unlikely to have the same version.

Or consider the case when shipping results in a software update: millions of customer client boxes wasting bandwidth downloading new releases and restarting for no reason.

Even for a web app, you are probably invalidating caches, resulting in slow page loads.

With enough work, you could probably minimize these side effects, so that releasing a new version that doesn't actually change anything is a non-event. But if you don't invalidate the caches, you're not really doing a full rebuild.

So it seems like there's a tension between doing more end-to-end testing and performance? Implementing a bunch of cache levels and then not using them seems counterproductive.

replies(4): >>42159844 #>>42160284 #>>42161483 #>>42161780 #

1. lxgr ◴[16 Nov 24 22:14 UTC] No.42159844[source]▶

>>42159793 #

It's very hard to do a representative dry run when the most likely potential points of failure are highly load-dependent.

You can try and predict everything that'll happen in production, but if you have nothing to extrapolate from, e.g. because this is your very first large live event, the chances of getting that right are almost zero.

And you can't easily import that knowledge either, because your system might have very different points of failure than the ones external experts might be used to.

replies(2): >>42159985 #>>42160723 #

2. ◴[16 Nov 24 22:32 UTC] No.42159985[source]▶

>>42159844 (TP) #

3. leptons ◴[17 Nov 24 00:04 UTC] No.42160723[source]▶

>>42159844 (TP) #

They could have done a dry run. They could have spun up a million virtual machines somewhere, and tested their video delivery for 30 minutes. Even my small team spins up 10,000 EC2 instances on the regular. Netflix has the money to do much more. I'm sure there are a dozen ways they could have stress-tested this beforehand. It's not like someone sprang this on them last week and they had to scramble to put together a system to do it.

replies(4): >>42160912 #>>42161245 #>>42164365 #>>42164503 #

4. mewpmewp2 ◴[17 Nov 24 00:37 UTC] No.42160912[source]▶

>>42160723 #

Maybe they did. We don't know that they did not. But the problem is that real-world traffic will always be different, varied, and dynamic in many unexpected ways, and a single link under unusual stress can cause a ripple effect.

5. lxgr ◴[17 Nov 24 01:36 UTC] No.42161245[source]▶

>>42160723 #

How representative is an EC2 instance in a datacenter simulating user behavior really, though?

These would likely have completely different network connectivity and usage patterns, especially if they don't have historical data distributions to draw from because this was their first big live event.

replies(1): >>42161881 #

6. leptons ◴[17 Nov 24 04:03 UTC] No.42161881{3}[source]▶

>>42161245 #

>How representative is an EC2 instance in a datacenter simulating user behavior really, though?

Systemic issues causing widespread buffering aren't "user behavior". It's a problem with how Netflix is trying to distribute video. Sure, some connections aren't up to the task, and that isn't something Netflix can really control unless they are looking to improve how their player falls back to lower bitrate video, which could also be tested.

>because this was their first big live event.

That's the point of testing. They should have already had a "big live event" that nobody paid for during automated testing. Instead they seem to have trusted that their very smart and very highly paid developers wouldn't embarrass them based on nothing more than expectations, but they failed. They could have done more rigorous "live" testing before rolling this out to the public.

7. throwaway2037 ◴[17 Nov 24 14:31 UTC] No.42164365[source]▶

>>42160723 #

    > Even my small team spins up 10,000 EC2 instances on the regular.

Woah, this sounds very cool. Can you share more details?

replies(1): >>42166996 #

8. vel0city ◴[17 Nov 24 14:57 UTC] No.42164503[source]▶

>>42160723 #

1) You don't know if they did or did not do this kind of testing. I don't see any proof either way here. You're assuming they didn't.

2) You're assuming whatever issue happened would have been caught by testing on generic EC2 instances in AWS. In the end these streams were going to users on tons of different platforms in lots of different network environments, most of which look nothing like an EC2 instance. Maybe there was something weird with their networking stack on TCL Roku TVs that ended up making network connections reset rapidly, chewing up a lot of network resources, which led to other issues. What's the EC2 instance type API name for a 55" TCL Roku TV from six years ago on a congested 2.4GHz Wireless N link?

I don't know what happened in their errors. I do know I don't have enough information to say what tests they did or did not run.

9. leptons ◴[17 Nov 24 20:36 UTC] No.42166996{3}[source]▶

>>42164365 #

I manage ~3000 customized websites based on the same template code. Sometimes we make changes to the template code that could affect the customizations. It is practically impossible to predict what might cause a problem, due to the nature of the customizations. We'll take before and after screenshots of every page on every site, so it can get into the hundreds of thousands of screenshots. We'll then run a diff on the screenshots to see what changed, reviewing the screenshots with the most significant changes. Then we'll address the problems we find and deploy the fixed release.
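The diff-and-rank step might look roughly like the sketch below. It assumes each screenshot has already been decoded to raw pixel bytes of a fixed size; `changed_fraction`, `rank_by_change`, and the 1% `threshold` are hypothetical names and values for illustration, not the tooling described in the comment:

```python
def changed_fraction(before: bytes, after: bytes) -> float:
    """Fraction of byte positions that differ between two raw images."""
    if len(before) != len(after):
        return 1.0  # a size change is treated as a full change
    if not before:
        return 0.0
    differing = sum(a != b for a, b in zip(before, after))
    return differing / len(before)


def rank_by_change(
    pairs: dict[str, tuple[bytes, bytes]],
    threshold: float = 0.01,  # hypothetical cutoff for "worth reviewing"
) -> list[tuple[str, float]]:
    """Return (page, score) for pages at or above threshold, largest first."""
    scored = [
        (page, changed_fraction(before, after))
        for page, (before, after) in pairs.items()
    ]
    flagged = [(page, score) for page, score in scored if score >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Tiny 4-byte "images" stand in for real screenshots.
pairs = {
    "site-a/home": (bytes([0, 0, 0, 0]), bytes([0, 0, 0, 0])),
    "site-b/home": (bytes([0, 0, 0, 0]), bytes([1, 1, 0, 0])),
    "site-c/home": (bytes([0, 0, 0, 0]), bytes([1, 0, 0, 0])),
}
for page, score in rank_by_change(pairs):
    print(page, score)
```

Sorting by score puts the pages with the largest visual change at the top of the review queue, and unchanged pages drop out entirely.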

When we do these large screenshot operations, the EC2 instances are running for maybe 15 or 20 minutes total. It's not exactly cheap, but losing clients because we broke their site is something we want to avoid. The sites are hosted on a 3rd party service, and we're rate-limited by IP address, so to get this done in a reasonable amount of time we need to spin up 10,000 EC2 instances to distribute the work. We have our own software to manage the EC2 instances. It's honestly pretty simple, but effective.
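As a back-of-the-envelope sketch of why a per-IP rate limit pushes the instance count so high: the number of workers is the total request count divided by what one IP can fetch before the deadline. The helper name `workers_needed` and all figures in the example (300,000 screenshots, 2 requests per minute per IP, a 15-minute window) are assumptions chosen for illustration; the comment does not state the actual rate limit or totals:

```python
import math


def workers_needed(
    total_requests: int,
    per_ip_rate_per_min: float,
    deadline_min: float,
) -> int:
    """Workers (one IP each) needed to finish all requests by the deadline."""
    if per_ip_rate_per_min <= 0 or deadline_min <= 0:
        raise ValueError("rate and deadline must be positive")
    per_worker_capacity = per_ip_rate_per_min * deadline_min
    return math.ceil(total_requests / per_worker_capacity)


# Hypothetical inputs: 300,000 screenshots, 2 requests/min per IP, 15 minutes.
print(workers_needed(300_000, 2, 15))
```

Halving the deadline or the allowed rate doubles the worker count, which is why short, wide bursts of instances fit this kind of job.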

Netflix buffering issues: Boxing fans complain about Jake Paul vs. Mike Tyson