343 points by cvallejo | 42 comments
1. mdasen ◴[] No.22358369[source]
1. mdasen ◴[] No.22358369[source]
Since people from Google Cloud are likely here, one thing I'd like to ask/talk about: are we getting too many options for compute? One of the great things about Google Cloud was that it was very easy to order. None of this "t2.large" where you'd have to look up how much memory and CPU it has and potentially how many credits you're going to get per hour and such. I think Google Cloud is still easier, but it's getting harder to know what the right direction is.

For example, the N2D instances are basically the same price as the N1 instances, or even cheaper with committed-use discounts. Given that they provide 39% more performance, should the N1 instances be considered obsolete once the N2D exits beta? I know that there could be workloads that would do better on Intel than AMD, but it seems like there would be little reason to get an N1 instance once the N2D exits beta.

Likewise, the N2D has basically the same sustained-use price as the E2 instances (which only have the performance of N1 instances). What's the point of E2 instances if they're the same price? Shouldn't I be getting a discount given that Google can more efficiently use the resources?

It's great to see the improvements at Google Cloud. I'm glad to see lower-cost, high-performance options available. However, I guess I'm left wondering who is choosing what. I look at the pricing and think, "who would choose an N1 or N2 given the N2D?" Sure, there are people with specific requirements, but it seems like the N2D should be the default in my mind.

This might sound a bit like complaining, but I do love how I can just look up memory and CPU pricing easily. Rather than having to remember name-mappings, I just choose from one of the families (N1, N2, E2, N2D) and can look at the memory and CPU pricing. It makes it really simple to understand what you're paying. It's just that as more families get added and Google varies how it applies sustained-use and committed-use discounts between the families, it becomes more difficult to choose between them.

For example, if I'm going for a 1-year commitment, should I go with an E2 at $10.03/vCPU or an N2D at $12.65/vCPU? The N2D should provide more performance than the 26% price increase, yes? Why can't I get an EPYC-based E-series to really drive down costs?
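
Spelled out as rough back-of-the-envelope math (a sketch only: the per-vCPU prices are the committed-use figures above, and the 39% uplift is the headline N2D-vs-N1 number, with E2 assumed to perform like N1):

    # Rough 1-year committed-use comparison. Prices are the per-vCPU figures
    # quoted above; the performance uplift is the headline N2D-vs-N1 claim,
    # with E2 assumed to perform like N1. Illustrative only.
    e2_per_vcpu = 10.03      # $/vCPU/month, 1-year commitment
    n2d_per_vcpu = 12.65     # $/vCPU/month, 1-year commitment
    n2d_perf_uplift = 1.39   # ~39% more performance than N1/E2-class cores

    price_ratio = n2d_per_vcpu / e2_per_vcpu               # ~1.26 -> 26% more expensive
    perf_per_dollar_gain = n2d_perf_uplift / price_ratio   # >1.0 means N2D wins on price/perf

    print(f"N2D costs {price_ratio - 1:.0%} more than E2; "
          f"price/performance advantage: {perf_per_dollar_gain:.2f}x")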

Again, I want to reiterate that Google Cloud's simpler pricing is great, but complications have crept in. E2 machines don't get sustained-use discounts, which means they're really only valuable if you're doing a yearly commitment or non-sustained use. The only time N1 machines are cheaper is in sustained use - they're the same price as Intel N2 machines if you're doing a yearly commitment or non-sustained use. Without more guidance on performance differences between the N2D and N2, why should I ever use N2? I guess this is a bit of rambling to say, "keep an eye on pricing complexity - I don't like spending a lot of time thinking about optimizing costs".

replies(11): >>22358433 #>>22358442 #>>22358483 #>>22358724 #>>22358783 #>>22358816 #>>22358852 #>>22359250 #>>22359298 #>>22360053 #>>22360348 #
2. lallysingh ◴[] No.22358433[source]
They still own and have to pay for the old hardware.

Customers rarely have the time/energy/expertise to continuously reoptimize their cloud usage.

3. znpy ◴[] No.22358442[source]
tl;dr: I find the breadth of Google Cloud's offering confusing, so I think there should be less of it.
4. scardycat ◴[] No.22358483[source]
Customers like having choices. Enterprises typically will "certify" one config and would like to stay on that till they absolutely need to move to something else.
replies(1): >>22359295 #
5. theevilsharpie ◴[] No.22358724[source]
> For example, the N2D instances are basically the price of the N1 instances or even cheaper with committed-use discounts. Given that they provide 39% more performance, should the N1 instances be considered obsolete once the N2D exits beta?

As the name implies, N2 is a newer generation than N1. I don't think Google has announced any official N1 deprecation timeline, but that product line clearly has an expiration date.

The more direct comparison would be Intel's N2 instances vs. AMD's N2D instances. In that case, N2 instances are likely faster on a per-core basis and support some Intel-specific instructions, whereas N2D instances are substantially less expensive.

> Again, I want to reiterate that Google Cloud's simpler pricing is great, but complications have crept in.

That seems like an unavoidable consequence of maturing as a product offering: more options means more complexity. If Google tried to streamline everything and removed options to keep things simple, they'd have another cohort of users (including myself) screaming that the product doesn't meet their needs.

I suppose a "Help Me Choose" wizard that provides some opinionated guidance could be helpful for onboarding new users, but otherwise, I don't see how Google can win here.

replies(2): >>22358792 #>>22359535 #
6. ◴[] No.22358783[source]
7. wmf ◴[] No.22358792[source]
They should just hide everything besides Rome under an "advanced" UI. ;-)
8. kccqzy ◴[] No.22358816[source]
I think for enterprise businesses, people just love choices. I don't know about GCP, but I do know about highly paid AWS consultants producing detailed comparisons between instance types and making recommendations for companies to "save money." Or maybe some people just like the thrill of using spreadsheets and navigating the puzzle of pricing.
replies(1): >>22359338 #
9. rb808 ◴[] No.22358852[source]
Cloud is a poor metaphor now. It's really a messy bunch of constellations where some people think they can see a pretty picture, but most people just see random dots.
10. outworlder ◴[] No.22359250[source]
> are we getting too many options for compute

As compared to what, Azure? :)

11. adamc ◴[] No.22359295[source]
That reflects the lumbering, bureaucratic nature of enterprises.
replies(4): >>22359401 #>>22359583 #>>22361276 #>>22363929 #
12. TuringNYC ◴[] No.22359298[source]
Different chipsets may have slightly different capabilities. For example, I’ve been using NVIDIA RAPIDS recently. Not all NVIDIA cards support this particular framework’s needs. Sometimes you need to specifically direct customer installations to a specific type of card or chipset.
13. TuringNYC ◴[] No.22359338[source]
Or they prefer the certainty of being able to test software on specific hardware setups and give customers higher levels of confidence.
14. scardycat ◴[] No.22359401{3}[source]
Sure, and I'm sure they have something to say about how smaller internet companies move fast and break things with no concern for how it affects customers. I'm sure you would rather have your bank just work than have some nifty, insufficiently tested tool that flakes out every other day.

To each their own; I am just stating that there is a need.

15. mdasen ◴[] No.22359535[source]
> If Google tried to streamline everything...they'd have another cohort of users screaming that the product doesn't meet their needs

Except that they could simplify it without reducing flexibility.

For example, the difference between E-series and N-series is that E-series instances have the whole balancing thing. Instead of being a different instance type, it could be simplified into an option on available types and it would just give you a discount.

Likewise, some of it is about consistency. How much of a discount should sustained use give you? 20%? 30%? 0%? There seems to be little difference to Google, in terms of their costs and planning, whether a sustained-use workload runs on an E2, N2, N2D, or N1, and yet the discount varies a lot.

It's not about fewer choices. It's more that the choices aren't internally consistent. N2 instances are supposed to be simply superior to N1 instances, but N1 instances cost the same as N2 instances for a 1-year contract, a 3-year contract, and on-demand; the N2s are only more expensive under sustained use, which seems odd. Likewise, E2 instances are meant to give you a discount, and they do give you a discount for 3 out of the 4 usage scenarios. The point is that there's no real reason for the pricing not to be consistent across the 4 usage scenarios (1-year, 3-year, on-demand, and sustained-use). That's where the complexity creeps in.

It's really easy to look and say, "ok, I have E2, N2D, and N2 instances in ascending price order and I can choose what I want." Except that the pricing doesn't work out consistently.

> N2 instances are likely faster on a per-core basis

Are they meant to be? Google's announcement makes it seem like they should be equivalent: "N2D instances provide savings of up to 13% over comparable N-series instances".

--

The point I'm trying to make isn't that they shouldn't offer choice. It's that the choice should be consistent to be easily understandable. E2 instances should offer a consistent discount. If N2 machines are the same price as N1 machines across 3 usage scenarios, they should be the same price across all 4. When you browse the pricing page, you can get into situations where you start thinking, "ok, the N1 instances are cheaper so do I need the improvements of the N2?" And then you start looking and you're like, "wait, the N2s are the same price....oh, just the same price most of the time." Then you start thinking, "I can certainly deal with the E2's balancing...oh, but it's the same price...well, it's cheaper except for sustained-use".

There doesn't seem to be a reason why sustained-use on N1s should be cheaper for Google than sustained-use on N2s. There doesn't seem to be a reason why sustained-use on E2s offers no discount - especially given that the 1-year E2 price offers the same 37% discount that the N1s offer.

It would be nice to go to the page and say, "I'm going with E2s." Instead, I go to the page and it's more like, "I'm going with E2s when I'm doing a 1-year commitment, but I'm going with N2Ds when I'm doing sustained-use without a commitment, since those are the same price for better hardware for no apparent reason, and the N1s are just equal or more expensive, so why don't they just move them to a 'legacy machine types' page?" It's the inconsistency in the pricing, for seemingly no reason, that makes it tough, not the options. The fact that N2Ds are the same monthly price as E2s for sustained use, but E2s are significantly cheaper in all other scenarios, is the type of complexity that's the annoying bit.

EDIT: As an example, E2 instances are 20.7% cheaper on-demand, 20.7% cheaper with 1-year commitment, and 20.7% cheaper with 3-year commitment compared to N2D instances. That's wonderful consistency. Then we look at sustained use and it's 0.9% cheaper with no real explanation why. It's a weird pricing artifact that means that you aren't choosing, "this is the correct machine for the price/performance balance I'm looking for" but rather you're balancing three things: price, performance, and billing scenario.
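
To make that flip concrete, here's a tiny sketch. The dollar figures below are placeholders, not real GCE prices; only the structure - E2's ~20% discount holds for on-demand and commitments but largely vanishes for sustained use - comes from the comparison above:

    # Placeholder prices ($/vCPU/month) illustrating the structure described
    # above: E2 is ~20% cheaper than N2D everywhere except sustained use,
    # where N2D's sustained-use discount erases E2's advantage. Numbers are
    # made up for illustration.
    prices = {
        "on-demand":     {"E2": 16.0, "N2D": 20.0},
        "sustained-use": {"E2": 16.0, "N2D": 16.1},   # N2D gets a SUD, E2 does not
        "1-year commit": {"E2": 10.0, "N2D": 12.6},
        "3-year commit": {"E2":  7.2, "N2D":  9.0},
    }

    for scenario, p in prices.items():
        saving = 1 - p["E2"] / p["N2D"]
        print(f"{scenario:>14}: E2 ${p['E2']:.2f} vs N2D ${p['N2D']:.2f} "
              f"(E2 {saving:.1%} cheaper)")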

replies(1): >>22361209 #
16. gabrielfv ◴[] No.22359583{3}[source]
When a company has a problem, it sizes up how valuable it is to tackle, sets a budget, weighs options, raises possible trade-offs, and eventually either deals with it or gets bitten by a lack of future-proofing or by not evaluating the environment/requirements properly; rinse and repeat. Once you've got several problems that require this kind of approach, carefully handpicking the possibly-but-not-proven best solution is not only time-consuming but can have awfully impactful consequences. When a decision to switch cloud provider services might take you offline for half an hour and cost a million dollars in revenue, that's when we're talking enterprise.
replies(1): >>22376347 #
17. 013a ◴[] No.22360053[source]
Realistically, a typical hyperscale cloud provider has tens or hundreds of millions of dollars invested in a specific CPU platform. It makes very little sense to just throw it out chasing some idealism like "simplicity"; the world is not simple.

You can be like DigitalOcean and just say "You want a CPU core, you get a CPU core, no guarantee what it'll be". Most enterprises won't buy this. But I think there are some interesting use cases where even a hyperscale provider targeting enterprises could (and does) use this: not for an EC2-like product, but as the infrastructure for something like Lambda, or to run the massive number of internal workloads needed to power highly managed cloud offerings.

18. boulos ◴[] No.22360348[source]
Disclosure: I work on Google Cloud (and really care about this).

The challenge here is balancing diverse customer workloads against the processor vendors. Historically, at Google, we just bought a single server variant (basically) because almost all code is expected to care primarily about scale-out environments. That made the GCE decision simple: offer the same hardware we build for Google, at great prices.

The problem is that many customers have workloads and applications that they can’t just change. No amount of rational discounting or incentives makes a 2 GHz processor compete with a 4 GHz processor (so now, for GCE, we buy some speedy cores and call that Compute Optimized). Even more strongly, no amount of “you’re doing it wrong” is actually the right answer for “I have a database on-prem that needs several sockets and several TB of memory” (so, Memory Optimized).

There’s an important reason though that we refer to N1, N2, N2D, and E2 as “General purpose”: we think they’re a good balanced configuration, and they’ll continue to be the right default choice (and we default to these in the console). E2 is more like what we do internally at Google, by abstracting away processor choice, and so on. As a nit to your statement above, E2 does flip between Intel and AMD.

You should choose the right thing for your workloads, primarily subject to the Regions you need them in. We’ll keep trying to push for simplicity in our API and offering, but customers really do have a wide range of needs, which imposes at least some minimum amount of complexity. For too long (probably) we tried to refuse that complexity, both for us and for customers. Feel free to ignore it though!

replies(5): >>22360916 #>>22361132 #>>22361552 #>>22363902 #>>22365298 #
19. milesward ◴[] No.22360916[source]
We need some kinda shortcut: like, run your app for a few days on an instance, we chew your stackdriver metrics, we make a new shortcut n3-mybestinstance, which picks the right shape/processor family etc for yah.
replies(1): >>22360973 #
20. elithrar ◴[] No.22360973{3}[source]
As a Googler: take VM rightsizing recommendations - "save $X because you're underutilizing this machine shape" - and extend them to encompass this by including VM-family swaps based on underlying VM metrics? :)
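
Purely as a hypothetical sketch of what such a recommender could look like (the metric fields, thresholds, and recommend_family helper are invented for illustration, not anything GCP actually exposes):

    # Hypothetical sketch of a family-swap recommender driven by utilization
    # metrics. The fields and thresholds are invented for illustration.
    from dataclasses import dataclass

    @dataclass
    class VmMetrics:
        avg_cpu_util: float      # 0.0 - 1.0, averaged over the observation window
        cpu_burstiness: float    # ratio of p99 CPU utilization to the average
        hours_per_month: float   # how long the VM actually runs each month

    def recommend_family(m: VmMetrics) -> str:
        if m.avg_cpu_util < 0.3 and m.cpu_burstiness > 3:
            return "E2"   # mostly idle with occasional bursts: share cores, take the discount
        if m.hours_per_month > 600:
            return "N2D"  # runs essentially all month: price/performance dominates
        return "N2"       # steady, latency-sensitive, or needs Intel-specific instructions

    print(recommend_family(VmMetrics(avg_cpu_util=0.15, cpu_burstiness=5, hours_per_month=720)))
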
21. alfalfasprout ◴[] No.22361132[source]
I mean, this mentality is often wrong. Scaling out actually isn't the right solution for everyone. It works for Google given that they primarily offer web services. It does not work for workloads that rely heavily on the CPU (think financial workloads, ML, HPC/scientific workloads) or have real-time requirements. In fact, for many ETL workloads vertical scaling proves far more efficient.

It's long been the "Google way" to try to abstract out compute, but it's led to an industry full of people trying to follow in their footsteps and overcomplicating what can be solved on one or two machines.

replies(3): >>22361446 #>>22363919 #>>22364549 #
22. ◴[] No.22361209{3}[source]
23. bcrosby95 ◴[] No.22361276{3}[source]
Lumbering enterprises that spend billions of dollars on IT.
24. erulabs ◴[] No.22361446{3}[source]
Except, almost without exception, eventually the one or two machines will fall over. Ideally you can engineer your way around this ahead of time - but not always. Fundamentally relying on a few specific things (or people) will always be an existential risk to a big firm. Absolutely agree re: start small - but the problem with “scale out” is a lack of good tooling - not a fundamental philosophical one.
replies(5): >>22361542 #>>22362219 #>>22362898 #>>22363481 #>>22363618 #
25. aidenn0 ◴[] No.22361542{4}[source]
Plenty of services can deal with X hours of downtime when a single machine fails for values of X that are longer than it takes to restore to a new machine from backups.
replies(2): >>22362715 #>>22364102 #
26. mdasen ◴[] No.22361552[source]
This makes a lot of sense, but it doesn't explain why the pricing isn't consistent. Why is an N1 the same price as an N2, except for sustained-use? Why is an E2 cheaper than an N1/N2D, except for sustained-use?

E2 is just such an amazing idea that feels like it's going to be under-utilized because it isn't cheaper for the sustained-use case. There doesn't seem to be any reason why E2 would be more expensive (to Google) for sustained-use and not for on-demand or committed.

Google Cloud is really nice, but the inconsistent pricing/discounting between the different types seems odd. Like, I'm running something on N1 right now with sustained-use because there's no incentive for me to switch to E2. It feels a bit wasteful since it doesn't get a lot of traffic and would be the perfect VM to steal resources from. However, I'd only get a discount if I did a 1-year commitment. For Google, I'm tying up resources you could put to better use. E2 instances are usually 30% cheaper which would give me a nice incentive to switch to them, but without the sustained-use discount, N2D and N1 instances become the same price. So, I end up tying up hardware that could be used more efficiently.

replies(1): >>22362316 #
27. alfalfasprout ◴[] No.22362219{4}[source]
It is a philosophical one when you design around scaling out at a high rate. You incur significant additional complexity in many cases along with increased overhead.

It's fallacious to think that relying on "n" things is strictly safer than "3" things where n is large. That's not quite true due to the significant complexity increases when dealing with large "n" and accompanying overhead.

For web applications (which I suspect the majority of HN readers work on), sure, but plenty of real-time or safety-critical applications are perfectly OK with three-way redundancy.

replies(1): >>22364859 #
28. throwaway2048 ◴[] No.22362316{3}[source]
Pricing confusion is a cornerstone of big single-shop vendors: the more confusing you make pricing, the greater the chance a customer will spend more than they otherwise might.

Also opens avenues for highly paid consultants to dip their beak and promote your products.

29. chucky_z ◴[] No.22362715{5}[source]
I'd like to add to this and say that a server being down for 6 hours is so worth it if, over the life of its uptime (months? years?), it saves an uncountable number of hours on computation and complexity.

Heck, even a machine like that being down for a week is usually still worth it.

30. PudgePacket ◴[] No.22362898{4}[source]
I've heard lots of anecdotes from big sites doing fine with a small number of machines.

e.g. Stack Overflow only has one active DB server and one backup.

https://stackexchange.com/performance

replies(1): >>22363499 #
31. lmeyerov ◴[] No.22363481{4}[source]
Totally. We could replace our GPU stack with who knows how many CPUs to hit the same 20ms SLAs, and we'll just pretend data transfer overhead doesn't exist ;-)

More seriously, we're adding multi-node stuff for isolation and multi-GPU for performance. Both are quite different... and useful!

32. endymi0n ◴[] No.22363499{5}[source]
Actually, this makes a lot of sense. Reasoning about a single machine is just way simpler and keeps the full power of a modern transactional database at your fingertips. Backups keep taking longer and disaster recovery isn't as speedy anymore, but we're running on Postgres internally as well and I'd scale that server as big as possible (even slightly beyond linear cost growth, which is pretty huge these days) before even thinking about alternatives.
33. mrich ◴[] No.22363618{4}[source]
The solution often is to have a warm standby that can take over immediately. You do not get any distributed overhead that is present in a fully load-balanced system during normal operation, and only pay a small amount in the very exceptional failure case.
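
As a toy sketch of the warm-standby idea (the health-check URL, failure threshold, and promote_standby placeholder below are all hypothetical; a real setup would use the database's own replication and promotion tooling):

    # Toy sketch of warm-standby failover (not production code): poll the
    # primary's health endpoint and promote the standby only when the primary
    # stops answering for several checks in a row.
    import time
    import urllib.request

    PRIMARY_HEALTH_URL = "http://primary.internal:8080/health"  # hypothetical endpoint
    FAILURES_BEFORE_PROMOTION = 3

    def primary_is_healthy() -> bool:
        try:
            with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    def promote_standby() -> None:
        # Placeholder: in a real setup this would e.g. promote the replica and
        # repoint clients (DNS, virtual IP, connection string, ...).
        print("Promoting warm standby to primary")

    failures = 0
    while True:
        failures = 0 if primary_is_healthy() else failures + 1
        if failures >= FAILURES_BEFORE_PROMOTION:
            promote_standby()
            break
        time.sleep(5)
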
34. cm2187 ◴[] No.22363902[source]
I presume the downside of a fragmented offering is that it is easier to run out of stock of a particular type of configuration. Does that happen much on any of the major clouds? I.e., if we script the provisioning of a new VM, is it something to watch for, or is it a really rare event?
35. cm2187 ◴[] No.22363919{3}[source]
And even when it could make sense, the cost of redeveloping a system, validating its logic, and ensuring it is production-ready is often prohibitive compared to the cost of a bigger server (and a risk a business may not be willing to take).
36. cm2187 ◴[] No.22363929{3}[source]
Until you are at an airline counter desk not able to board your flight home because of an IT meltdown.

Enterprises have these processes not just because they like bureaucracy (though they often do like bureaucracy too).

37. darkwater ◴[] No.22364102{5}[source]
I do not agree, and it is not my experience. Mind you, I've always worked in small/mid-sized businesses (50-300 employees), and basically every service has someone needing it for their daily work. Sure, they may live without it for some time, but you will make their lives more miserable.

And anyway, if you already have everything in place to completely rebuild every SPOF machine from scratch in a few hours, go the extra mile and make it an active/passive cluster, even manually switched, and make the downtime a matter of minutes.

replies(1): >>22376467 #
38. nemothekid ◴[] No.22364549{3}[source]
Google doesn’t have ML workloads (basically all of search) or real-time requirements (basically all of RTB)?

I agree not everyone can develop like Google, but it’s wrong to say that “it doesn’t work”.

39. Bad_CRC ◴[] No.22364859{5}[source]
I work a lot with VoIP systems and it's much, much easier to have one big machine than trying to make it work distributed...
40. merb ◴[] No.22365298[source]
Well, the biggest problem is that a committed-use discount is not transferable to the cheaper E2 choices. BTW, when we committed our usage to N1, E2 was not available.
41. adamc ◴[] No.22376347{4}[source]
I work in a big lumbering organization. Sure, there is some truth to that. But there is also just a lot of suboptimal decision making, because big organizations have complicated politics and policies, and it is often easier to keep coasting than do something better.
42. aidenn0 ◴[] No.22376467{6}[source]
A small amount of work over a long period of time (i.e. setting up a redundant system) may be worse than losing a large amount of work in a short period of time.

Single machines just don't fail that often. I managed a database server for an internal tool and the machine failed once in about 10 years. It was commodity hardware, so I just restored the backups to a spare workstation and it was back up in less than 2 hours. 15 people used this service and they could get some work done without it, so there was less than 30 person-hours of productivity lost. If I had spent 30 hours getting failover &c. working for this system over a 10-year period, it would have been more hours lost for the company than the failure caused.
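
Spelling that arithmetic out with the numbers from the anecdote (the 30 hours of redundancy work is the hypothetical figure above):

    # Figures from the anecdote: one failure in ~10 years, restored in under
    # 2 hours, 15 users partially blocked during the outage.
    users = 15
    outage_hours = 2
    person_hours_lost = users * outage_hours   # <= 30, an upper bound

    redundancy_effort_hours = 30               # hypothetical setup/maintenance over 10 years

    if redundancy_effort_hours >= person_hours_lost:
        print("Building failover would have cost more hours than the failure did")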