From S3 to R2: An economic opportunity

(dansdatathoughts.substack.com)
274 points dangoldin | 11 comments
1. thedaly ◴[] No.38119756[source]
> In fact, there’s an opportunity to build entire companies that take advantage of this price differential and I expect we’ll see more and more of that happening.

Interesting. What sort of companies can take advantage of this?

replies(3): >>38120302 #>>38120310 #>>38121409 #
2. diamondap ◴[] No.38120302[source]
Basically, any company offering specialized services that work with very large data sets. That could be a consumer backup system like Carbonite or a bulk photo-processing service. In either case, legal agreements with customers are key, because you ultimately don't control the storage system on which your business and their data depend.

I work for a non-profit doing digital preservation for a number of universities in the US. We store huge amounts of data in S3, Glacier and Wasabi, and provide services and workflows to help depositors comply with legal requirements, access controls, provable data integrity, archival best practices, etc.
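
The "provable data integrity" piece mostly comes down to periodic fixity checks. A minimal sketch of the idea in Python (not our actual system, and the names are just illustrative): compute a digest for each stored object and compare it against the digest recorded at deposit time.

    import hashlib

    def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
        # Stream the file in 1MB chunks so large objects don't exhaust memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def fixity_ok(path: str, recorded_digest: str) -> bool:
        # True only if the object still matches the digest captured at deposit time.
        return sha256_of(path) == recorded_digest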

There are some for-profits in this space as well. It's not a huge or highly profitable space, but I do think there are other business opportunities out there where organizations want to store geographically distributed copies of their data (for safety) and run that data through processing pipelines.

The trick, of course, is to identify which organizations have a similar set of needs and then build for them. In our case, we've spent a lot of time working around data access costs, and there are some cases where we just can't avoid them. They can be considerable when you're working with large data sets, and if you can solve the problem of data transfer costs from the get-go, you'll be way ahead of many existing services built on S3 and Glacier.

3. dangoldin ◴[] No.38120310[source]
Author here, but some ideas I was thinking about:

- An open source data pipeline built on top of R2: a way of keeping data on R2/S3 but having execution handled in Workers/Lambda. Inspired by what https://www.boilingdata.com/ and https://www.bauplanlabs.com/ are doing. (Rough sketch below.)
- Related to the above, but taking data that's stored in the various big data formats (Parquet, Iceberg, Hudi, etc.), generating many more combinations of the datasets, and choosing the optimal ones based on the workload. You can do this with existing providers, but I think the cost element just makes this easier to stomach.
- Abstracting some of the AI/ML products out there and choosing the best one for the job by keeping the data on R2 and then shipping it to the relevant providers (since data ingress to them is free) for specific tasks.
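
A rough sketch of that first idea, assuming the duckdb Python package and R2's S3-compatible API (bucket name, account ID, and credentials below are placeholders): the Parquet stays on R2, and the query can run wherever the compute happens to live.

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")

    # R2 exposes an S3-compatible endpoint; these values are placeholders.
    con.execute("SET s3_endpoint='<ACCOUNT_ID>.r2.cloudflarestorage.com';")
    con.execute("SET s3_access_key_id='<R2_ACCESS_KEY_ID>';")
    con.execute("SET s3_secret_access_key='<R2_SECRET_ACCESS_KEY>';")
    con.execute("SET s3_region='auto';")
    con.execute("SET s3_url_style='path';")

    # Egress from R2 is free, so this scan costs request fees only,
    # no matter where the query runs.
    print(con.execute(
        "SELECT count(*) FROM read_parquet('s3://my-bucket/events/*.parquet')"
    ).fetchone())

DuckDB only pulls the byte ranges the query needs, so the same pattern works from a laptop, Lambda, or anything Workers-adjacent.
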
4. gen220 ◴[] No.38121409[source]
I'm building a "media hosting site". Based on somewhat reasonable forecasts of egress demand vs total volume stored, using R2 means I'll be able to charge a low take rate that should (in theory) give me a good counterposition to competitors in the space.

Basically, using R2 allows you to undercut competitors' pricing. It also means I don't need to build out a separate CDN to host my files, because Cloudflare will do that for me, too.

Competitors have built out and maintain their own equivalent CDNs and storage solutions that are ~10x more expensive to maintain and operate than going through Cloudflare. Basically, Cloudflare is doing to CDNs and storage what AWS and friends did to compute.
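
Rough numbers to show the size of the gap, using public list prices at the time of writing and ignoring free tiers, request fees, and volume discounts (so treat it as ballpark only):

    # Ballpark monthly bill for 10TB stored + 30TB/month served to the internet.
    stored_gb = 10 * 1000
    egress_gb = 30 * 1000

    s3 = stored_gb * 0.023 + egress_gb * 0.09   # S3 Standard storage + internet egress
    r2 = stored_gb * 0.015 + egress_gb * 0.0    # R2 storage; egress is free

    print(f"S3: ~${s3:,.0f}/mo   R2: ~${r2:,.0f}/mo")   # ~$2,930 vs ~$150

Egress dominates that bill, and egress is exactly the line item R2 zeroes out.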

replies(1): >>38125641 #
5. ddorian43 ◴[] No.38125641[source]
Your competitors can do the same thing though?
replies(1): >>38129712 #
6. gen220 ◴[] No.38129712{3}[source]
That'd be welcome; I'm not really doing it to make money.

But reality is a bit more complicated than that. Migrating data + pointers to that data, en masse, isn't super easy (although things like Sippy make it easier).

In addition, there's all the capex that's gone into building systems around the assumptions of their particular blend of data centers, homegrown CDNs, and mix of storage systems. There's a sunk cost fallacy at play, as well as the inertia of knowing how to maintain the old system and not having any experience with the new one.

It's not impossible, but it'd require a lot of willpower and energy that these companies (who are 10+ years into their life cycles) don't really possess.

Having seen the inside of orgs like that before, starting from scratch is ~10x-100x easier, depending on the blend of bureaucracy on the menu.

replies(1): >>38138931 #
7. ddorian43 ◴[] No.38138931{4}[source]
I'm investigating the same thing. But my bet is that they will either change the terms or lower your CDN cache size (therefore lowering performance; you can't serve popular videos without a CDN).

And the difference is that you will fail your customers when that time comes, because you'll just get suspended (we've seen some cases here on the forum) and you'll have to come here to complain so the CEO/CTO resumes things for you.

replies(1): >>38156930 #
8. gen220 ◴[] No.38156930{5}[source]
I don’t believe anybody on a paid plan has been suspended for using R2 behind the CDN? (I’ve seen the stories you’re alluding to. IIRC the cached files weren’t on R2)

In their docs they explicitly state it as an attractive feature to leverage, so that’d surprise me.

That being said, I’m not planning to serve particularly large files with any meaningful frequency, so in my particular case I’m not concerned about that possibility. (I’m distributing low-bitrate audio and small images, mostly.)

If I were trying to build YouTube or whatever I’d be more concerned.

That said, with their storage pricing and network set up the way they are, I think they’d make plenty of money off of a hypothetical YouTube clone.

I do think they’ll raise prices eventually. But it’s a highly competitive space, so it feels like there’s a stable ceiling.

replies(1): >>38159599 #
9. ddorian43 ◴[] No.38159599{6}[source]
See https://news.ycombinator.com/item?id=34639212. They got suspended for using Workers behind the CDN.

> I’m distributing low bitrate audio, and small images, mostly

This means the cache size would be much smaller, though.

replies(1): >>38163992 #
10. gen220 ◴[] No.38163992{7}[source]
Right, but they were serving content that wasn't from R2, as far as I understand from that thread. Not trying to say that justifies their treatment, only that it doesn't apply to my use case. They were also seeing ~30TB of daily egress on a non-enterprise plan, which would absolutely never happen in my case – 1TB of daily egress would be a p99.9 event.

Re cache size, maybe I've misunderstood what you mean by cache-size limiting, but yeah, that's my point – I don't need a massive cache for my application. My data doesn't lend itself much to large and distributed spikes. Egress is spiky, but concentrated on a few files at a time. E.g., if there were a single day where 1TB was downloaded, 80% of it would be concentrated in ~20 files of ~400MB each.
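
Spelling out the working-set math for that 1TB day (my own rough numbers, same assumptions as above):

    hot_files = 20
    file_size_gb = 0.4                  # ~400MB each
    egress_gb = 0.8 * 1000              # 80% of a 1TB day hits the hot files

    working_set_gb = hot_files * file_size_gb
    print(working_set_gb, egress_gb)    # ~8GB of unique content serves ~800GB of egress

So even a modest cache allowance covers the bulk of that traffic.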

replies(1): >>38174062 #
11. ddorian43 ◴[] No.38174062{8}[source]
He was OK by the terms, though. Workers had/have the same terms that R2 had before R2 got its new terms.

> They were also seeing ~30TB of daily egress on a non-enterprise plan, which would absolutely never happen in my case – 1TB of daily egress would be a p99.9 event.

I don't understand what media companies you'll be competing against if you're using just 30TB/month of bandwidth.