580 points by huntaub | 8 comments

Hey HN, I’m Hunter, the founder of Regatta Storage (https://regattastorage.com). Regatta Storage is a new cloud file system that provides unlimited pay-as-you-go capacity, local-like performance, and automatic synchronization to S3-compatible storage. For example, you can use Regatta to instantly access massive data sets in S3 with Spark, PyTorch, or pandas without paying for large, local disks or waiting for the data to download.
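Here's a rough sketch of what that looks like with pandas, assuming a Regatta file system mounted at /mnt/regatta (the mount point and dataset path are made up for illustration):

    import pandas as pd

    # Hypothetical: the mount at /mnt/regatta mirrors an S3 bucket that
    # holds a large Parquet dataset. Reads are served through the cache
    # layer, so nothing has to be downloaded to a local disk first.
    df = pd.read_parquet("/mnt/regatta/datasets/events/part-00000.parquet")
    print(df.head())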

Check out an overview of how the service works here: https://www.youtube.com/watch?v=xh1q5p7E4JY, and you can try it for free at https://regattastorage.com after signing up for an account. We wanted to let you try it without an account, but we figured that “Hacker News shares a file system and S3 bucket” wouldn’t be the best experience for the community.

I built Regatta after spending nearly a decade building and operating at-scale cloud storage at places like Amazon’s Elastic File System (EFS) and Netflix. During my 8 years at EFS, I learned a lot about how teams thought about their storage usage. Users frequently told me that they loved how simple and scalable EFS was, and -- like S3 -- they didn’t have to guess how much capacity they needed up front.

When I got to Netflix, I was surprised that there wasn’t more usage of EFS. If you looked around, it seemed like a natural fit. Every application needed a POSIX file system. Lots of applications had unclear or spiky storage needs. Often, developers wanted their storage to last beyond the lifetime of an individual instance or container. In fact, if you looked across all Netflix applications, a ridiculous amount of money was being spent on empty storage space because each of these local drives had to be overprovisioned for potential usage.

However, in many cases, EFS wasn’t the perfect choice for these workloads. Teams moving workloads from local disks to NFS often ran into performance issues. Further, applications which treated their local disks as ephemeral would have to manually “clean up” leftover data in a persistent storage system.

At this point, I realized that there was a missing solution in the cloud storage market which wasn’t being filled by either block or file storage, and I decided to build Regatta.

Regatta is a pay-as-you-go cloud file system that automatically expands with your application. Because it automatically synchronizes with S3 using native file formats, you can connect it to existing data sets and use recently written file data directly from S3. When data isn’t actively being used, it’s removed from the Regatta cache, so you only pay for the backing S3 storage. Finally, we’re developing a custom file protocol which allows us to achieve local-like performance for small-file workloads and Lustre-like scale-out performance for distributed data jobs.
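To illustrate the “native file formats” point: the same data is visible both as plain S3 objects and as ordinary files under the mount, so existing tools can use a data set without a copying step. A sketch, with a hypothetical bucket name and mount point:

    import os
    import boto3

    BUCKET = "my-existing-bucket"  # hypothetical bucket name
    MOUNT = "/mnt/regatta"         # hypothetical mount point

    # The same data is visible as plain S3 objects...
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix="models/")
    for obj in resp.get("Contents", []):
        print("s3 key:", obj["Key"])

    # ...and as ordinary files under the mount, so tools that expect a
    # POSIX path can work with the data set directly.
    for name in os.listdir(os.path.join(MOUNT, "models")):
        print("file:", name)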

Under the hood, customers mount a Regatta file system by connecting to our fleet of caching instances over NFSv3 (soon, our custom protocol). Our instances then connect to the customer’s S3 bucket on the backend, and provide sub-millisecond cached-read and write performance. This durable cache allows us to provide a strongly consistent, efficient view of the file system to all connected file clients. We can perform challenging operations (like directory renaming) quickly and durably, while they asynchronously propagate to the S3 bucket.
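Concretely, the mount step is a standard NFSv3 mount. A sketch from Python, with a made-up endpoint standing in for the per-file-system endpoint you'd get from us:

    import subprocess

    ENDPOINT = "fs-0123456789.regattastorage.example:/"  # hypothetical endpoint
    MOUNT_POINT = "/mnt/regatta"

    # Standard NFSv3 mount; requires root privileges on the client.
    subprocess.run(
        ["mount", "-t", "nfs", "-o", "nfsvers=3", ENDPOINT, MOUNT_POINT],
        check=True,
    )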

We’re excited to see users share our vision for Regatta. We have teams who are using us to build totally serverless Jupyter notebook servers for their AI researchers who prefer to upload and share data using the S3 web UI. We have teams who are using us as a distributed caching layer on top of S3 for low-latency access to common files. We have teams who are replacing their thin-provisioned Ceph boot volumes with Regatta for significant savings. We can’t wait to see what else people will build, and we hope you’ll give us a try at regattastorage.com.

We’d love to get any early feedback from the community, ideas for future direction, or experiences in this space. I’ll be in the comments for the next few hours to respond!

1. koolba No.42174305
Neat stuff. I think everybody with an interest in NFS has toyed with this idea at some point.

> Under the hood, customers mount a Regatta file system by connecting to our fleet of caching instances over NFSv3 (soon, our custom protocol). Our instances then connect to the customer’s S3 bucket on the backend, and provide sub-millisecond cached-read and write performance. This durable cache allows us to provide a strongly consistent, efficient view of the file system to all connected file clients. We can perform challenging operations (like directory renaming) quickly and durably, while they asynchronously propagate to the S3 bucket.

How do you handle the cache server crashing before syncing to S3? Do the cache servers have local disk as well?

Ditto for how to handle intermittent S3 availability issues?

What are the fsync guarantees for file append operations and directories?

replies(1): >>42174365 #
2. huntaub No.42174365
Thanks for the question!

> How do you handle the cache server crashing before syncing to S3? Do the cache servers have local disk as well?

Our caching layer is highly durable, which is (in my opinion) the key to doing this kind of staging. This means that once a write completes to Regatta, we guarantee that it will eventually complete on S3.

For this reason, server crashes and intermittent S3 availability issues are not a problem because we have the writes stored safely.

> What are the fsync guarantees for file append operations and directories?

We have strong read-after-write consistency for all connected file system clients -- including for operations which aren't possible to perform efficiently on S3 (such as renames, appends, etc.). We asynchronously push those writes to S3, so it may be a few minutes before you can access them directly from the bucket. But during this time, the file system interface will always reflect the up-to-date view.
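As an illustrative sketch (the bucket, mount point, and key are hypothetical), a client can observe that window by writing through the file system and polling the bucket:

    import os
    import time
    import boto3
    from botocore.exceptions import ClientError

    MOUNT = "/mnt/regatta"         # hypothetical mount point
    BUCKET = "my-existing-bucket"  # hypothetical backing bucket

    # The write is immediately visible to every connected client...
    os.makedirs(f"{MOUNT}/results", exist_ok=True)
    with open(f"{MOUNT}/results/run-42.json", "w") as f:
        f.write('{"loss": 0.17}')

    # ...but may take a few minutes to land in the bucket itself.
    s3 = boto3.client("s3")
    while True:
        try:
            s3.head_object(Bucket=BUCKET, Key="results/run-42.json")
            print("object now visible directly in S3")
            break
        except ClientError:
            time.sleep(30)  # not flushed yet; keep polling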

replies(3): >>42174934 #>>42175879 #>>42175912 #
3. paulgb No.42174934
Congrats on the launch, this is really cool! Is the durable cache an attached disk, or are you using a separate AWS product for that?
replies(1): >>42174959 #
4. huntaub No.42174959
Without getting too much into the details of the system, our durable cache is designed for 5 9s of durability (and we're working on a version that will provide 11 9s of durability soon). You can't achieve those durability numbers on a single attached NVMe device without some kind of replication.
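A back-of-the-envelope sketch of why (toy numbers, assuming independent device failures, which real systems only approximate):

    import math

    p_single = 0.01  # assumed annual data-loss probability for one device

    # With n independent replicas (ignoring repair windows), data is lost
    # only if every replica fails: p_loss = p_single ** n.
    for n in (1, 2, 3):
        p_loss = p_single ** n
        print(f"{n} replica(s): p_loss={p_loss:.0e} (~{-math.log10(p_loss):.0f} nines)")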
5. koolba No.42175879
Is it fair to say this is best suited for small files that will be written infrequently?

There’s no partial write for S3, so editing a small range of a 1 GiB file would repeatedly upload the full file to the backing S3, right?

Or is the S3 representation not the same hierarchy as the presented mount point? (i.e., something opaque like a log-structured / append-only chunked list)

replies(1): >>42176003 #
6. the_duke No.42175912
So, I assume you use a journal in the cache server.

A few related questions:

* Do you use a single leader for a specific file system, or do you have a cluster solution with consensus to enable scaling/redundancy?

* How do you guarantee read-after-write consistency? Do you stream the journal to all clients and wait for them to ack before the write finishes? Or at least wait for everyone to ack the latest revisions for files, while the content is streamed out separately/requested on demand?

* If the above is true, I assume this is strictly viable for single-DC usage due to latency? Do you support different mount options for different consistency guarantees?

replies(1): >>42175986 #
7. huntaub No.42175986
These are questions that are super specific to our implementation, which I'm hesitant to share publicly because the details could change at any time. I can share that we're designed to horizontally scale the performance of each file system, and our custom protocol will enable Lustre-like scale-out performance. As for single- vs. multi-DC, I think you'd be surprised at how much latency budget there is (a cross-DC round trip in AWS can be anywhere from 200µs to 700µs, and EBS gp3 latencies are around 1,000µs).
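A quick sketch of that budget in microseconds, using the figures above:

    # Latency budget using the round-trip figures quoted above (microseconds).
    ebs_gp3_us = 1000
    for cross_dc_rtt_us in (200, 700):
        headroom_us = ebs_gp3_us - cross_dc_rtt_us
        print(f"cross-DC RTT of {cross_dc_rtt_us}us still beats a gp3 access by {headroom_us}us")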
8. huntaub No.42176003
It's hard to define "best", and in many cases, the answers to these questions depend heavily on the workload and the caching parameters (how long we wait before flushing to S3, etc.). We are designed to provide good file system performance even if customers are repeatedly writing small pieces of data to a 1 GiB file, so "best" in this case is a question of whether or not it's cost efficient.
It's hard to define "best", and in many cases, the answers to these questions depend heavily on the workload and the caching parameters (how long do we wait before flushing to S3, etc). We are designed to provide good file system performance, even if customers are repeatedly writing small pieces of data to a 1 GiB file, so "best" in this case is a question of whether or not it's cost efficient.