Check out an overview of how the service works here: https://www.youtube.com/watch?v=xh1q5p7E4JY, and you can try it for free at https://regattastorage.com after signing up for an account. We wanted to let you try it without an account, but we figured that “Hacker News shares a file system and S3 bucket” wouldn’t be the best experience for the community.
I built Regatta after spending nearly a decade building and operating at-scale cloud storage at places like Amazon’s Elastic File System (EFS) and Netflix. During my 8 years at EFS, I learned a lot about how teams thought about their storage usage. Users frequently told me that they loved how simple and scalable EFS was, and -- like S3 -- they didn’t have to guess how much capacity they needed up front.
When I got to Netflix, I was surprised that there wasn’t more usage of EFS. If you looked around, it seemed like a natural fit. Every application needed a POSIX file system. Lots of applications had unclear or spikey storage needs. Often, developers wanted their storage to last beyond the lifetime of an individual instance or container. In fact, if you looked across all Netflix applications, some ridiculous amount of money was being spent on empty storage space because each of these local drives had to be overprovisioned for potential usage.
However, in many cases, EFS wasn’t the perfect choice for these workloads. Moving workloads from local disks to NFS often encountered performance issues. Further, applications which treated their local disks as ephemeral would have to manually “clean up” left over data in a persistent storage system.
At this point, I realized that there was a missing solution in the cloud storage market which wasn’t being filled by either block or file storage, and I decided to build Regatta.
Regatta is a pay-as-you-go cloud file system that automatically expands with your application. Because it automatically synchronizes with S3 using native file formats, you can connect it to existing data sets and use recently written file data directly from S3. When data isn’t actively being used, it’s removed from the Regatta cache, so you only pay for the backing S3 storage. Finally, we’re developing a custom file protocol which allows us to achieve local-like performance for small-file workloads and Lustre-like scale-out performance for distributed data jobs.
Under the hood, customers mount a Regatta file system by connecting to our fleet of caching instances over NFSv3 (soon, our custom protocol). Our instances then connect to the customer’s S3 bucket on the backend, and provide sub-millisecond cached-read and write performance. This durable cache allows us to provide a strongly consistent, efficient view of the file system to all connected file clients. We can perform challenging operations (like directory renaming) quickly and durably, while they asynchronously propagate to the S3 bucket.
We’re excited to see users share our vision for Regatta. We have teams who are using us to build totally serverless Jupyter notebook servers for their AI researchers who prefer to upload and share data using the S3 web UI. We have teams who are using us as a distributed caching layer on top of S3 for low-latency access to common files. We have teams who are replacing their thin-provisioned Ceph boot volumes with Regatta for significant savings. We can’t wait to see what other things people will build and we hope you’ll give us a try at regattastorage.com.
We’d love to get any early feedback from the community, ideas for future direction, or experiences in this space. I’ll be in the comments for the next few hours to respond!
Snowflake and Databricks aren't storage products, but are managed compute platforms on top of storage that probably looks a lot like this. Snowflake allows you to easily connect different data sets to your data warehouse, and Databricks provides a managed analytics (Spark) offering.
Regatta, on the other hand, would allow you to more easily build the next Snowflake or Databricks by taking advantage of the same low-cost, unlimited storage in S3 that they likely use.
> Under the hood, customers mount a Regatta file system by connecting to our fleet of caching instances over NFSv3 (soon, our custom protocol). Our instances then connect to the customer’s S3 bucket on the backend, and provide sub-millisecond cached-read and write performance. This durable cache allows us to provide a strongly consistent, efficient view of the file system to all connected file clients. We can perform challenging operations (like directory renaming) quickly and durably, while they asynchronously propagate to the S3 bucket.
How do you handle the cache server crashing before syncing to S3? Do the cache servers have local disk as well?
Ditto for how to handle intermittent S3 availability issues?
What are the fsync guarantees for file append operations and directories?
> How do you handle the cache server crashing before syncing to S3? Do the cache servers have local disk as well?
Our caching layer is highly durable, which is (in my opinion) the key for doing this kind of staging. This means that once a write is complete to Regatta, we guarantee that it will eventually complete on S3.
For this reason, server crashes and intermittent S3 availability issues are not a problem because we have the writes stored safely.
> What are the fsync guarantees for file append operations and directories?
We have strong, read-after-write consistency for all connected file system clients -- including for operations which aren't possible to perform on S3 efficiently (such as renames, appends, etc). We asynchronously push those writes to S3, so there may be a few minutes before you can access them directly from the bucket. But, during this time, the file system interface will always reflect the up-to-date view.
An annoying feature of EFS is how it scales with amount of storage, so when its empty its very slow. We also started hitting its limits so could not scale our compute workers. Both can be solved by paying for the elastic iops but that is VERY expensive.
Can you support these operations with the expected semantics and performance?
If the application makes a one-byte change to a giant file and calls fdatasync, what happens? Do you re-upload the entire file to S3?
How do you handle a rename? Applications commonly do this for atomic replacement on POSIX and expect three properties from this operation:
* fast. * destination always points to either the original or new afterward (on success or failure); no scenario at which it's lost/truncated. * no extra storage used (on success or failure).
Do you guarantee any of those? How? I don't see an obvious way from the S3 HTTP API.
Given that POSIX API doesn't support things like arbitrary per-operation deadlines/timeouts, do you think it's suitable as a distributed filesystem API at all? Why?
Once the operation is stored in our durable cache, then we update your S3 bucket to match what the file system expects. This generally takes around a minute, but could take longer depending on the number of S3 operations a file operation translates to (for example, a directory rename requires that CopyObject each object in the directory in S3).
I think that the POSIX API is to here to stay (like the S3 API). I agree that it would be better to have timeouts and deadlines, but I don't think that those make it impossible to provide a good distributed file system experience on POSIX (look at Amazon's EFS, Oracle's FSS, Google's FileStore, etc). It just makes the bar for availability higher.
I believe they stopped supporting that mode because they didn’t want to keep chasing every S3 protocol change. However, if you’re just using S3, and not trying to masquerade as S3, this problem becomes easier.
1. If I had a local disk which was 10 GB, what happens when I try to contend with data in the 50 GB range (as in, more that could be cached locally?) Would I immediately see degradation, or thrashing, at the 10 GB mark?
2. Does this only work in practice on AWS instances? As in, I could run it on a different cloud, but in practice we only really get fast speeds due to running everything within AWS?
3. I've always had trouble with FUSE in different kinds of docker environments. And it looks like you're using both FUSE and NFS mounts. How does all of that work?
4. Is the idea that I could literally run Clickhouse or Postgres with a regatta volume as the backing store?
5. I have to ask - how do you think about open source here?
6. Can I mount on multiple servers? What are the limits there? (ie, a lambda function.)
I haven't played with the so maybe doing so would help answer questions. But I'm really excited about this! I have tried using EFS for small projects in the past but - and maybe I was holding it wrong - I could not for the life of me figure out what I needed to get faster bandwidth, probably because I didn't know how to turn the knobs correctly.
1. For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.
... I'm kidding, this is quite useful.
I really wish that NFSv3 and Linux had built-in file hashing ioctls that could delegate some of this expensive work to the backend as it would make it much easier to use something like this as a backup accelerator.
> If I had a local disk which was 10 GB, what happens when I try to contend with data in the 50 GB range (as in, more that could be cached locally?) Would I immediately see degradation, or thrashing, at the 10 GB mark?
We don't actually do caching on your instance's disk. Instead, data is cached in the Linux page cache (in memory) like a regular hard drive, and Regatta provides a durable, shared cache that automatically expands with the working set size of your application. For example, if you were trying to work with data in the 50 GiB range, Regatta would automatically cache all 50 GiB -- allowing you to access it with sub-millisecond latency.
> Does this only work in practice on AWS instances? As in, I could run it on a different cloud, but in practice we only really get fast speeds due to running everything within AWS?
For now, yes -- the speed is highly dependent on latency -- which is highly dependent on distance between your instance and Regatta. Today, we are only in AWS, but we are looking to launch in other clouds by the end of the year. Shoot me an email if there's somewhere specifically that you're interested in.
> I've always had trouble with FUSE in different kinds of docker environments. And it looks like you're using both FUSE and NFS mounts. How does all of that work?
There are a couple of different questions bundled together in this. Today, Regatta exposes an NFSv3 file system that you can mount. We are working on a new protocol which will be mounted via FUSE. However, in Docker environments, we also provide a CSI driver (for use with K8s) and a Docker volume plugin (for use with just Docker) that handles the mounting for you. We haven't released these publicly yet, so shoot me an email if you want early access.
> Is the idea that I could literally run Clickhouse or Postgres with a regatta volume as the backing store?
Yes, you should be able to run a database on Regatta.
> I have to ask - how do you think about open source here?
We are in the process of open sourcing all of the client code (CSI driver, mount helper, FUSE), but we don't have plans currently to open source the server code. We see the value of Regatta in managing the infrastructure so you don't have to, and if we release it via open-source, it would be difficult to run on your own.
> Can I mount on multiple servers? What are the limits there? (ie, a lambda function.)
Yes, you can mount on multiple servers simultaneously! We haven't specifically stress-tested the number of clients we support, but we should be good for O(100s) of mounts. Unfortunately, AWS locks down Lambda so we can't mount arbitrary file systems in that environment specifically.
> efs performance
Yes, the challenge here is specifically around the semantics of NFS itself and the latency of the EFS service. We think we have a path to solving both of these in the next month or two.
> I really wish that NFSv3 and Linux had built-in file hashing ioctls that could delegate some of this expensive work to the backend as it would make it much easier to use something like this as a backup accelerator.
Tell me a bit more about what you mean here. We're interested in really pushing the limits of what a storage system can do, so I'd be potentially interested.
Also, NFSv3 and not 4?
Philosophically, our goal is to build a standard that can be used in these kinds of applications moving forward, so that application developers don't need to build streaming over and over again and users don't need to learn how to configure each individual systems' caching.
We selected NFSv3 due to it's broad compatibility with different compute environments. For example, Windows has an NFSv3 client in it, but doesn't have an NFSv4 client. There are lots of enterprise workloads which needs simultaneous access to file data from both Windows and Linux, and supporting NFSv3 was the easiest path to support those workloads.
On the other hand, Regatta is designed as a cloud-native gateway product. Regatta's elastic, durable caching layer allows us to efficiently cache large data sets without thrashing, and always efficiently perform writes. Because Regatta is designed to be highly-available, customers don't have to worry about downtime for patching or deployments.
When the cache got full, catfs would evict things from it pretty simply. It's overall got a good design but has a few bugs you have to fix, and when you have 100 machines connecting to it, it requires some tuning to make sure that it doesn't all stall. But it worked for the most part.
Anyway, I think this is cool tech. I'm currently doing some bioinformatics stuff that this might help with (each genome sequence is some 100 GiB compressed). I'll give it a shot some time in the next couple of months.
Service Tier: Zonal
Location: us-central1
10 TiB instance at $0.35/TiB/hr
Monthly cost: $2,560.00
Performance Estimate:
Read IOPS: 92,000
Write IOPS: 26,000
Read Throughput: 2,600 MiB/s
Write Throughput: 880 MiB/s
So you are saying that regatta's own SaaS infrastructure provides the disk caching layer. So you all make sure the pipe between my AWS instance and your servers are very fast and "infinitely scalable", and then the sync to S3 happens after the fact.
But anyway, from your YCombinator blurb:
"When you’re done editing data, it automatically flows back to S3 within a few minutes"
Does this mean Regatta trades consistency for cost (S3 and EBS and local storage are all CP systems these days)?In some sense, yes. But, the consistency that you're trading is only for accessing data simultaneously through the file interface and the S3 interface simultaneously. The consistent is CP/strong when you access the data through the file interface. The model that we see most often work is folks will ingest data through S3 (for example, an 'input/' prefix), and then the file system will process that data and place it in a different directory (for example, an 'output/' folder). Then, if it takes a minute or two for those to update on the other side, it's not a big deal.
I’m specifically interested in how you’re handling synchronization between the NFS layer and S3 wrt fsync. The description says that data is “asynchronously” written back out to S3. That implies to me that it’s possible for something like this to happen:
1. I write to a file and fsync it
2. Your NFS layer makes the file durable and returns
3. Your NFS layer crashes (oh no, the intern merged some bad terraform!) before it writes back to S3
4. I go to read the file from S3… and it’s not there!
Is that possible? IE is the only way to get a consistent view of the data by reading “through” the nfs layer, even if I fsync?
Why local storage? We’re going to have multiple processes reading & writing to the files and need locking & shared memory semantics you can’t get w/ NFS. I could implement pin/unpin myself in user space by copying stuff between /mnt/magic-nfs and /mnt/instance-nvme but at that point I’d just use S3 myself.
Any thoughts about providing a custom file system or how to assemble this out of parts on top of the NFS mount?
Every few months of this spend is like buying a server
Edit: back at my pc and checked, relevant bare metal is ~$500/m, amortized:
https://baremetalsavings.com/c/LtxKMNj
Edit 2: for 100tb..
We are running it as a managed SaaS, so our customers connect to the caching layer that runs in the Regatta VPC. This allows us to manage the infrastructure for them and keep costs low.
Storage Gateway is an interesting product, and I worked closely with that team for several years -- so mad respect for them. It was designed to be an appliance that you run on servers in your own data center (of course, many customers now deploy it to EC2). Because of this, it's designed to operate in an environment with "finite storage" -- for example, different workload pattterns can thrash the cache, which results in poor performance to clients, and it's not designed to run in a high-availability cluster in the cloud. Regatta solves these problems with durable cache storage that's safe to data in long-term, and is designed for high-availability.
“All fsync-ed writes will eventually make it to S3, but fsync successfully returning only guarantees that writes are durable in our NFS caching layer, not in the S3 layer”?
Out of curiosity, why did you choose EFS, it's insanely expensive at even modest scales?
Great question! We fill the same role as AWS Storage Gateway (and I used to work closely with that team when I was at AWS, lots of respect for what they do). AWS Storage Gateway is built primarily as an appliance to be installed on instances in your own data center to ease migration to the cloud. Many customers do deploy Storage Gateway on EC2 because they want these features in the cloud itself. However, the "appliance" design of Storage Gateway makes it unsuitable for this purpose. For example, Storage Gateway is not designed to run in a cluster for high-availability and doesn't have access to durable, long-term storage to stage and cache writes.
On the other hand, Regatta is designed as a cloud-native gateway product. Regatta's elastic, durable caching layer allows us to efficiently cache large data sets without thrashing, and always efficiently perform writes. Because Regatta is designed to be highly-available, customers don't have to worry about downtime for patching or deployments.
* What does "$0.05 / gigabyte transferred" mean exactly. Transferred outside of AWS or accessed as in read and written data?
* "$0.20/GiB-mo of high-speed cache" – how is the high-speed cache amount computed?
Found this in the docs:
> By default, Regatta file systems can provide up to 10 Gbps of throughput and 10,000 IOPS across all connected clients.
Is that the lower bound? The 50 TiB filestore instance has 104 Gbps read through put (albeit at a relatively high price point).
We need to update the home page with these details, but $0.05 is only charged on transfer between Regatta and S3. We calculate your cache usage minutely and tally it into a monthly usage amount that we then bill for.
we're using Filestore out of convenience right now, but actively exploring alternatives.
In terms of storing in s3 - is that in your buckets? Sound like the plan is to run the caching on your infrastructure, are there plans to allow customers to run those instances themselves?
Presumably the format within s3 is your own bespoke format? What does the migration strategy look like for people looking to move into or out of your infrastructure? They effectively pull everything down from their s3 to the local “filesystem”?
You don't actually directly charge for storage itself, so I assume this a "bring your own s3 bucket" type of deal, correct?
How long does data, that is no longer being accessed sit in the cache and count towards billing?
As for availability, are you in the process or do you have plans to also support Google Cloud?
That's correct -- we store data in the customer S3 bucket.
> How long does data, that is no longer being accessed sit in the cache and count towards billing?
We keep data in the cache for up to 1 hour after you've stopped accessing it.
> As for availability, are you in the process or do you have plans to also support Google Cloud?
We have plans to support Google Cloud. If you're interested in using us from GCP, I'd recommend setting up some time to chat (either use the website or email me at hleath [at] regattastorage.com). We are prioritizing where we launch our infrastructure next based on customer demand.
When I got in touch about that, I was confronted with a wall of TCO papers, which tells me the product managers evidently believe their target segment to be Gartner-following corporate drones. This was a further deterrent.
We threw that idea away and used memcached instead, with common static files in a package in S3.
I guess I’m suggesting, don’t be like EFS when it comes to pricing or reaching customers.
There’s no partial write for s3 so editing a small range of a 1 GiB file would repeatedly upload the full file to the backing s3 right?
Or is the s3 representation not the same hierarchy as the presented mount point? (ie something opaque like a log structured / append only chunked list)
A few related questions:
* Do you use a single leader for a specific file system, or do you have a cluster solution with consensus to enable scaling/redundancy?
* How do you guarantee read-after-write consistency? Do you stream the journal to all clients and wait for them to ack before the write finishes? Or at least wait for everyone to ack the latest revisions for files, while the content is streamed out separately/requested on demand?
* If the above is true, I assume this is strictly viable for single-DC usage due to latency? Do you support different mount options for different consistency guarantees?
I'm also using Cloudflare R2 (S3 compatible) and would love for that to work out of the box
I wouldn't want to host fastqs or something and use this for alignment, but for spot checking raw fastqs it could be nice
But it's not clear how it handles file update conflicts. For example: if User A updates File X on one computer, and User B updates File X on another computer, what does the final file look like in S3?
Right now we spend a lot of time downloading various stuff from HTTP or S3 links and then figuring out folder structures to keep them in our S3 buckets. Pooch really simplifies the caching for this by having a deterministic path on your local storage for downloaded files, but has no S3 backend.
So a combination of 2 would be to just have 1 call to a link that would embed the caching both locally and on our S3 buckets deterministically.
[0] https://www.fatiando.org/pooch/latest/ [1] https://s3fs.readthedocs.io/en/latest/
SF3s/boto/botocore versions x Scala/Spark x parquet x iceberg x k8s etc readers own assumptions makes reading from S3 alone a maintenance and compatibility nightmare.
Will the mounted system _really_ be accessible as local fs and seen as such to all running processes? No surprises? No need for python specific filesystem like S3Fs?
If so then you will win 100% I wouldn't even care about speed/cost if it's up to par with s3
> Will the mounted system _really_ be accessible as local fs and seen as such to all running processes? No surprises? No need for python specific filesystem like S3Fs?
Ha, well it depends on what you mean by surprises. We won't have a Python-specific file system. Our client is going to come in two flavors. Today, you can mount Regatta over NFSv3 (which we wrap in TLS to make it secure). This works for some workloads, but doesn't provide like-for-like performance with EBS. Over the next month, we plan to release the "custom protocol" that I wrote about above, that we expect to send to customers in the form of a FUSE file system.
Either way, it should be one package, you shouldn't need to worry about versioning, and it will appear as a real, local file system. :D
Rclone, on the contrary, has no layer that would guarantee consistency among parallel clients.
[0] https://docs.regattastorage.com/details/architecture#overvie...
I'd love to know a bit more about why you're looking for an open source alternative. Is it because of costs (i.e. you'd like an open source alternative that doesn't require you to pay) or if it's because of the operating environment (i.e. you want an open source alternative so that you can deploy it to your own infrastructure)? There are some things that we are exploring around deploying onto your own infrastructure over the next 12 months, but I'd love to learn more. Feel free to respond here or email me at hleath [at] regattastorage.com.
What's cool about the storage market is that there are so many impressive companies because there are so many varied needs from customer applications! We're hoping to become a simple "default" for teams who are writing applications in the cloud.
Perfect.
For IBM, I wrote a crypto filesystem that works similarly in concept, except it was a kernel filesystem. We crypto split the blocks up into 4 parts, stored into cache. A background daemon listened to events and sync'ed blocks to S3 orchestrated with a shared journal.
It's pure magic when you mount a filesystem on clean machine and all your data is "just there."
That kickstarted about a decade in (actual) research and development led by my team which positioned the Bucharest center as one of the most prolific centers in distributed systems within Adobe and of Adobe within Romania.
But I didn't come up with the concept, it was Richard Jones that inspired us with the Gmail drive that used FUSE with gmail attachments back in 2004 when I got my first while still in college https://en.wikipedia.org/wiki/GMail_Drive. I guess I'm old, but I find it funny to see Launch HN: Regatta Storage (YC F24) – Turn S3 into a local-like, POSIX cloud FS
I totally agree! I am hoping that Regatta can power a future where teams don't need more than ~8 GiB of local storage for their operating system, and can store the rest on something like Regatta to get rid of the waste of overprovisioned block volumes.
If you want early access to other clouds or the CSI driver, feel free to email hleath [at] regattastorage.com.
Thanks for the question! Mountpoint for Amazon S3 is a FUSE layer that doesn't support full POSIX semantics. For example, you can't use Mountpoint for Amazon S3 for random writes to existing files, appends, or renames. This means that you have to carefully instrument your application to understand whether or not it's compatible with Mountpoint, which can be error-prone. Regatta, on the other hand, provides full POSIX compatibility for the file interface, which means that it works out-of-the-box with all file based applications.
We had built an MLOps platform[0] a few years ago and enabled users to use their S3 buckets in a "file system like" manner. This made it possible for them not to have to know or write S3 specific code in their Jupyter notebooks as most people in the industry did with boto3, which also forced them to write code (say using TensorFlow) in a certain way for training to consume the files, err, objects. It was a mess, and we removed that for notebooks that could run the same way on a laptop or on the platform, even with the shell kernel so people could explore objects like files. MLFlow could work on a filesystem or on S3, but it had no authentication, so we built around that to know which user/experiment produced which artifact.
MinIO had a Gateway that was deprecated. We didn't use it much and they didn't have an admin client at the time, so I rolled one up to orchestrate the thing.
One way I did it that hook into users' compute and storage as opposed to offering storage/compute was for two reasons:
- Organizations already had their data somewhere with established policies. Getting them to move that data is very hard (CISO, CTO, IT, legal, engineers). Friction would have been huge.
- Organizations already had budgeted compute and storage, they may have had contracts/discounts/credits with cloud providers and it didn't make sense to ask them to make a decision on budgeting for another solution.
- A design principle of having the product being able to die without leaving the users scrambling to exfil/migrate data.
One way to do it was to handle FUSE, and your mileage may vary (s3fs-fuse, goofys, etc). Amazon has released Mountpoint last year[1], and one question you'll get asked is why use Regatta when I could use Mountpoint?
Less friction for engineers and execs.
In any way, congratulations on the launch, man!
[0]: https://web.archive.org/web/20230325150132/https://iko.ai/
[1]: https://aws.amazon.com/blogs/aws/mountpoint-for-amazon-s3-ge...
they are all the same and they are all more than what would at the surface seem that it's "just files" the whole OS, especially Linux/UNIX is "just files" and if you look deeper at databases you can see how it boils down to the file formats (something that was visible with LevelDB but maybe less so with RocksDB, I guess)
I agree around the questions with Mountpoint, and we're solving a very different set of problems than Mountpoint. Mountpoint, for example, isn't designed to be used with all file applications and lacks support for things like appends to existing files, random writes, renames, and symbolic links. On the other hand, Regatta supports POSIX semantics and can work with nearly all file based applications.
Or does Regatta only have access to filesystem metadata -- enough to do POSIX stuffs like locks, mv, rm -- but the file contents themselves remain encrypted end-to-end?
Source: personal experience, I've done the EFS path and the S3-like path within the same system, and the latter was much easier to develop for and troubleshoot performance. It's also far cheaper to operate.
You can have local caching, rapid "read what I wrote", etc. with very little engineering cost, no one at my company is dedicated to this because the abstraction is ridiculously simple:
1. It's object storage, not a file system. Embrace immutability.
2. When you write to S3, cache locally as well.
3. When you read from S3, check the cache first. Optionally cache locally on reads from S3.
4. Set cache sizes so you don't blow out local storage.
5. Tier your caches when needed to increase sharing. (Immutability makes this trivially safe.)
All that's left is to manage 'checked out files' which is pretty easy when almost all of them are immutable anyway.
However, like the S3 protocol, I think that the file protocol is cemented in time as something that we will be using 100 years from now. For example, most AI applications do still download data sets to local file system devices to actually load and use, this is why you see a lot of HPC workloads use things like Lustre. Postgres, SQLite, etc all use file system semantics to operate the database.
I totally respect folks who rewrite their applications to work directly with S3, but as you point out, it comes with a different set of challenges (around caching and chunking).
If you run a single-digit number of servers and replace them every 5 years you will probably never get a hardware failure. If you're unlucky and it still happens get someone to diagnose what's wrong, ship replacement parts to the data center and pay their tech to install them in your server.
Bare metal at scale is difficult. A small number of bare metal servers is easy. If your needs are average enough you can even just rent them so you don't have capital costs and aren't responsible for fixing hardware issues.
definitely the thing I want to hear more about. Also, I can't help shake the "what's the catch, how is no one else doing this, or are they doing it quietly?" feeling.
It's similar to JuiceFS, but JuiceFS writes and reads data from S3 in a proprietary block format. This means that you cannot connect JuiceFS to existing data sets in S3, and you cannot use data written through JuiceFS from the S3 API directly. On the other hand, Regatta reads and writes data to S3 using it's native format -- so you can do these things!
rclone can work with AWS' different offerings, some of which at least partially address this: https://aws.amazon.com/blogs/aws/new-amazon-s3-express-one-z...
As for Express One Zone providing consistency, it would make more groups of operations consistent, provided that the clients could access the endpoints with low latency. It wouldn't be a guarantee but it would be practical for some applications. It depends on what the problem is - for instance, do you want someone to never see noticeably stale data? I can definitely see that happening with Express One Zone if it's as described.
It's similar to JuiceFS, but JuiceFS writes and reads data from S3 in a proprietary block format. This means that you cannot connect JuiceFS to existing data sets in S3, and you cannot use data written through JuiceFS from the S3 API directly. On the other hand, Regatta reads and writes data to S3 using its native format -- so you can do these things!
The hard part is a cache layer with immediate consistency. It likely requires RAFT (or, otherwise, works incorrectly). Integration of this cache layer with S3 (offloading cold data to S3) is easy (not interesting).
It should not be compared to s3fs, mountpoint, geesefs, etc., because they lack consistency and also slow and also don't support full filesystem semantics, and break often.
It could be compared with AWS EFS. Which is also slow (but I didn't try to tune it up to maximum numbers).
For ClickHouse, this system is unneeded because ClickHouse is already distributed (it supports full replication or shared storage + cache), and it does not require full filesystem semantics (it pairs with blob storages nicely).
Also using a translation layer on top of S3 will not save your costs.
I agree with you, Object Storage accels at making the storage interface super simple to use (POSIX is incredibly complex). However, that doesn't change the reality that nearly all software still reads and writes data from a local file system interface.
The specifics of whether or not using a translation layer will save you costs comes down a lot to what you're comparing it to. If you have an EBS volume that's 20% full, then I guarantee you that Regatta's storage costs will be cheaper than EBS, even if you don't ever tier to S3. It's just a cherry on top for workloads which may have unpredictable access patterns and don't want all of their data to be hot when not in use.
S3 semantics are generally fairly terrible for file storage (no atomic move/rename is just one example) but using it as block storage a la ZFS is quite clever.
I've looked for a solution to write many small files fast (safely). Think about cloning thr Linux kernel git repo. Whatever I tested, the NFS protocol was always a bottleneck.
I have mutual friends with some of the Nasuni folks, and I have a lot of respect for what they do. In particular, Nasuni stores data in a proprietary block format in your S3 bucket, so you can't connect it to existing data sets or use that data directly from S3 out the other side. Whereas with Regatta, we store data in its native format in S3 so you can do these things.
We are still working hard on it, hoping that we can help people with different workloads with different tech!
If the caching layer can return success before writing through to s3, it means you built a strongly consistent distributed in memory database.
Or, the consistency guarantee is actually less, or data is partitioned and cannot be quickly shared across clients.
I'm really curious to understand how this was implemented.
How are concurrent updates to the same file handled? Either only one client can open in write at any one time, or you need fencing tokens.
For concurrent updates, the standard practice for remote file systems is to use file locking to coordinate concurrent writes. Otherwise, NFS doesn't have any guarantees about WRITE operation ordering. If you're talking about concurrent writes which occur from NFS and S3 simultaneously, this leads to undefined behavior. We think that this is okay if we do a good job at detecting and alerting the user if this occurs because we don't think that there are applications currently written to do this kind of simultaneous data editing (because Regatta didn't exist yet).
Consistency at the individual file can be guaranteed this way, but I don't think this works across multiple files (as you need a global total order of operations). In any case, this is a pragmatic solution, and I like the tradeoffs. Comparing against NFS rather than Spanner seems the right way to look at it.
S3 bucket systems for cloud hosting services are typically encrypted through AES-256. SSE-S3 or SSE-KMS are available upon request.
[1]: https://aws.amazon.com/blogs/aws/new-amazon-s3-encryption-se...
Having the API hosted on Regatta's servers but integrating a POSIX-compliant bring-your-own compute would tighten up instance storage fees for the end-user.
[1]:https://aws.amazon.com/blogs/aws/new-amazon-s3-encryption-se...
I see you've made some similar decisions to what we did for similar reasons I think - making sure files are stored 1:1 exactly as an object without some proprietary backend scrambling, offering strong consistency and POSIX semantics on the file storage, with eventual consistency between S3 and POSIX interfaces, and targeting high performance. Looks like we differ on the managed service vs traditional download and install model, and the client-first vs server-first approach (though some of our users also run cunoFS on an NFS/SMB gateway server), and caching is a paid feature for us versus an included feature for yours.
Look forward to meeting and seeing you at storage conferences!
Some things that are hidden in the cloud providers cost are redundant networking, redundant internet connection, redundant disks.
Likely still cheaper than the cloud obviously but you will need to stomach down time for that stuff if something breaks.
Re: bring your own compute: It’s certainly something we’re thinking about. We are in discussions with a lot of customers running GPU clusters with orphaned NVMe resources that they would like to install Regatta on. We’d love to get more details on who’s out there looking for this, so please shoot me an email at hleath [at] regattastorage.com
I'm also not sure that its a good architecture to have your servers inbetween my S3. If i'm on one cloud provider, the traffic between their S3 compatible solution and my infrastructure is most of the time in the same cloud provider. And if not, i will for sure have a local cache rcloning the stuff from left to right.
I also don't get your calculator at all.
> If i'm on one cloud provider, the traffic between their S3 compatible solution and my infrastructure is most of the time in the same cloud provider
This is exactly right, and it's why we're working to deploy our infrastructure to every major cloud. We don't want customers paying egress costs or cross-cloud latency to use Regatta.
> I also don't get your calculator at all.
This could probably use a bit more explanation on the website. We're comparing to the usage of local devices. We find that, most often, teams will only use 15% of the EBS volumes that they've purchased (over a monthly time period). This means that instead of paying $0.125/GiB-mo of storage (like io2 offers), they're actually paying $0.833/GiB-mo of actual bytes stored ($0.125/15%). Whereas on Regatta, they're only paying for what they use -- which is a combination of our caching layer ($0.20) and S3 ($0.025). That averages out closer to $0.10/GiB stored, depending on the amount of data that you use.
Btw. while your experience works well for Netflix, in my company (also very big), we have LoBs and while different teams utilize their storage in a different way, none of us are aligned on a level that we would benefit directly from your solution.
From a pure curiosity point of view: Do you have already enough customers which have savings? What are their use cases? The size of their setups?
I ask because last time I checked, S3 wouldn't let you "patch" an object. So you'd have to push the diff as separate objects and then "reconstruct" the original file client-side as different chunks are read, right?
That's correct, and it's something that we can tune if there's a specific need. For AI use cases specifically, we're working on adding functionality to "pre-load" the cache with your data. For example, you would be able to call an API that says "I'm about to start a job and I need this directory on the cache". We would then be able to fan out our infrastructure to download that data very quickly (think hundreds of GiB/s) -- much faster than any individual instance could download the data. Then your job would be able to access the data set at low-latency. Does that sound like it would make sense for you?
> Btw. while your experience works well for Netflix, in my company (also very big), we have LoBs and while different teams utilize their storage in a different way, none of us are aligned on a level that we would benefit directly from your solution.
I'm not totally sure what you mean here. I don't anticipate that a large organization would have to 100% buy-in to Regatta in order to get benefits. In fact, this is the reason why we are so intent on having a serverless product that "scales to 0". That would allow each of your teams to independently try Regatta without needing to spend hundreds of thousands of dollars on something Day 1 for the entire company.
> From a pure curiosity point of view: Do you have already enough customers which have savings? What are their use cases? The size of their setups?
These are pretty intimate details about the business, and I don't think I can share very specific data. However, yes -- we do have customers who are realizing massive savings (50%+) over their existing set ups.
The good news is that you, personally, don't have to spend the time to create the Jepsen test harness, you can pay them to run the test but I have no idea what kind of O($) we're talking here. Still, it could be worth it to inspire confidence, and is almost an imperative if you're going to (ahem) roll your own protocol for network file access :-/
That said, I feel like writeback caching is a bit ... risky? That is, you aren't treating the object store as the source of truth. If your caching layer goes down after a write is ack'ed but before it's "replicated" to S3, people lose their data, right?
I think you'll end up wanting to offer customers the ability to do strongly-consistent writes (and cache invalidation). You'll also likely end up wanting to add operator control for "oh and don't cache these, just pass through to the backing store" (e.g., some final output that isn't intended to get reused anytime soon).
Finally, don't sleep on NFSv4.1! It ticks a bunch of compliance boxes for various industries, and then they will pay you :). Supporting FUSE is great for folks who can do it, but you'd want them to start by just pointing their NFS client at you, then "upgrading" to FUSE for better performance.
This is exactly why we're building our caching layer to be highly-durable, like S3 itself. We will make sure that the data in the cache is safe, even if servers go down. This is what gives us the confidence to respond to the client before the data is in S3. The big difference between the data living in our cache and the data living in S3 is cost and performance, not necessarily durability.
> I think you'll end up wanting to offer customers the ability to do strongly-consistent writes (and cache invalidation). You'll also likely end up wanting to add operator control for "oh and don't cache these, just pass through to the backing store" (e.g., some final output that isn't intended to get reused anytime soon).
I think this is exactly right. I think that storage systems are too often too hands off about the data (oh, give us the bytes and we will store them for you). I believe that there are gains to be had by asking the users to tell you more about what they're doing. If you have a directory which is only used to read files and a directory which is only used to write files, then you probably want to have different cache strategies for those directories? I believe we can deliver this with good enough UX for most people to use.
> Finally, don't sleep on NFSv4.1! It ticks a bunch of compliance boxes for various industries, and then they will pay you :). Supporting FUSE is great for folks who can do it, but you'd want them to start by just pointing their NFS client at you, then "upgrading" to FUSE for better performance.
I certainly don't, and this is why we are supporting NFSv3 right now. That's not going away any time soon. We want to offer something that's highly compatible with the industry at large today (NFS-based, we can talk specifics about whether or not that should be v3 or v4) and then something that is high-performance for the early adopters who can use something like FUSE. I think that both things are required to get the breadth of customers that we're looking for.
I don't at all disagree that it's a hard problem! That's part of what makes it so fun to work on.
Any plans to expand to other stores, like R2 (I ask since unlike S3, R2 egress is free)?
Given Hunter worked at AWS, I bet they are way too familiar with IAD.
We support all S3-compatible storage services today, including R2, GCS, and MinIO.
It takes somtimes years to fill it up with photos, vidoes and other documents. Sounds like one could build a great killer low amortized – pay as you fill it up – service for people to compete with dropbox.
On the other hand, something like Dropbox is actually a program running on your laptop that simulates a file system, and then does the synchronization at the file level as needed. I think that there's probably some latent demand for a similar product for developers to access their S3 buckets easily from their laptops, and it's something we might look into as we get farther along.
I'll be looking closely in what you're building!
[1] https://www.dell.com/en-us/blog/welcoming-spanning-maginatic...
[2] https://www.slideshare.net/slideshow/maginatics-sdcdwl/39257...
I am the same, as distributed consensus is notoriously hard especially when it fronts distributed storage.
However, it is not imposssible.. Hunter and I were both in the EFS team at AWS (I am still there), and he was deeply involved in all aspects of our consensus and replication layers. So if anyone can do it, Hunter is!
The deeper reason for that is, that the consistency guarantees from NFS (close-to-open consistency) are a lot weaker than what you get from POSIX.
Using NFS and being able to use an existing bucket is a nice way to make it easy to get started and try things out. For applications that need full consistency between the S3 and the filesystem view, you can even provide an S3 proxy endpoint on your durable cache that removes any synchronization delays.
Responding here to say that I’d love to hear more about your comparison to FlexFS. In fact, I’d love to see a few of them: FlexFS, MountPoint, etc
Lastly, I couldn’t get the privacy policy to load on your site (I’m on mobile if that helps)
btw, thanks a bunch for answering my Q & everyone else's too (except for parts where you couldn't talk about the implementation, understandably so). Appreciate it. Wishing the best.
[1] https://www.postgresql.org/docs/current/creating-cluster.htm...
All of that costs money and time. You're probably better off using cloud hosting and focusing on your unique offering than having that expertise and coordination in house.
1. fsync, fdatasync - synchronize a file's in-core state with
storage device