* In terms of a high-performance AI-focused S3 competitor, how does this compare to NVIDIA's AIstore? https://aistore.nvidia.com/
* What's the clustering story? Is it complex like Ceph, does it require K8s like AIstore for full functionality, or is it more flexible like Garage, MinIO, etc.?
* You spend a lot of time talking about performance; do you have any benchmarks?
* Obviously most of the page was written by ChatGPT: what percentage of the code was written by AI, and has it been reviewed by a human?
* How does the object storage itself work? How is it architected? Do you use a DHT, for example? What tradeoffs are there (CAP, for example) vs the 1.4 gazillion alternatives?
* Are there any front-end or admin tools (and screenshots)?
* Can a cluster scale horizontally, or only vertically (i.e., like MinIO)?
* Why not instead just fork a previous version of Minio and then put a high-speed metadata layer on top?
* Is there any telemetry?
* Although it doesn't matter as much for my use case as for others, what is the specific jurisdiction of origin?
* Is there a CLA and does that CLA involve assigning rights like copyright (helps prevent the 'rug-pull' closing-source scenario)?
* Is there a non-profit Foundation, goal for CNCF sponsorship or other trusted third-party to ensure that the software remains open source (although forks of prior versions mostly mitigates that concern)?
Thanks!
How does that compare to something like JuiceFS?
And in "Why Not Just Use a Filesystem?", the answer they gave is "the line is already blurring" and "industry is converging".
The line may be blurring, but as mentioned there is still a clear-cut use case for filesystems; or, if higher access speed is warranted, just add more RAM to the system and cache the data. It will still cost less, even at current RAM prices.
Why not use any of the great KV stores out there? Or a traditional database even.
People use object storage for the low cost, not because it is a convenient abstraction. I suspect some people use the faster expensive S3 simply as a stopgap. Because they started with object storage, the requirements changed, it is no longer the right tool for the job but it is a hassle to switch, and AWS is taking advantage of their situation. I suppose that offering an alternative to those people for a non-extortionate price is a decent business model, but I am not sure how big that market is or how long it will last. And it's not really a question of better tech, I'm sure AWS could make it a lot cheaper if they wanted to.
But object storage at the price of a database with the performance of a database, is just a database, and I doubt that quickly reinventing that wheel yielded anything too competitive.
PS: there are actually faster and more secure options than io_uring, but I won't spoil ;)
The blog argues that AI workloads are bottlenecked by latency because of 'millions of small files.' But if you are training on millions of loose 4KB objects directly from network storage, your data pipeline is the problem, not the storage layer.
Data Formats: Standard practice is to use formats like WebDataset, Parquet, or TFRecord to chunk small files into large, sequential blobs. This negates the need for high-IOPS metadata operations and makes standard S3 throughput the only metric that matters (which is already plentiful).
Caching: Most high-performance training jobs hydrate local NVMe scratch space on the GPU nodes. S3 is just the cold source of truth. We don't need sub-millisecond access to the source of truth, we need it at the edge (local disk/RAM), which is handled by the data loader pre-fetching.
It seems like they are building a complex distributed system to solve a problem that is better solved by tar -cvf.
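The sharding approach described above can be sketched in a few lines. This is an illustrative example using Python's stdlib tarfile (the sample names and contents are made up, and no particular framework's shard format is assumed):

```python
import io
import tarfile

# Hypothetical example: pack many small samples into one tar "shard"
# (the convention WebDataset and similar tools follow), then stream it
# back sequentially.
samples = {f"sample_{i:06d}.txt": f"record {i}".encode() for i in range(1000)}

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in samples.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Reading the shard back is one large sequential scan: a single GET
# against object storage instead of 1000 small requests.
buf.seek(0)
count = 0
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar:
        data = tar.extractfile(member).read()
        count += 1
print(count)  # 1000
```

The point is that the metadata cost (one tar header per sample) is amortized into a single sequential stream, so per-object IOPS stop mattering.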
I'm curious about one aspect though. The price comparison says storage is "included," but that hides the fact that you only have 2TB on the suggested instance type, bringing the storage cost to $180/TB/mo if you pay each year up-front for savings, $540/TB/mo when you consider that the durability solution is vanilla replication.
I know that's "double counting" or whatever, but the read/write workloads being suggested here are strange to me. If you only have 1875GB of data (achieved with 3 of those instances because of replication) and sustain 10k small-object (4KiB) QPS as per the other part of the cost comparison, you're describing a world where you read and/or write 50x your entire storage capacity every month.
I know there can be hot vs cold objects or workloads where most data is transient, but even then that still feels like a lot higher access amplification than I would expect from most workloads (or have ever observed in any job I'm allowed to write about publicly). With that in mind, the storage costs themselves actually dominate, and you're at the mercy of AWS not providing any solution even as cheap as 6x the cost of a 2-year amortized SSD (and only S3 comes close -- it's worse when you rent actual "disks," doubly so when they're high-performance).
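The access-amplification figure above is easy to verify. A back-of-envelope check, assuming 4 KiB objects, a sustained 10k QPS, a 30-day month, and the 1875 GB capacity mentioned earlier:

```python
# Back-of-envelope check of the ~50x-per-month access amplification claim.
qps = 10_000
object_size = 4 * 1024            # 4 KiB objects, in bytes
seconds_per_month = 30 * 86_400   # 30-day month

monthly_bytes = qps * object_size * seconds_per_month
capacity_bytes = 1875 * 10**9     # 1875 GB of stored data

amplification = monthly_bytes / capacity_bytes
print(round(amplification))       # ~57: each month touches ~50-60x the stored data
```

So sustaining that benchmark workload against that capacity really does mean reading and/or writing the entire data set more than fifty times a month.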
I wonder if that's why it's all over the place. Meta engine written in Zig, okay, do I need to care? Gateway in Rust... probably a smart choice, but why do I need to be able to pick between web frameworks?
> Most object stores use LSM-trees (good for writes, variable read latency) or B+ trees (predictable reads, write amplification). We chose a radix tree because it naturally mirrors a filesystem hierarchy
Okay, so are radix trees good for writes and reads, bad for both, or somewhere in between?
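For intuition only, here is a toy path trie (a plain trie, not a compressed or adaptive radix tree like the Fractal ART the post names, whose internals aren't described in the thread) showing why keying on full path components makes a directory listing a walk under a single node:

```python
# Toy sketch, NOT FractalBits' actual engine: a trie keyed on path
# components. The hierarchy falls out of the key structure itself,
# so listing a "directory" is a lookup plus a scan of one node's children.
class Node:
    def __init__(self):
        self.children = {}
        self.value = None  # blob pointer for leaf objects

class PathTrie:
    def __init__(self):
        self.root = Node()

    def put(self, path, value):
        node = self.root
        for part in path.strip("/").split("/"):
            node = node.children.setdefault(part, Node())
        node.value = value

    def list_dir(self, path):
        node = self.root
        for part in path.strip("/").split("/"):
            node = node.children.get(part)
            if node is None:
                return []
        return sorted(node.children)

t = PathTrie()
t.put("/data/train/shard-0001.tar", "blob-a")
t.put("/data/train/shard-0002.tar", "blob-b")
t.put("/data/val/shard-0001.tar", "blob-c")
print(t.list_dir("/data/train"))  # ['shard-0001.tar', 'shard-0002.tar']
```

Roughly, lookups and ordered prefix scans cost O(key length) with no rebalancing, which puts tries/radix trees somewhere between LSM-trees and B+ trees rather than clearly beating either on both axes.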
What is "physiological logging"?
Is that not accurate?
I could only find references to this in database systems course notes, which may indicate something.
Every generation seems to have to learn the lesson about batching small inputs together to keep throughput up.
That doesn't work on Parquet or anything compressed. In real-time analytics you want to load small files quickly into a central location where they can be both queried and compacted (different workloads) at the same time. This is hard to do in existing table formats like Iceberg. Granted not everyone shares this requirement but it's increasingly important for a wide range of use cases like log management.
The tar -cvf analogy is a good one though: are you working with a virtual tape drive or a virtual SSD?
We eliminated MinIO on vSAN in favor of ObjectScale for on-prem.
A lot of the high performance S3 alternatives trumpet crazy IOPS numbers, but the devil is in how they handle metadata and consistency. FractalBits says it offers strong consistency and atomic rename ([Why We Built Another Object Storage (And Why It's Different)](https://fractalbits.com/blog/why-we-built-another-object-sto...)), which makes it different from most eventual consistency S3 clones. That implies a full‑path indexing metadata engine (something they mention in a LinkedIn post). That’s a really interesting direction because it potentially avoids some of the inode bottlenecks you see in Ceph and MinIO.
BUT the real question for me is long‑term sustainability. Running your own object store is a commitment. Who's maintaining it when the original team moves on? It's great to see new entrants with ideas, ALSO it would be reassuring if there were clear governance and a non‑profit steward at some point.
I don't mind if something uses AI to draft marketing copy... as long as the code is readable, reviewed, and licensed in a way that keeps it open. The space is crowded, and differentiation often comes down to the less flashy stuff: operational tooling, monitoring, easy deployment across zones, and how it fails. I'm curious to see where this one goes.
There are solutions for this, but the added complexity is significant. In any case, your training code and data storage become tightly coupled. If a faster storage solution lets you avoid that, I for one would highly appreciate it.
I’ve spent a bunch of time analyzing IBM’s publicly released Cloud Object Storage traces. Median object size is about 16K, mean is a megabyte or two. A decent number of tenants have mean object sizes under 100K.
People use object storage for a bunch of reasons. In general you’re better off supporting what your users are doing than demanding that they rewrite their applications because you think they’re doing it all wrong.
In this case it’s more competition. Good for us the consumer.
If you are confident in your work, you should not open your source, because that's the single piece of leverage you have.
We do this with tiered storage over S3 using HopsFS, which has an HDFS API with a FUSE client, so training can just read data (from a HopsFS datanode's NVMe cache) as if it were local, but it is pulled from NVMe disks over the network. In contrast, writes go straight to S3 via HopsFS's write-through NVMe cache.
Their internals (in Zig - the actually interesting part) are proprietary.
What's open source is the client side, for accessing it from Rust.
Did I get it right?
1. Small Objects at Scale
2. Latency Sensitivity
3. The Need for Directories
I’m skeptical of the last one. They talk about rename performance as being the issue.
I think what they mean is that if you use the path as the object key, then renaming a directory in the middle of a path requires renaming every object key that uses it.
But to me that is just a poor usage of an object store. You should never “rename” object keys.
Consider how git does it. If you rename a directory and diff it, the underlying object store didn’t rename any key. In fact all the files in the object stores are unchanged. Only the tree file changed, which maps paths to file hashes.
While renames would get faster that way, it would increase latency to do a path to object key look up.
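The git-style indirection described above can be sketched as follows; the helpers and sample paths are illustrative, not git's actual on-disk format:

```python
import hashlib

# Toy sketch of git-style indirection: immutable blobs keyed by content
# hash, plus a small "tree" mapping paths to hashes. Renaming a directory
# rewrites tree entries only; no blob key changes.
blobs = {}

def put_blob(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    blobs[key] = data
    return key

tree = {
    "train/a.txt": put_blob(b"sample a"),
    "train/b.txt": put_blob(b"sample b"),
}

# "Rename" train/ -> archive/: only the path->hash mapping changes.
renamed = {path.replace("train/", "archive/", 1): h for path, h in tree.items()}

assert set(renamed.values()) == set(tree.values())  # blob keys untouched
print(sorted(renamed))  # ['archive/a.txt', 'archive/b.txt']
```

The tradeoff is exactly the one noted above: every read now pays an extra path-to-hash lookup before it can fetch the blob.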
I would like to see how fundamental the requirement for directories is to AI workflows. I suspect it's a human "but I'm used to it" requirement.
In my experience, it's not that directories are inherently important, it's that an organization mechanism is, in the service of a few key problems:
1. Privacy and data handling requirements
2. Versioning
3. Partitioning
4. Probably some others I'm forgetting
Hierarchical storage is a useful all-purpose tool for these things.
With S3 you just do an http request and you’re done.
A lot of folks get hung up on the theoretical equivalence of things, and forget that their favorite solution may be flat out unworkable in practice for reasons that have nothing to do with the theoretical features they’re talking about.
> NVIDIA's AIstore
AFAIK, AIstore is designed as a caching/proxy layer, but FractalBits is a ground-up object storage implementation with a custom metadata engine (Fractal ART - an on-disk adaptive radix tree). AIstore also seems to focus more on the training data pipeline (prefetch, shuffle, ETL) while we focus on storage layer performance itself. One thing I am aware of is the (folder) rename operation, which would be a little tricky for a caching/proxy layer. I'd like to do more research on a detailed comparison and update our website. Thanks for mentioning this.
> What's the clustering story? Is it complex like ceph, requires K8s like AIstore for full functionality, or is it more flexible like Garage, Minio, etc?
You can check our arch doc [1] for the clustering. Right now we are focusing on the cloud (AWS) environment, and you can simply type `just deploy create-vpc` for deployment, which calls CDK underneath to set up all the clustering. You can also run lower-level CDK commands for customization. We are also working on a K8s deployment for on-prem environments.
> You spend a lot of time talking about performance; do you have any benchmarks?
Yes, published in the README[2] with a reproducible setup:
- GET: 982K IOPS, 3.8 GiB/s throughput, 2.9ms avg latency, 5.3ms p99
- PUT: 248K IOPS, 970 MiB/s throughput, 6.6ms avg latency

Configuration: 4KB objects, 14x c8g.xlarge (API), 6x i8g.2xlarge (BSS), 1x m7gd.4xlarge (NSS), 3-way data replication. Cost: ~$8/hour on-demand. You can reproduce this yourself with `just deploy create-vpc --template perf_demo --with-bench` and run the included benchmark scripts.
> Obviously most of the page was written by ChatGPT: what percentage of the code was written by AI, and has it been reviewed by a human?
Yeah, some blog content was generated by Gemini. For code, I hadn't used AI until this September, and you can verify that from the git commit history. The core Fractal ART engine and the io_uring integration are hand-crafted for performance.
I am also learning to work with AI (spec-driven). One thing I've noticed is that AI made performance-related experiments much more efficient than before (framework axum->actix-web, io_uring-based RPC, Rust arena allocator, etc.). I use my nvim editor with a customized setup to review every line of code written by AI.
> How does the object storage itself work? How is it architected? Do you DHT, for example? What tradeoffs are there (CAP, for example) vs the 1.4 gazillion alternatives?
Architecture[1]: Not DHT-based. We use a centralized metadata service (NSS) with a custom on-disk Adaptive Radix Tree called Fractal ART. Key design choices:
- Full-path naming: avoids distributed transactions for directory operations
- Quorum replication: N/R/W configurable (default 3-way data, 6-way metadata)
- Two-tier storage: <1MB objects on local NVMe, larger objects tier to an S3 backend
CAP tradeoffs: We prioritize CP (consistency + partition tolerance), with strong consistency via quorum writes. HA for the NSS is under testing.
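The condition behind that quorum-based consistency is the standard overlap rule. A minimal sketch of the generic math (not FractalBits-specific beyond the stated 3-way replication; the example N/R/W values are assumptions):

```python
# Generic quorum overlap rule: with N replicas, writing to W of them and
# reading from R of them guarantees every read intersects the latest
# write's replica set whenever R + W > N.
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    return r + w > n

print(quorum_overlaps(3, 2, 2))  # True: read and write sets must share a replica
print(quorum_overlaps(3, 1, 1))  # False: a read can land on a stale replica
```

This is why configurable N/R/W lets operators trade latency against consistency: shrinking R or W speeds up requests, but only combinations with R + W > N keep reads strongly consistent.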
> Are there any front-end or admin tools (and screenshots)?

Admin UI: Yes, we have a web-based UI (React/Ant Design) for bucket browsing, object management, and access key configuration. Currently basic but functional. Screenshots aren't published yet, but the UI is bundled with deployments. CLI tooling via the standard AWS CLI works out of the box since we're S3-compatible.
> Can a cluster scale horizontally or only vertically (ie Minio)
Scaling: Horizontal for data nodes (BSS) - add more instances and data distributes across them via quorum over data volumes and metadata volumes. API servers are stateless and horizontally scalable. NSS (metadata) is currently single-node but designed for future horizontal sharding (split/fractal); on the roadmap: "support hundreds of NSS and BSS instances". This differs from MinIO's per-server erasure sets - we have proper quorum-based distribution.
> Why not instead just fork a previous version of Minio and then put a high-speed metadata layer on top?
Two reasons:
1. License: MinIO switched to AGPL, then to a more restrictive license. Apache 2.0 was important to us.
2. Architecture: We have a fundamentally different architecture to solve MinIO's inherent scalability issue [3]. Bolting that onto MinIO would be a rewrite anyway.
> Is there any telemetry?
Yes, but it's not fully polished (the CloudWatch setup is currently commented out). We use Rust crates for tracing & metrics. We're working on distributed tracing capabilities and have embedded a trace_id in all our RPC headers. Once it stabilizes, we'll update the docs.
> Although it doesn't matter as much for my use case as for others, what is the specific jurisdiction of origin?
The company (FractalBits Labs) is incorporated in the United States. The BYOC model means your data stays in your cloud account in your chosen AWS region - we never see your data.
> Is there a CLA and does that CLA involve assigning rights like copyright (helps prevent the 'rug-pull' closing-source scenario)?
CLA: Currently no CLA - contributions are under Apache 2.0 via the standard "inbound = outbound" model. We don't require copyright assignment. This makes a rug-pull harder since we can't relicense without contributor consent. Happy to discuss a more formal CLA if the community prefers one.
> Is there a non-profit Foundation, goal for CNCF sponsorship or other trusted third-party to ensure that the software remains open source (although forks of prior versions mostly mitigates that concern)?
Foundation/CNCF: Not yet, but it's something we're open to as the project matures (Apache/CNCF/LF AI & Data Foundation). For now, the Apache 2.0 license provides baseline protection - any prior version remains open source regardless of future decisions. We'd welcome community input on governance structure as adoption grows. On the other hand, I (with a partner) have been working on this full-time for a whole year, and would also welcome conversations with anyone interested in supporting the project's growth.
[1] https://github.com/fractalbits-labs/fractalbits-main/blob/ma... [2] https://github.com/fractalbits-labs/fractalbits-main/blob/ma... [3] https://github.com/minio/minio/issues/7986
- AI training is a bandwidth problem, not a latency problem. GPUs need to be fed at 10GB/s+. Making millions of small HTTP requests introduces massive overhead (headers, SSL handshakes, TTFB) that kills bandwidth. Even if the storage engine has 0ms latency, the network stack does not.
- If you truly need "arbitrary subsetting" without downloading a whole tarball, formats like Parquet or indexed TFRecords allow HTTP Range Requests. You can fetch specific byte ranges from a large blob without "coupling" the storage layout significantly.
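The range-request pattern can be sketched like this, simulated locally against an in-memory blob since the point is the (offset, length) index rather than any particular HTTP client; against S3 the fetch would be a GetObject call with a `Range: bytes=...` header:

```python
# Sketch of range-based access: one large blob plus a tiny index of
# (offset, length) per record, so a single record costs a single
# HTTP Range request instead of one object per record.
records = [b"first record", b"second record", b"third record"]

blob, index, offset = b"", [], 0
for r in records:
    index.append((offset, len(r)))
    blob += r
    offset += len(r)

def fetch_range(blob: bytes, offset: int, length: int) -> bytes:
    # Stand-in for: GET /shard.bin with "Range: bytes={offset}-{offset+length-1}"
    return blob[offset : offset + length]

off, ln = index[1]
print(fetch_range(blob, off, ln))  # b'second record'
```

Columnar formats like Parquet effectively ship this index inside the file (the footer), which is what makes arbitrary subsetting over plain S3 workable.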
> are you working with a virtual tape drive or a virtual SSD.
Treating a networked object store like a local SSD ignores the Fallacies of Distributed Computing. You cannot engineer away the speed of light or the TCP stack.
> AI training is a bandwidth problem, not a latency problem. GPUs need to be fed at 10GB/s+. Making millions of small HTTP requests introduces massive overhead (headers, SSL handshakes, TTFB) that kills bandwidth. Even if the storage engine has 0ms latency, the network stack does not.
Agree that throughput is more of an issue than latency, as you can queue data to CPU memory. Small object throughput is definitely an issue though, which is what I was talking about. Also, there's no need to use HTTP for your requests, so HTTP or TLS overheads are more of self-induced problems of the storage system itself.
> You can fetch specific byte ranges from a large blob without "coupling" the storage layout significantly.
This has the exact same throughput problems as small objects, though.
(By the way: NVIDIA AIstore is NOT a proxying/caching engine, although it can act as one, which is somewhat unique among these types of stores. AIstore is actually a full S3 engine in its own right, and it's extremely capable, although live cluster resizing and ETL require k8s :( )
If the storage is farther away, then you'll go slower, of course. But since the article is comparing against EFS and S3 Express, I think it's fitting to talk about a nearby scenario. And the point of the article was that S3 Express was more problematic for cost reasons than for small-object performance.
Yes, it _can_ be configured to act as a cache in front of any of the four supported Clouds but (a) we've never done so at NVIDIA, and (b) its primary function always was - and remains - reliable, resilient storage for AI/ML workloads.
For details: