621 points sebg | 17 comments
1. huntaub ◴[] No.43717158[source]
I think the author is spot on. There are a few dimensions along which you should evaluate these systems: theoretical limits, efficiency, and practical limits.

From a theoretical point of view, as others have pointed out, parallel distributed file systems have existed for years -- most notably Lustre. These file systems should be capable of scaling their storage and throughput out to, effectively, infinity -- if you add enough nodes.

Then you start to ask: how much storage and throughput can I get from a node that has X TiB of disk? That's where you start to evaluate efficiency. I ran some calculations (against FSx for Lustre, since I'm an AWS guy), and it appears that you can run 3FS in AWS for about 12-30% cheaper than FSxL, depending on the replication factor you choose (which is good, but not great considering that you're now managing the cluster yourself).
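
If you want to sanity-check that kind of comparison yourself, the arithmetic is roughly the sketch below. Every price and instance spec in it is a placeholder I made up for illustration -- not real AWS pricing and not the inputs behind the 12-30% figure -- so plug in current numbers for your region and the replication factor you'd actually run.

    # Back-of-envelope comparison of self-managed storage on NVMe instances vs. a
    # managed file system. Every price and spec below is a made-up placeholder,
    # NOT current AWS pricing -- substitute real numbers for your region.

    HOURS_PER_MONTH = 730

    def self_managed_usd_per_tib_month(instance_usd_per_hour, raw_tib_per_node,
                                       replication_factor):
        """Effective $/TiB-month of *usable* capacity for a replicated cluster."""
        usable_tib = raw_tib_per_node / replication_factor
        return instance_usd_per_hour * HOURS_PER_MONTH / usable_tib

    def managed_usd_per_tib_month(usd_per_gib_month):
        return usd_per_gib_month * 1024

    if __name__ == "__main__":
        managed = managed_usd_per_tib_month(0.145)  # hypothetical managed price
        for rf in (2, 3):
            # hypothetical storage-optimized node: $1.50/hr with 15 TiB of raw NVMe
            diy = self_managed_usd_per_tib_month(1.50, 15.0, rf)
            print(f"replication x{rf}: self-managed ${diy:.0f}/TiB-mo "
                  f"vs managed ${managed:.0f}/TiB-mo ({diy / managed:.0%})")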

Then the third thing you ask is anecdotal: are people actually able to configure these file systems at the deployment size I want? (This is where you hear things like "oh, it's hard to get Ceph to 1 TiB/s.") That remains to be seen for something like 3FS.

Ultimately, I obviously believe that storage and data are really important keys to how these AI companies operate -- so it makes sense that DeepSeek would build something like this in-house to get the properties that they're looking for. My hope is that we, at Archil, can find a better set of defaults that work for most people without needing to manage a giant cluster or even worry about how things are replicated.

replies(2): >>43717307 #>>43726407 #
2. jamesblonde ◴[] No.43717307[source]
Maybe AWS could start by making fast NVMes available - without requiring multi-TB disks just to get 1 GB/s. The 3FS experiments were run on 14 GB/s NVMe disks - an order of magnitude more throughput than anything available in AWS today.

SSDs Have Become Ridiculously Fast, Except in the Cloud: https://news.ycombinator.com/item?id=39443679

replies(2): >>43719482 #>>43720293 #
3. kridsdale1 ◴[] No.43719482[source]
On my home LAN, connected with 10 Gbps fiber between a MacBook Pro and a server 10 feet away, I get about 1.5 Gbps versus the non-network speed of the disks of ~50 Gbps (bits, not bytes).

I traced this to the macOS SMB implementation really sucking. I set up an NFS driver and it got about twice as fast, but it's annoying to mount and use, and still far from the disks' capabilities.

I’ve mostly resorted to abandoning the network (after large expense) and using Thunderbolt and physical transport of the drives.

replies(2): >>43719740 #>>43720026 #
4. greenavocado ◴[] No.43719740{3}[source]
Is NFS out of the question?
replies(1): >>43721201 #
5. dundarious ◴[] No.43720026{3}[source]
SMB/CIFS is an incredibly chatty, synchronous protocol. There are (or were) massive products built around mitigating and working around this when trying to use it over high-latency satellite links (the US military did/does this).
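
To make "chatty" concrete: when each small operation is serialized behind its own round trip, total time is roughly ops x RTT and bandwidth barely matters. A toy model (the numbers are illustrative, not measurements of SMB):

    # Toy latency model of a chatty, synchronous protocol: when every small
    # operation waits for its own round trip, total time is ops * RTT and the
    # link bandwidth barely matters. All numbers are illustrative.

    def chatty_seconds(num_ops, rtt_s):
        return num_ops * rtt_s                    # one serialized round trip per op

    def pipelined_seconds(num_ops, rtt_s, in_flight):
        return (num_ops / in_flight) * rtt_s      # up to `in_flight` requests outstanding

    if __name__ == "__main__":
        ops = 10_000                              # e.g. stat/open calls while walking a tree
        for label, rtt in (("LAN, 0.5 ms RTT", 0.0005), ("satellite, 600 ms RTT", 0.600)):
            print(f"{label}: chatty {chatty_seconds(ops, rtt):8.1f} s, "
                  f"pipelined x64 {pipelined_seconds(ops, rtt, 64):8.1f} s")
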
6. __turbobrew__ ◴[] No.43720293[source]
There are i4i instances in AWS which can get you a lot of IOPS with a smaller disk.
replies(2): >>43725223 #>>43725914 #
7. kridsdale1 ◴[] No.43721201{4}[source]
I have set it up but it’s not easy to get drivers working on a Mac.
replies(1): >>43724823 #
8. insaneirish ◴[] No.43724823{5}[source]
What particular drivers are you referring to? NFS is natively supported in macOS...
replies(1): >>43725922 #
9. ashu1461 ◴[] No.43725223{3}[source]
Are these attached directly to your server or hosted separately?
replies(1): >>43727986 #
10. jamesblonde ◴[] No.43725914{3}[source]
Had a look - baseline disk throughput is 78.12 MB/s. Max throughput (30 mins/day) is 1250 MB/s.

The NVMe I bought for 150 dollars with 4 TB of capacity gives me 6000 MB/s sustained.

https://docs.aws.amazon.com/ec2/latest/instancetypes/so.html

replies(2): >>43728016 #>>43729920 #
11. olavgg ◴[] No.43725922{6}[source]
That is true, though the implementation is weird.

I mount my NFS shares like this:

  sudo mount -t nfs -o nolocks -o resvport 192.168.1.1:/tank/data /mnt/data

-o nolocks: disables file locking on the mounted share. Useful if the NFS server or client does not support locking, or if there are issues with lock daemons. On macOS this is often necessary because lockd can be flaky.

-o resvport: tells the NFS client to use a reserved port (<1024) for the connection. Some NFS servers (like some Linux configurations or *BSDs with stricter security) only accept requests from clients using reserved ports (for authentication purposes).

12. KaiserPro ◴[] No.43726407[source]
The other important thing to ask is: what is that filesystem designed to be used for?

For example, 3FS looks like it's optimised for read throughput (which makes sense -- like most training workloads, it's read heavy), while write operations look very heavy.

Can you scale the metadata server? What is the cost of metadata operations? Is there a throttling mechanism to stop a single client from sucking up all of the metadata server's IO? Does it support locking? Is it a COW filesystem?
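
None of those answers are spelled out here, but on the throttling question, the generic shape of a fix is a per-client token bucket in front of the metadata service. A minimal sketch of that idea (not 3FS's actual mechanism):

    # Generic per-client token bucket in front of a metadata service, so one
    # client can't monopolize metadata IO. A sketch of the idea only -- not a
    # description of how 3FS actually throttles clients.
    import time
    from collections import defaultdict

    class TokenBucket:
        def __init__(self, rate_per_s, burst):
            self.rate = rate_per_s        # sustained metadata ops/sec allowed
            self.capacity = burst         # short-term burst allowance
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self, cost=1.0):
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False                  # caller queues or rejects the op

    # one bucket per client; handlers check it before touching the metadata store
    buckets = defaultdict(lambda: TokenBucket(rate_per_s=2000, burst=500))

    def handle_metadata_op(client_id):
        return buckets[client_id].allow()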

13. huntaub ◴[] No.43727986{4}[source]
i-series instances have direct-attached drives
14. sgarland ◴[] No.43728016{4}[source]
That's on the smallest instance. I'm sure there's a reason they offer it, but I can't think of why. On the largest instance (which IME is what people use with these), it's 5000 MB/s. The newer i7ie instances max out at 7500 MB/s.
15. __turbobrew__ ◴[] No.43729920{4}[source]
You are incorrect; the numbers you quoted are EBS volume performance. iX instances have directly attached NVMe volumes, which are separate from EBS.

> NVMe i bought for 150 dollars

Sure, now cost out the rest of the server, the racks, the colocation space for the racks, power, multi-AZ redundancy, a Clos network fabric, network peering, spare hardware for failures, off-site backups, supply chain management, a team of engineers to design the system, a team of staff to physically rack and unrack hardware, a team of engineers to manage the network, and on-call rotations for all of those teams.

Sure, the NVMe is just $150, bro.
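
A back-of-envelope version of that argument, for anyone who wants to plug in their own numbers -- every figure below is invented purely for illustration:

    # Toy TCO comparison: the bare drive vs. everything it takes to serve that
    # drive reliably. Every figure is invented purely for illustration.
    monthly_costs_usd = {
        "nvme drive (amortized over 36 months)": 150 / 36,
        "server + chassis (amortized)":          250,
        "colocation space + power":              200,
        "network fabric + peering (share)":      150,
        "spares + off-site backups":             100,
        "engineering + on-call (share)":         800,
    }

    total = sum(monthly_costs_usd.values())
    for item, usd in monthly_costs_usd.items():
        print(f"{item:40s} ${usd:8.2f}/mo")
    print(f"{'total':40s} ${total:8.2f}/mo (vs ${150 / 36:.2f}/mo for the bare drive)")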

replies(1): >>43741472 #
16. jamesblonde ◴[] No.43741472{5}[source]
You claim I am incorrect, but you don't provide a reference or numbers, and I couldn't find any.
replies(1): >>43741691 #
17. __turbobrew__ ◴[] No.43741691{6}[source]
AWS doesn't provide throughput numbers for the NVMe drives on iX instances; you have to look at benchmarks or test it yourself. It's similar to the packets-per-second limits, which aren't published either and can only be inferred through benchmarks.
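
If you do test it yourself, something like the sketch below gives a rough single-threaded floor for sequential read throughput (fio with more jobs and higher queue depth will report more; the device path and block size are assumptions to adjust):

    # Rough sequential-read throughput check for a local NVMe device or a large
    # file on it, bypassing the page cache with O_DIRECT (Linux only). Run as
    # root against e.g. /dev/nvme1n1 (device name varies by instance). Single
    # thread, queue depth 1 -- treat the result as a floor, not a maximum.
    import mmap
    import os
    import sys
    import time

    def seq_read_mb_per_s(path, seconds=10.0, block_mib=4):
        block = block_mib * 1024 * 1024
        buf = mmap.mmap(-1, block)                 # page-aligned, as O_DIRECT requires
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
        total, offset = 0, 0
        deadline = time.monotonic() + seconds
        try:
            while time.monotonic() < deadline:
                n = os.preadv(fd, [buf], offset)
                if n < block:                      # end of device/file: wrap around
                    offset = 0
                    continue
                total += n
                offset += n
        finally:
            os.close(fd)
        return total / seconds / 1e6

    if __name__ == "__main__":
        print(f"{seq_read_mb_per_s(sys.argv[1]):.0f} MB/s")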