
621 points | sebg | 1 comment
londons_explore:
This seems like a pretty complex setup with lots of features which aren't obviously important for a deep learning workload.

Presumably the key requirements are petabytes of storage, read/write parallelism (achievable by splitting a 1 PB file into, say, 10,000 shards of 100 GB each and having each client read only the shards it needs), and redundancy.
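
For illustration, a minimal Python sketch of that sharded-read idea, assuming a simple round-robin assignment of shards to training ranks (the shard paths, counts, and sizes here are my own made-up examples, not something from the post):

    # Hypothetical layout: ~1 PB split into 10,000 shards of 100 GB each.
    SHARD_SIZE = 100 * 10**9      # 100 GB per shard (illustrative)
    NUM_SHARDS = 10_000           # ~1 PB total

    def shards_for_rank(rank: int, world_size: int) -> list[str]:
        """Round-robin the shard files across training ranks so each
        client only ever opens the shards assigned to it."""
        return [
            f"dataset/shard-{i:05d}.bin"   # hypothetical path scheme
            for i in range(NUM_SHARDS)
            if i % world_size == rank
        ]

    # e.g. rank 3 of a 512-worker job reads ~20 shards (~2 TB) and nothing else
    print(shards_for_rank(3, 512)[:3])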

Consistency is hard to achieve and seems to have no use here: your programmers can make sure different processes write to different filenames.
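
As a sketch of that convention (the checkpoint paths and the write-then-rename step are my own assumptions, not anything described above): each writer owns a filename derived from its rank, so no cross-process coordination is needed.

    import os

    def checkpoint_path(step: int, rank: int) -> str:
        # Hypothetical naming convention: one file per (step, rank) pair.
        return f"checkpoints/step-{step:08d}/rank-{rank:05d}.bin"

    def save_checkpoint(data: bytes, step: int, rank: int) -> None:
        path = checkpoint_path(step, rank)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        # Write to a temp name, then atomically rename, so readers never
        # observe a partially written file and no locking is required.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, path)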

threeseed:
> Consistency is hard to achieve and seems to have no use here

Famous last words.

It is very common when operating data platforms at this scale to lose a lot of nodes over time, especially in the cloud. So having a robust consistency/replication mechanism is vital to making sure your training job doesn't need to be restarted just because a block it needs is no longer on the node it expected.
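
A toy sketch of why that matters (the 3-way replica placement and the node list are invented for illustration; a real distributed file system does this far more carefully): the reader falls back to any surviving replica instead of aborting the whole job.

    REPLICATION_FACTOR = 3

    def replica_nodes(block_id: int, nodes: list[str]) -> list[str]:
        """Deterministically pick which nodes hold copies of a block."""
        start = block_id % len(nodes)
        return [nodes[(start + i) % len(nodes)] for i in range(REPLICATION_FACTOR)]

    def read_block(block_id: int, nodes: list[str], alive: set[str]) -> str:
        # Try each replica in turn; only fail if every copy's node is gone.
        for node in replica_nodes(block_id, nodes):
            if node in alive:
                return f"block {block_id} served from {node}"
        raise RuntimeError(f"all {REPLICATION_FACTOR} replicas of block {block_id} are offline")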

ted_dunning:
Sadly, these are often Famous First words.

What follows is a long period of saying "see, distributed systems are easy for genius developers like me".

The last words are typically "oh shit", shortly followed, oxymoronically, by "bye! gotta go".