I just went through this this weekend - If you're running in Windows and want to use deepspeed, you have to still use Cuda 12.1 because deepspeed 13.1 is the latest that works with 12.1. There's no deepspeed for windows that works with 12.3.
I tried to get it working this weekend but it was a huge PITA so I switched to putting everything into WSL2 then in arch on there pytorch etc in containers so I could flip versions easily now that I know how SPECIFIC the versions are to one another.
I'm still working on that part, halfway into it my WSL2 completely broke and I had to reinstall windows. I'm scared to mount the vhdx right now. I did ALL of my work and ALL of my documentation is inside of the WSL2 archlinux and NOT on my windows machine. I have EVERYTHING I need to quickly put another server up (dotfiles, configs) sitting in a chezmoi git repo ON THE VM. That I only git committed one init like
5 mins into everything. THAT was a learning experience, now I have no idea if I should follow the "best practice" of keeping projects in wsl or having wsl reach out to windows, there's a performance drop. The 9p networking stopped working and no matter what I reinstalled, reset, removed features, reset windows, etc, it wouldn't start. But at least I have that WSL2 .vhdx image that will hopefully mount and start. And probably break WSL2 again. I even SPECIFICALLY took backups of the image as tarballs every hour in case I broke LINUX, not WSL.
If anyone has done sd containers in wsl2 already let me know. I've tried to use WSL for dev work (i use osx) like this 2-3 times in the last 4-5 years and I always run into some catastrophically broken thing that makes my WSL stop working. I hadn't used it in years so hoped it was super reliable by now. This is on 3 different desktops with completely different hardware, etc. I was terrified it would break this weekend and IT DID. At least I can be up in windows in 20 minutes thanks to chocolately and chezmoi. Wiped out my entire gaming desktop.
Sorry I'm venting now this was my entire weekend.
This repo is from a deepspeed contrib (iirc) and lists the reqs for deepspeed + windows that mention the version matches
https://github.com/S95Sedan/Deepspeed-Windows
> conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
It may sound weird to do any of this in Windows, or maybe not, but if it does just remember that it's a lot of gamers like me with 4090s who just want to learn ML stuff as a hobby. I have absolutely no idea what I'm doing but thank god I know containers and linux like the back of my hand.