> If you want to use CUDA as a simple example,
When was the last time you tried this?
If your "CUDA" needs are pytorch, tensorflow, whatever, pip install (or uv pip install) and you're good to go.
When was the last time you even needed to? If you need to do actual kernel writing and thus actually need CUDA (this is pretty uncommon and I think most people that do that wouldn't be asking this question), then most of the issues are not actually issues.
I'll give an example of my latest CUDA error. I run EndeavourOS (Arch based) and so yes, I'm using bleeding edge drivers. Did an update, reboot, oh no... I get to the lock screen, log in, and get a black screen (but with a cursor).[0] What's the solution? Roll back cuda. Didn't work? Roll back the kernel. Now it works. The problem? nvidia-560.35.03-9 was incompatible with kernel 6.11. I was even able to (quickly) find the exact issue in the forums[1].
But why am I saying this is no biggie? Well... I'm fucking running 560 drivers, which are beta. If you worry about these issues, don't run beta drivers. If you don't want that power, don't run Arch, Gentoo, or other bleeding edge distros. You know the most confusing part of all this? People were posting their driver versions from `inxi -G`, so you only see `560.35.03`, but I had to roll back from `560.35.03-9` to `560.35.03-6`. But also, Nvidia could be better about their naming.
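For anyone tripped up by the same thing: on Arch-based distros the package version is the upstream version plus a package release number after the last dash, and that last number is the part `inxi -G` doesn't show. A minimal sketch of splitting the two (`split_pkg_version` is a name I made up for illustration, and it ignores epochs):

```shell
# Split an Arch-style package version into upstream version and package release.
# split_pkg_version is a hypothetical helper, not a pacman command.
split_pkg_version() {
    local full="$1"
    # ${full%-*}  strips the shortest "-something" suffix (the release number)
    # ${full##*-} strips everything up to and including the last dash
    printf 'upstream=%s pkgrel=%s\n' "${full%-*}" "${full##*-}"
}

split_pkg_version "560.35.03-9"   # prints: upstream=560.35.03 pkgrel=9
split_pkg_version "560.35.03-6"   # prints: upstream=560.35.03 pkgrel=6
```

To see the full version of what you actually have installed, `pacman -Q <package>` prints it with the release number included.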
I will also concede that there is a lot of shit information out there and actually parsing what the real answers are takes experience. So here's my advice when you run into your next issue:
Getting Information:
- Start with journalctl and dmesg (try `journalctl -b -p 3` and `dmesg -L -l "err+"`. `-b` is only messages since last boot and the other flags are to only give you errors or worse). These are your "logs"
- There are others, and they *should* go under `/var/log`, but just like how on OSX random junk goes to {~,/}Library/{Caches,Application Support}, they don't always
- Check versions, especially if you did an update
- (side note): For all those confused where files should go, try `man hier`
- Good chance you can get through by reading the man page, but this doesn't always apply
- Also remember you can do `man 7 man` or `man man.7` (replace the second `man` with any command). Also see `man -a man`
- Don't know what man page you need? Try `man --regex cuda`
- Visit the Arch Wiki (even if you're not on Arch) -- maybe even the Gentoo Wiki. RedHat docs are also pretty good
- After that, try your distro's (or their parent's) forums.
- Archwiki is good, Arch forums are a toxic hellhole occupied by people whose idea of grass is entirely derived from what is visible on a screen. Use the forums of the children. I'm sorry to those who've experienced that place.
- Then try Google, focusing on things from your logs. (This would be up higher, but you can put quotes around things or dates and Google will outright disrespect you now)
- If it is a specific program that looks to be the issue, try the Git{Hub,Lab} issues page too. Feel free to open an issue. Most devs are pretty nice, even to noobs, though there are also many who will insinuate you RTFM after quoting and linking to it. I'm also sorry about this.
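For the "check versions" step: pacman keeps a log of what each update changed. A rough sketch, assuming the default log path `/var/log/pacman.log` and its `[ALPM] upgraded` lines (`recent_upgrades` is a name I made up, and other distros log this elsewhere):

```shell
# Show the most recent package upgrades recorded by pacman.
# recent_upgrades is a hypothetical helper; args: [count] [logfile]
recent_upgrades() {
    local count="${1:-20}"
    local log="${2:-/var/log/pacman.log}"
    # -F: treat the pattern as a fixed string so the brackets aren't a regex
    grep -F '[ALPM] upgraded' "$log" | tail -n "$count"
}

# Only run against the real log if it exists (i.e. you're on a pacman distro).
if [ -r /var/log/pacman.log ]; then
    recent_upgrades 20
fi
```

If the breakage started right after an update, whatever shows up at the bottom of that list is your first suspect.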
Solving issues:
- First try rolling back. If you're not messing with your system, this can make most problems go away VERY quickly.
- If you're on a rolling release distro (like Arch) then this is your go-to. Unless you like problem solving. But then why are you on Arch?
- With `pacman` this can usually be done quickly with `pacman -U file:///var/cache/pacman/pkg/thing-you-want`. You can use other tools, but this is good to know, and you know where things cache :) (`downgrade` is the common tool but it just does this) You can even do kernels this way!
- Things like `timeshift` are useful (and the `pacman` or `apt` "autosnap"). But beware: if you aren't using `grub`, just don't enable that option. Also check out `btrfs`
- If you need to reinstall an old kernel and it isn't in your cache, check out the command `reinstall-kernels` (try `cat /usr/bin/reinstall-kernels`). This is an uncommon task and might only come up because you've filled up `/efi` and deleted a kernel.
- Stop fucking with the kernel if you don't know what you're doing. 99% of the time this is ***NOT*** the solution[2]
- For nvidia you might want `nvidia_drm.modeset=1` and ***maybe*** (probably not) `nvidia_drm.fbdev=1`
- Use `find` and `grep`.
- I'm not joking, `find` is a crazy powerful tool and people sleep on it. (Seriously, how do people jump into large codebases blind and get running without `find`, `grep`, `awk`, and such tools?)[3]
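The rollback bullets above can be sketched like this. It assumes pacman's default cache directory and the usual `name-version-release-arch.pkg.tar.*` file naming; `cached_versions` is a made-up helper, the name match is only a heuristic, and `sort -V` needs GNU coreutils:

```shell
# List cached package files for a package, oldest version first.
# cached_versions is a hypothetical helper; args: <package> [cache dir]
cached_versions() {
    local pkg="$1"
    local dir="${2:-/var/cache/pacman/pkg}"
    # Match "<pkg>-<digit>..." so "nvidia" doesn't also match "nvidia-utils";
    # skip the detached .sig signature files that sit next to the packages.
    find "$dir" -maxdepth 1 -type f -name "${pkg}-[0-9]*.pkg.tar.*" ! -name '*.sig' | sort -V
}

# Look at what you have, then install the one you want (needs root):
#   cached_versions some-package
#   sudo pacman -U /var/cache/pacman/pkg/thing-you-want
```

Nothing here that `ls` and tab completion can't do, but it's a decent excuse to get comfortable with `find`.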
But honestly, you'll need to do none of this stuff if you're on a "baby" distro. I very much welcome people becoming more experienced with Linux, but not everyone needs to be, and there's no issue with using a distro that holds your hand (OSX and Windows do). But I would strongly encourage any programmer (not just Linux users) to become more familiar with the CLI. There's an investment cost, but you'll reap >10x rewards from these efforts, even in general programming situations.
[0] For the fun of it, I asked GPT and gave it logs from journal and dmesg, it did not get the answer, and listening to it would have sent me down a rabbit hole where I'd be messing with the kernel (I use systemd and dracut, these were communicated to GPT and it was asking me to run mkinitcpio and mess with grub lol)
[1] https://forum.endeavouros.com/t/only-black-screen-after-logi...
And hey look, an update: https://forum.endeavouros.com/t/attention-nvidia-gpu-driver-...
[2] For me `/etc/kernel/cmdline` looks pretty much like `nvme_load=YES nowatchdog rw root=UUID=<that> resume=UUID=<blahh> nvidia_drm.modeset=1 nvidia_drm.fbdev=1`. It should be short.
[3] Here's a free one for you. Got a python project and you forgot to place `__init__.py` in the folders? `find src -type d -exec touch "{}/__init__.py" \;` (replace `src` with your root source directory)