Surprisingly, the blocker has been identifying notes from the microphone input. I assumed that would be a long-solved problem: just do an FFT and find the peaks of the spectrogram? But apparently that doesn't work well once there are harmonics and reverb involved, and you have to use AI models (Google and Spotify have published some) to do it. And even then it still seems to fail if more than three notes are played simultaneously.
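(For the curious, the naive version I started with looked roughly like this — a minimal sketch, assuming `samples` is a mono float buffer from the mic at 44.1 kHz and the names here are just mine. It behaves fine on one clean tone and falls apart on chords:)

```python
import numpy as np

SAMPLE_RATE = 44100  # assumed mic sample rate

def naive_note_guess(samples: np.ndarray, top_n: int = 3) -> list[str]:
    """Pick the strongest spectral peaks and name the nearest MIDI notes."""
    windowed = samples * np.hanning(len(samples))        # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / SAMPLE_RATE)

    # Keep the top_n strongest bins above 50 Hz (ignore DC offset and rumble).
    valid = freqs > 50
    peak_bins = np.argsort(spectrum * valid)[-top_n:]

    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    notes = []
    for b in sorted(peak_bins):
        midi = int(round(69 + 12 * np.log2(freqs[b] / 440.0)))  # Hz -> MIDI number
        notes.append(f"{names[midi % 12]}{midi // 12 - 1}")
    return notes
```

The core problem, as far as I can tell, is that the harmonics of one note land right where another note's fundamental would be, so no amount of cleverness in the peak picking alone fixes polyphony — which seems to be exactly why the model-based tools exist.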
Now I'm baffled as to how song identification can work at all, if even identifying notes is this unreliable! Maybe I'm doing something wrong.