Since we're being pedantic: there is some confusion of ideas here (even though your overall point is valid), and the strawman may not be as ridiculous as it looks.
First, I think when you say FFT, you mean DFT. A Fourier transform (FT) is both continuous (non-discrete) and infinite in time. A DTFT (discrete-time Fourier transform) is discrete, i.e. it uses samples, but is still infinite in time. A DFT (discrete Fourier transform) is both finite (the analyzed data has a start and an end) and discrete. An FFT is just a fast algorithm for computing a DFT, and nothing indicates to me that hearing is in any way specifically related to how the FFT computes a DFT.
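To make the FFT/DFT distinction concrete, here is a minimal sketch in plain Python. The function names `naive_dft` and `fft_radix2` and the test signal are made up for illustration; the recursive radix-2 Cooley-Tukey scheme is only one of many FFT algorithms, and it assumes the input length is a power of two. Both routines compute the same DFT; only the amount of work differs.

```python
import cmath


def naive_dft(x):
    """Evaluate the DFT definition directly: O(N^2) multiplications."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT: O(N log N).

    Assumes len(x) is a power of two (an assumption of this sketch).
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])  # DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])   # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


# Hypothetical 8-sample test signal.
signal = [0.0, 1.0, 0.0, -1.0, 0.5, 0.25, -0.5, -0.25]
slow = naive_dft(signal)
fast = fft_radix2(signal)

# Largest difference between the two results; only rounding error remains.
max_diff = max(abs(a - b) for a, b in zip(slow, fast))
print(max_diff)
```

The two outputs agree up to floating-point rounding, which is the point: "FFT" names how the numbers are computed, "DFT" names what is computed.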
But more importantly, I'm not sure DFT fits at all? This is an analog, real-world physical process, so where is it discrete, i.e. how does the ear capture samples?
I think, purely based upon its "mode", what's happening is more akin to a Fourier series, which is the missing fourth category completing (FT, DTFT, DFT): Continuous (non-discrete), but finite or rather periodic in time.
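For reference, the four categories can be written side by side. These are the standard textbook definitions under one common sign and normalization convention (other conventions exist); the symbols $x$, $X$, $c_k$, $T$ and $N$ are just the usual notation, not anything taken from the article.

```latex
\begin{align*}
\text{FT (continuous, infinite):} \quad
  & X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-i 2\pi f t}\, dt \\
\text{Fourier series (continuous, periodic with period } T\text{):} \quad
  & c_k = \frac{1}{T} \int_{0}^{T} x(t)\, e^{-i 2\pi k t / T}\, dt \\
\text{DTFT (discrete, infinite):} \quad
  & X(\omega) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-i \omega n} \\
\text{DFT (discrete, finite with length } N\text{):} \quad
  & X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-i 2\pi k n / N}
\end{align*}
```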
Secondly, unlike Gabor transforms, wavelet transforms are specifically not just windowed Fourier anythings (whether FT/FS/DFT/DTFT). Windowed Fourier transforms are commonly called "short-time Fourier transforms" (STFT, which again exist in discrete and non-discrete variants), and the article straight up mentions in its footnotes that they don't fit either.
Wavelet transforms use an entirely different analysis shape (e.g. a Haar wavelet) that is shifted and stretched across the signal, instead of windowed sinusoids over a windowed signal.
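A rough sketch of that difference in plain Python. Everything here is illustrative: `haar`, `haar_coeff`, `stft_coeff` and the 16-sample step signal are hypothetical names and data, the STFT uses a plain rectangular window, and the wavelet coefficient is a simple sampled sum rather than a proper transform implementation.

```python
import cmath
import math


def haar(t):
    """Haar mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0


def haar_coeff(x, scale, shift):
    """Correlate x with the wavelet stretched by `scale` and moved to `shift`."""
    total = sum(x[n] * haar((n - shift) / scale) for n in range(len(x)))
    return total / math.sqrt(scale)


def stft_coeff(x, start, width, k):
    """One STFT bin: a sinusoid under a rectangular window of fixed width."""
    return sum(
        x[n] * cmath.exp(-2j * cmath.pi * k * (n - start) / width)
        for n in range(start, start + width)
    )


# Hypothetical signal: a step from 0 to 1 at sample 8.
step = [0.0] * 8 + [1.0] * 8

flat_low = haar_coeff(step, scale=4, shift=0)    # window lies on the zeros
at_edge = haar_coeff(step, scale=4, shift=6)     # window straddles the step
flat_high = haar_coeff(step, scale=4, shift=10)  # window lies on the ones

dc_bin = stft_coeff(step, start=10, width=4, k=0)    # constant component
first_bin = stft_coeff(step, start=10, width=4, k=1)  # one cycle per window

print(flat_low, at_edge, flat_high, dc_bin, first_bin)
```

The Haar coefficient is nonzero only where the stretched wavelet straddles the step, and changing `scale` changes the shape's width itself. In the STFT the window width stays fixed and only the frequency of the sinusoid under it changes.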
And I think those distinctions are what the article actually wanted to touch upon.