How the cochlea computes (2024)

(www.dissonances.blog)
475 points by izhak | 56 comments
1. edbaskerville ◴[] No.45762928[source]
To summarize: the ear does not do a Fourier transform, but it does do a time-localized frequency-domain transform akin to wavelets (specifically, intermediate between wavelet and Gabor transforms). It does this because the sounds processed by the ear are often localized in time.

The article also describes a theory that human speech evolved to occupy an unoccupied space in frequency vs. envelope duration space. It makes no explicit connection between that fact and the type of transform the ear does—but one would suspect that the specific characteristics of the human cochlea might be tuned to human speech while still being able to process environmental and animal sounds sufficiently well.

A more complicated hypothesis off the top of my head: the location of human speech in frequency/envelope is a tradeoff between (1) occupying an unfilled niche in sound space; (2) optimal information density taking brain processing speed into account; and (3) evolutionary constraints on physiology of sound production and hearing.

replies(12): >>45763026 #>>45763057 #>>45763066 #>>45763124 #>>45763139 #>>45763700 #>>45763804 #>>45764016 #>>45764339 #>>45764582 #>>45765101 #>>45765398 #
2. AreYouElite ◴[] No.45763026[source]
Do you believe it might be possible that the frequency band of human speech is not determined by such factors at all, but is more a function of height? Kids have higher voices, adults have deeper voices. Similar to stringed instruments: the viola is high-pitched and the bass low-pitched.

I'm no expert in these matters just speculating...

replies(1): >>45763519 #
3. matthewdgreen ◴[] No.45763057[source]
If you take this thought process even farther, specific words and phonemes should occupy specific slices of the tradeoff space. Across all languages and cultures, an immediate warning that a tiger is about to jump on you should sit in a different place than a mother comforting a baby (which, of course, it does.) Maybe that even filters down to ordinary conversational speech.
4. xeonmc ◴[] No.45763066[source]
Analogy: when you knock on doors, how do you decide what rhythm and duration to use, so that it won’t be mistaken as accidentally hitting the door?
replies(1): >>45763078 #
5. toast0 ◴[] No.45763078[source]
Shave and a haircut is the only option in my knocking decision tree.
replies(2): >>45763436 #>>45765777 #
6. a-dub ◴[] No.45763124[source]
> At high frequencies, frequency resolution is sacrificed for temporal resolution, and vice versa at low frequencies.

this is the time-frequency uncertainty principle. intuitively it can be understood by thinking about wavelength. the more stretched out the waveform is in time, the more of it you need to see in order to have a good representation of its frequency, but the more of it you see, the less precise you can be about where exactly it is.
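as a rough numpy sketch of that tradeoff (parameters picked arbitrarily, nothing cochlea-specific): two tones 10 Hz apart blur together under a short analysis window, and only a long window, which gives up timing precision, separates them.

  import numpy as np

  fs = 8000
  t = np.arange(fs) / fs                               # one second of signal
  x = np.sin(2*np.pi*440*t) + np.sin(2*np.pi*450*t)    # two tones 10 Hz apart

  for n in (256, 4096):                                # short vs long analysis window
      spectrum = np.abs(np.fft.rfft(x[:n]))
      freqs = np.fft.rfftfreq(n, d=1/fs)
      peaks = freqs[spectrum > 0.5 * spectrum.max()]
      # bin spacing is fs/n: ~31 Hz for the short window, ~2 Hz for the long one
      print(f"{1000*n/fs:4.0f} ms window, {fs/n:5.1f} Hz bins, peaks near {peaks.round()}")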

> but it does do a time-localized frequency-domain transform akin to wavelets

maybe easier to conceive of first as an arbitrarily defined filter bank based on physiological results rather than trying to jump directly to some neatly defined set of orthogonal basis functions. additionally, orthogonal basis functions cannot, by definition, capture things like masking effects.
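a minimal sketch of that filter-bank view (spacing and bandwidths made up here, not physiological): log-spaced bandpass filters, each one reporting how much energy its band carries.

  import numpy as np
  from scipy.signal import butter, sosfilt

  fs = 16000
  t = np.arange(fs) / fs
  x = np.sin(2*np.pi*300*t) + 0.5*np.sin(2*np.pi*3000*t)   # toy two-tone input

  # log-spaced center frequencies; real cochlear bandwidths follow
  # psychoacoustic measurements (e.g. ERB), not neat half-octaves
  for fc in np.geomspace(100, 6000, 16):
      lo, hi = fc / 2**0.25, fc * 2**0.25                  # half-octave band around fc
      sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
      band = sosfilt(sos, x)
      print(f"{fc:7.1f} Hz channel, mean power {np.mean(band**2):.4f}")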

> A more complicated hypothesis off the top of my head: the location of human speech in frequency/envelope is a tradeoff between (1) occupying an unfilled niche in sound space; (2) optimal information density taking brain processing speed into account; and (3) evolutionary constraints on physiology of sound production and hearing.

(4) size of the animal.

notably: some smaller creatures have ultrasonic vocalization and sensory capability; sometimes this is hypothesized to complement visual perception for avoiding predators, but it also could just have a lot to do with the fact that, well, they have tiny articulators and tiny vocalizations!

replies(1): >>45764189 #
7. SoftTalker ◴[] No.45763139[source]
Ears evolved long before speech did. Probably in step with vocalizations however.
replies(2): >>45763201 #>>45766479 #
8. Sharlin ◴[] No.45763201[source]
Not sure about that; I'd guess that vibration-sensing organs first evolved to sense disturbances (in water, on seafloor, later on dry ground and in air) caused by movement, whether of a predator, prey, or a potential mate. Intentional vocalizations for signalling purposes then evolved to utilize the existing modality.
9. cnity ◴[] No.45763436{3}[source]
Thanks for giving your two bits on the matter.
10. fwip ◴[] No.45763519[source]
It's not height, but vocal cord length and thickness. Longer vocal cords (induced by testosterone during puberty) vibrate more slowly, with a lower frequency/pitch.
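For intuition only, the ideal-string formula points the same way (real vocal folds are far messier, and the numbers below are placeholders, not physiological values): fundamental frequency falls as the vibrating length grows and rises with tension.

  import math

  def string_f0(length_m, tension_n, mass_per_length_kg_m):
      # f0 = (1 / 2L) * sqrt(T / mu) for an idealized vibrating string
      return math.sqrt(tension_n / mass_per_length_kg_m) / (2 * length_m)

  # doubling the length halves the pitch, all else held equal
  print(string_f0(0.015, 2.0, 0.001), string_f0(0.030, 2.0, 0.001))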
11. FarmerPotato ◴[] No.45763700[source]
Is that a human understanding, or is it just an AI that read the text and ignored the pictures?

Why do we need a summary in a post that adds nothing new to the conversation?

replies(1): >>45763787 #
12. pests ◴[] No.45763787[source]
Are you saying your parent post was an AI summary? There is original speculation at the end and it didn’t come off that way to me.
13. dsp_person ◴[] No.45763804[source]
Even if it is doing a wavelet transform, I still see that as made of Fourier transforms. Not sure if there's a good way to describe this.

We can make a short-time fourier transform or a wavelet transform in the same way either by:

- filterbank approach integrating signals in time

- take Fourier transform of time slices, integrating in frequency

The same machinery just with different filters.
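A small numpy check of that equivalence (arbitrary parameters, one bin only): the "FFT of windowed slices" value for bin k equals the output of a bandpass filter, i.e. a windowed complex sinusoid, sampled at the hop times.

  import numpy as np

  fs, N, hop, k = 8000, 256, 64, 20              # window length, hop size, chosen bin
  x = np.sin(2*np.pi*625*np.arange(2*fs)/fs)     # 625 Hz lands exactly on bin 20 when N=256
  w = np.hanning(N)
  starts = range(0, len(x) - N, hop)

  # view 1: Fourier transform of windowed time slices, keep one bin
  slices = np.array([np.fft.rfft(w * x[s:s+N])[k] for s in starts])

  # view 2: filter bank -- a modulated window acts as a bandpass filter,
  # and we sample its output once per hop
  kernel = w * np.exp(-2j*np.pi*k*np.arange(N)/N)
  filtered = np.array([np.dot(kernel, x[s:s+N]) for s in starts])

  print(np.allclose(slices, filtered))           # True: same machinery, different filters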

14. psunavy03 ◴[] No.45764016[source]
> A more complicated hypothesis off the top of my head: the location of human speech in frequency/envelope is a tradeoff between (1) occupying an unfilled niche in sound space; (2) optimal information density taking brain processing speed into account; and (3) evolutionary constraints on physiology of sound production and hearing.

Well from an evolutionary perspective, this would be unsurprising, considering any other forms of language would have been ill-fitted for purpose and died out. This is really just a flavor of the anthropic principle.

15. Terr_ ◴[] No.45764189[source]
> it also could just have a lot to do with the fact that, well, they have tiny articulators and tiny vocalizations!

Now I'm imagining some alien shrew with vocal-cords (or syrinx, or whatever) that runs the entire length of its body, just so that it can emit lower-frequency noises for some reason.

replies(3): >>45764677 #>>45765058 #>>45768799 #
16. lgas ◴[] No.45764339[source]
> It does this because the sounds processed by the ear are often localized in time.

What would it mean for a sound to not be localized in time?

replies(4): >>45764466 #>>45764524 #>>45764731 #>>45768788 #
17. littlestymaar ◴[] No.45764466[source]
A continuous sinusoidal sound, I guess?
18. hansvm ◴[] No.45764524[source]
It would look like a Fourier transform ;)

Zooming in to cartoonish levels might drive the point home a bit. Suppose you have sound waves

  |---------|---------|---------|
What is the frequency exactly 1/3 the way between the first two wave peaks? It's a nonsensical question. The frequency relates to the time delta between peaks, and looking locally at a sufficiently small region of time gives no information about that phenomenon.

Let's zoom out a bit. What's the frequency over a longer period of time, capturing a few peaks?

Well...if you know there is only one frequency then you can do some math to figure it out, but as soon as you might be describing a mix of frequencies you suddenly, again, potentially don't have enough information.

That lack of information manifests in a few ways. The exact math (Shannon's theorems?) suggests some things, but the language involved mismatches with human perception sufficiently that people get burned trying to apply it too directly. E.g., a bass beat with a bit of clock skew is very different from a bass beat as far as a careless decomposition is concerned, but it's likely not observable by a human listener.

Not being localized in time means* you look at longer horizons, considering more and more of those interactions. Instead of the beat of a 4/4 song meaning that the frequency changes at discrete intervals, it means that there's a larger, over-arching pattern capturing "the frequency distribution" of the entire song.

*Truly time-nonlocalized sound is of course impossible, so I'm giving some reasonable interpretation.
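To put rough numbers on the zoomed-in picture (values chosen arbitrarily): over a window much shorter than a period, a 100 Hz tone and a 110 Hz tone are nearly the same waveform, and only a longer view tells them apart.

  import numpy as np

  fs = 48000
  peek = np.arange(int(0.001 * fs)) / fs                     # a 1 ms peek at the signal
  a, b = np.sin(2*np.pi*100*peek), np.sin(2*np.pi*110*peek)
  print(np.abs(a - b).max())                                 # ~0.05: practically identical

  longer = np.arange(int(0.1 * fs)) / fs                     # a 100 ms view
  a, b = np.sin(2*np.pi*100*longer), np.sin(2*np.pi*110*longer)
  print(np.abs(a - b).max())                                 # ~2: trivially different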

replies(1): >>45764999 #
19. patrickthebold ◴[] No.45764582[source]
I think I might be missing something basic, but if you actually wanted to do a Fourier transform on the sound hitting your ear, wouldn't you need to wait your entire lifetime to compute it? It seems pretty clear that's not what is happening, since you can actually hear things as they happen.
replies(4): >>45764633 #>>45764755 #>>45764761 #>>45764952 #
20. xeonmc ◴[] No.45764633[source]
You’ll also need to have existed and started listening before the beginning of time, forever and ever. Amen.
21. bragr ◴[] No.45764677{3}[source]
Well without the humorous size difference, this is basically what whales and elephants do for long distance communication.
replies(1): >>45765641 #
22. xeonmc ◴[] No.45764731[source]
Means that it is a broad spectrum signal.

Imagine the dissonant sound of hitting a trashcan.

Now imagine the sound of pressing down all 88 keys on a piano simultaneously.

Do they sound similar in your head?

Localization happens where the phases of all the frequency components are aligned coherently and construct a pulse, while further away in time their phases are misaligned and cancel each other out.
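A quick numpy illustration of that (a sketch, nothing cochlea-specific): identical flat magnitude spectra, and only the phase alignment decides whether you get a click or a hiss.

  import numpy as np

  n = 4096
  mag = np.ones(n // 2 + 1)                      # flat, broadband magnitude spectrum

  click = np.fft.irfft(mag, n)                   # all phases aligned -> one sharp pulse at t=0

  rng = np.random.default_rng(0)
  phases = np.exp(1j * rng.uniform(0, 2*np.pi, n // 2 + 1))
  phases[0] = phases[-1] = 1                     # keep DC and Nyquist bins real
  hiss = np.fft.irfft(mag * phases, n)           # same magnitudes, scrambled phases

  print(click.max(), np.abs(click[1:]).max())    # ~1.0 at sample 0, ~0 everywhere else
  print(np.abs(hiss).max())                      # small and spread across the whole buffer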

23. cherryteastain ◴[] No.45764755[source]
Not really; we can create spectrograms [1] for a real-time audio feed without having to wait for the end of the recording, by binning the signal into timewise chunks.

[1] https://en.wikipedia.org/wiki/Spectrogram

replies(1): >>45764961 #
24. bonoboTP ◴[] No.45764761[source]
Yes, for the vanilla Fourier transform you have to integrate from negative to positive infinity. But more practically, you can put a window function with finite temporal support on it, so you only analyze a part of the signal. Whenever you see a 2D spectrogram image in audio editing software, where the audio engineer can suppress a certain range of frequencies in a certain time period, they're using something like this.

It's called the short-time Fourier transform (STFT).

https://en.wikipedia.org/wiki/Short-time_Fourier_transform
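For example, a quick scipy sketch of that recipe (parameters arbitrary): window the signal, transform each slice, and the dominant frequency per frame tracks a sweeping tone.

  import numpy as np
  from scipy.signal import stft, chirp

  fs = 8000
  t = np.arange(2 * fs) / fs
  x = chirp(t, f0=200, f1=2000, t1=2)            # a tone sweeping 200 Hz -> 2 kHz

  f, frames, Z = stft(x, fs=fs, nperseg=256)     # 256-sample Hann windows, 50% overlap
  # each column of Z is the spectrum of one ~32 ms slice, i.e. one spectrogram column
  dominant = f[np.abs(Z).argmax(axis=0)]
  print(frames[::10].round(2))                   # frame centers in seconds
  print(dominant[::10].round())                  # rises steadily, following the chirp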

replies(1): >>45768746 #
25. IshKebab ◴[] No.45764952[source]
Yes exactly. This is a classic "no cats and dogs don't actually rain from the sky" article.

Nobody who knows literally anything about signal processing thought the ear was doing a Fourier transform. Is it doing something like a STFT? Obviously yes and this article doesn't go against that.

26. IshKebab ◴[] No.45764961{3}[source]
Those use the Short-Time Fourier Transform, which is very much like what the ear does.

https://en.wikipedia.org/wiki/Short-time_Fourier_transform

replies(1): >>45766435 #
27. jancsika ◴[] No.45764999{3}[source]
> It's a nonsensical question.

Are you talking about a discrete signal or a continuous signal?

28. Y_Y ◴[] No.45765058{3}[source]
Sounds like an antenna. If you'll accept electromagnetic noise, then there are some fish that could pass for your shrew, e.g. https://en.wikipedia.org/wiki/Gymnotus
29. km3r ◴[] No.45765101[source]
> one would suspect that the specific characteristics of the human cochlea might be tuned to human speech while still being able to process environmental and animal sounds sufficiently well.

I wonder if these could be used to better master movies and television audio such that the dialogue is easier to hear.

replies(1): >>45765132 #
30. kiicia ◴[] No.45765132[source]
You are expecting too much, we still have no technology to do that, unless it’s about clarity of advertisement jingles /s
31. crazygringo ◴[] No.45765398[source]
Yeah, this article feels like it's very much setting up a ridiculous strawman.

Nobody who knows anything about signal processing has ever suggested that the ear performs a Fourier transform across infinite time.

But the ear does perform something very much akin to the FFT (fast Fourier transform), turning discrete samples into intensities at frequencies -- which is, of course, what any reasonable person means when they say the ear does a Fourier transform.

This article suggests it's accomplished by something between wavelet and Gabor. Which, yes, is not exactly a Fourier transform -- but it's producing something that is about 95-99% the same in the end.

And again, nobody would ever suggest the ear was performing the exact math that the FFT does, down to the last decimal point. But these filters still work essentially the same way as the FFT in terms of how they respond to a given frequency, it's really just how they're windowed.

So if anyone just wants a simple explanation, I would say yes the ear does a Fourier transform. A discrete one with windowing.

replies(3): >>45766343 #>>45767588 #>>45768701 #
32. Terr_ ◴[] No.45765641{4}[source]
Was playing around with a fundamental frequency calculator [0] to associate certain sizes to hertz, then using a tone-generator [1] to get a subjective idea of what it'd sound like.

Though of course, nature has plenty of other tricks, like how Koalas can go down to ~27hz. [2]

[0] https://acousticalengineer.com/fundamental-frequency-calcula...

[1] https://www.szynalski.com/tone-generator/

[2] https://www.nature.com/articles/nature.2013.14275

replies(1): >>45766503 #
33. throwaway198846 ◴[] No.45765777{3}[source]
... What does that mean?
replies(1): >>45765925 #
34. crazygringo ◴[] No.45765925{4}[source]
https://en.wikipedia.org/wiki/Shave_and_a_Haircut
35. anyfoo ◴[] No.45766343[source]
Since we're being pedantic, there is some confusion of ideas here (even though you do make a valid overall point), and the strawman may not be as ridiculous.

First, I think when you say FFT, you mean DFT. A Fourier transform is both non-discrete and infinite in time. A DTFT (discrete-time Fourier transform) is discrete, i.e. using samples, but infinite. A DFT (discrete Fourier transform) is both finite (analyzed data has a start and an end) and discrete. An FFT is effectively an implementation of a DFT, and there is nothing indicating to me that hearing is in any way specifically related to how the FFT computes a DFT.

But more importantly, I'm not sure DFT fits at all? This is an analog, real-world physical process, so where is it discrete, i.e. how does the ear capture samples?

I think, purely based upon its "mode", what's happening is more akin to a Fourier series, which is the missing fourth category alongside FT, DTFT, and DFT: continuous (non-discrete), but finite, or rather periodic, in time.

But secondly, unlike Gabor transforms, wavelet transforms are specifically not just windowed Fourier anythings (whether FT/FS/DFT/DTFT). Those would commonly be called "short-time Fourier transforms" (STFT, existing again in discrete and non-discrete variants), and the article straight up mentions that they don't fit either in its footnotes.

Wavelet transforms use an entirely different shape (e.g. a Haar wavelet) that is shifted and stretched for analysis, instead of windowed sinusoids over a windowed signal.

And I think those distinctions are what the article actually wanted to touch upon.
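To make that contrast concrete, here's a toy Haar decomposition in numpy (a sketch only, not a claim about what the cochlea does): the analyzing shape is a little step that gets shifted and stretched from level to level, rather than a sinusoid under a window.

  import numpy as np

  def haar_step(x):
      # one level of the Haar wavelet transform: pairwise averages
      # (approximation) and pairwise differences (detail)
      x = np.asarray(x, dtype=float)
      return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

  x = np.sin(2 * np.pi * np.arange(64) / 8)      # toy input
  details = []
  approx = x
  for _ in range(3):                             # each level stretches the wavelet by 2,
      approx, d = haar_step(approx)              # probing a lower band with coarser timing
      details.append(d)
  print([len(d) for d in details])               # [32, 16, 8]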

replies(1): >>45766722 #
36. anyfoo ◴[] No.45766435{4}[source]
Yes, but the article specifically says that it isn't like a short-time fourier transform either, but more like a wavelet transform, which is different yet again.
replies(1): >>45766571 #
37. jibal ◴[] No.45766479[source]
Ears arose long before speech did. They then kept evolving in response to changes in the environment, e.g., the eventual existence of speech.
38. fuzzfactor ◴[] No.45766503{5}[source]
How long would a Dachshund have to be for it to sound like a 60 kilo Great Dane?
39. IshKebab ◴[] No.45766571{5}[source]
Barely different though. Obviously nobody is saying it's exactly a Fourier transform or a STFT. But it's very like a STFT (or a wavelet transform).

The article is pretty much "cows aren't actually spheres guys".

replies(2): >>45766613 #>>45768760 #
40. anyfoo ◴[] No.45766613{6}[source]
I'd say the title is like that (and I agree with someone else's assessment of it being clickbait-y). I think the actual article does a pretty good job of distinguishing a lot of these transforms, and homing in on which one matches best.

But the title instead makes it sound (pun unintended) that what the ear does is not about frequency decomposition at all.

replies(1): >>45768406 #
41. actionfromafar ◴[] No.45766722{3}[source]
Don’t neurons fire in bursts? That’s sort of discrete I guess.
replies(4): >>45766907 #>>45767028 #>>45768439 #>>45768720 #
42. smallnix ◴[] No.45766907{4}[source]
I was also thinking of refractory periods with neurotransmitters. But I don't know much about this.
replies(1): >>45767032 #
43. anyfoo ◴[] No.45767028{4}[source]
Even if they do (and I honestly have no idea), isn't what gets sampled here the frequency, i.e. the output of the basilar membrane in the ear, rather than a sample in time of the actual sound wave, which is what would correspond to a short-time frequency transform?

And the basilar membrane seems like a pretty un-discrete (in time, not in frequency) process to me. But I'm not 100% sure.

Sure, if you go small enough, you end up with discrete structures sooner or later (molecules, atoms, quantum if you go far down enough and everything breaks apart anyway), but without knowing anything, the sensitivity of this whole process still seems better modeled as continuous rather than discrete, the scale at which that happens seems just too small to me.

replies(2): >>45767256 #>>45773173 #
44. anyfoo ◴[] No.45767032{5}[source]
It's a good question, but as elaborated in a sibling comment, I'm not sure it even matters in this case. (Sampling frequency vs. sampling the sound wave itself.)
45. a-dub ◴[] No.45767256{5}[source]
going all the way out to percept, the response of the system is non-linear: https://en.wikipedia.org/wiki/Mel_scale

this is believed to come from the shape of the cochlea, which is often modeled as a filterbank that can express this non-linearity in an intuitive way.
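the usual fit for that non-linearity (hedged: it's one of several published formulas) looks like this:

  import numpy as np

  def hz_to_mel(f_hz):
      # the common O'Shaughnessy fit for the mel scale
      return 2595 * np.log10(1 + np.asarray(f_hz) / 700)

  # equal mel steps are roughly equal perceived-pitch steps:
  print(hz_to_mel(200) - hz_to_mel(100))    # ~133 mel for a 100 Hz step at the low end
  print(hz_to_mel(4100) - hz_to_mel(4000))  # ~24 mel for the same step higher up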

46. waffletower ◴[] No.45767588[source]
The article does a fair job of positing that the ear provides temporal/frequency resolution along a logarithmic scale, but doesn't state clearly that this resolution is fixed for the STFT and its Gabor variant. It hints, though not very articulately, that wavelets are closer in terms of perceptual scaling as a function of frequency. But it is interesting that the author's thesis, that Fourier mathematics isn't appropriate for describing human perception of sound, relates human hearing to the Gabor transform, which is thoroughly a derivative of discrete Fourier mathematics.
replies(1): >>45768728 #
47. jibal ◴[] No.45768406{7}[source]
The fourth sentence in the article is "Vibrations travel through the fluid to the basilar membrane, which remarkably performs frequency separation", with the footnote

"We call this tonotopic organization, which is a mapping from frequency to space. This type of organization also exists in the cortex for other senses in addition to audition, such as retinotopy for vision and somatotopy for touch."

So the cochlea does frequency decomposition but not by performing a FT (https://en.wikipedia.org/wiki/Fourier_transform), but rather by a biomechanical process involving numerous sensors that are sensitive to different frequency ranges ... similar to how we have different kinds (only 3, or in birds and rare humans 4) of cones in the retina that are sensitive to different frequency ranges.

The claim that the title makes it sound like what the ear does is not about frequency decomposition at all is simply false ... that's not what it says, at all.

48. acjohnson55 ◴[] No.45768439{4}[source]
Yes. See the volley theory of hearing: https://en.wikipedia.org/wiki/Volley_theory
49. kragen ◴[] No.45768701[source]
> turning discrete samples into intensities at frequencies

This description applies equally well to the discrete wavelet, discrete Gabor, and maybe even Hadamard transforms, which are definitely not, as you assert, "95–99% the same in the end" (how would you even measure such similarity?) So it is not something any reasonable person has ever meant by "the Fourier transform" or even "the discrete Fourier transform".

Also, you seem to be confused about what "discrete" means in the context of the Fourier transform. The ear functions in continuous time and does not take discrete samples.

50. kragen ◴[] No.45768720{4}[source]
I think those bursts ("action potentials") happen at continuously varying times, though.
51. kragen ◴[] No.45768728{3}[source]
Many solutions to differential equations are thoroughly derived from the Fourier transform too, and so is Heisenberg's uncertainty principle. That doesn't mean they're the same thing.
52. kragen ◴[] No.45768746{3}[source]
Yeah. But a really annoying thing about the STFT is that its temporal resolution is independent of frequency, so you either have to have shitty temporal resolution at high frequencies or shitty frequency resolution at low ones, compared to the human ear. So in Audacity I keep having to switch back and forth between window sizes.
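A quick way to see that fixed tradeoff (window sizes arbitrary): both resolutions are set once by the window length and are the same at every frequency, unlike the cochlea.

  import numpy as np
  from scipy.signal import stft

  fs = 44100
  x = np.random.default_rng(0).standard_normal(fs)    # any one-second signal

  for nperseg in (256, 4096):
      f, t, Z = stft(x, fs=fs, nperseg=nperseg)
      print(f"window {nperseg:5d}: {f[1]-f[0]:6.1f} Hz bins, {(t[1]-t[0])*1000:5.1f} ms hops")
  # window   256:  172.3 Hz bins,   2.9 ms hops  -- fine timing, coarse pitch
  # window  4096:   10.8 Hz bins,  46.4 ms hops  -- fine pitch, coarse timing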
53. kragen ◴[] No.45768760{6}[source]
It's very unlike both of those, as the nice diagrams in the article explain; not only is what it is saying not obvious to you, it is apparently something you actively disbelieve.
54. kragen ◴[] No.45768788[source]
The 50-cycle hum of the transformer outside your house. Tinnitus. The ≈15kHz horizontal scanning frequency whine of a CRT TV you used to be able to hear when you were a kid.

Of course, none of these are completely nonlocalized in time. Sooner or later there will be a blackout and the transformer will go silent. But it's a lot less localized than the chirp of a bird.

55. taneq ◴[] No.45768799{3}[source]
I’m not sure exactly how, but cats can emit a surprisingly low growl when they want to. Like, as deep as a large human would be able to. So there’s more going on than just linear size… And now I’m wondering what the lowest recorded pitch made by a shrew is.
56. Balgair ◴[] No.45773173{5}[source]
Neuro person here.

Yes, many neurons fire at discrete intervals set by their morphology. In fact, this DFT/FFT/Infinite-FT/whatever-FT is all the hell over neuroscience. Many neurons don't really 'communicate' in just a single action potential. They are mostly firing at each other all the time, and the rate of firing is what communicates information. So neuron A is always popping at neuron B, but that tone/rate of popping is what effects change/carries the information.
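A toy picture of that rate code (purely illustrative, nothing anatomical): the message is the mean firing rate, not any individual spike.

  import numpy as np

  rng = np.random.default_rng(0)
  dt = 0.001                                        # 1 ms bins

  def spike_train(rate_hz, seconds=1.0):
      # Poisson-ish spikes: each bin fires with probability rate * dt
      return rng.random(int(seconds / dt)) < rate_hz * dt

  quiet, excited = spike_train(5), spike_train(50)
  print(quiet.sum(), excited.sum())                 # ~5 vs ~50 spikes in one second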

Now, this is not nearly true of every single neuron-neuron interaction. Some do use a single action potential (your patellar knee reflex), some communicate with hundreds of other neurons (Purkinje cells in your cerebellum), some inhibit the firing of other neurons (gap/dendrite junction/axon interactions), some transmit information in opposite ways. It's a giant mess and the exact sub system is what you have to specify to get a handle on things.

Also, you get whole brain wave activity during different periods of sleep and awake cycles. So all the neurons will sync up their firing rates in certain areas when you're dreaming or taking an SAT or something. And yes, you can influence mass cyclic firing with powerful magnets (TMS).

For the cochlea here, these hair cells are mostly firing all the time, and then when a sound/frequency that they are 'tuned' to is heard, their firing pattern changes and that information is transmitted toward the temporal lobes. To be clear too, there are a lot of other brain structures in the way before the info gets to a place where you can be conscious of it. Things like the medial nuclei, the trapezoid bodies, the calyx of Held, etc. Most of these areas are for discriminating sounds and the location of sounds in space. So like when your fan is on for a long while and you no longer hear it, that's because of the other structures.