Phonetic Matching

(smoores.dev)
77 points raybb | 15 comments
asveikau ◴[] No.42172434[source]
The idea that "shore" and "sure" are pronounced "almost identically" would depend pretty heavily on your accent. The vowel is pretty different to me.

Also, the matches for "sorI" and "sorY" would seem to me to misinterpret the words as having a vowel at the end, rather than a silent vowel. If you're using data meant for foreign surnames, the rules of which may differ from English and which might have silent vowels be very rare depending on the original language, of course you may mispronounce English words like this, saying both shore and sure as "sore-ee".

I'm sure there are much better ways to transcribe orthography to phonetics, probably people have published libraries that do it. From some googling, it seems like some people call this type of library a phonemic transcriber or IPA transcriber.

replies(5): >>42172850 #>>42173496 #>>42177414 #>>42179389 #>>42180312 #
1. woodrowbarlow ◴[] No.42173496[source]
IPA is the most-used tool by linguistic researchers for encoding pronunciation in a standardized way. IPA is criticized for being a little bit anglo-centric and falls short for some languages and edge cases, but overall it performs pretty well. (learned from an ex who studies linguistics.)
replies(4): >>42173671 #>>42174382 #>>42174781 #>>42177483 #
2. asveikau ◴[] No.42173671[source]
I've always found IPA to be deeply confusing for English, because different accents have different historical vowel mergers, so I am never sure about vowels. And I think linguists aren't always sure about them either. IIRC, I saw a video by Geoff Lindsey suggesting Americans don't really have a /ʌ/ phoneme. Most people who have written about this write as if we do. (By the way, Dr. Lindsey's YouTube videos are some of the more interesting content I've found about English phonetics)

For other languages I have exposure to, IPA seems to make more sense. Possibly I have a bias in that they're not my native language, so I can analyze them instead of internalizing them. But also, they have cleaner phonetics, cleaner orthography, and less regional variation of phonemes.

replies(1): >>42177511 #
3. tokinonagare ◴[] No.42174382[source]
The issue is not really in the IPA but in how to use it. If you stay at the phonemic level, it makes more words comparable but hides distinctions that occur only in dialects. Also, for a lot of languages, there are multiple models in terms of the set of phonemes involved. If you go down the phonetic rabbit hole, the notation quickly becomes really hard to read. If you have to handle multiple variations, there are also diaphonemes, but then it's even less standardized.
4. lupire ◴[] No.42174781[source]
Yes, but stay aware that IPA is for pronunciations.

A word doesn't have a unique pronunciation. A (Speaker, Word) pair has a pronunciation, and even those are not unique. A (Speaker, Word, Utterance) triple has a pronunciation.

replies(2): >>42177401 #>>42178237 #
5. jjtheblunt ◴[] No.42177401[source]
even a speaker with a specified word in a specified utterance will vary pronunciation for the context of who is listening (imitation of local accent).

(we worked on all this in Motorola in 2001 extensively....then they dropped it)

6. bane ◴[] No.42177483[source]
This is sort of the inverse of the problem IPA is trying to solve. You're correct in that IPA is used to try to encode pronunciation. But phonetic matching is trying to encode those areas where different people, in different accents (maybe languages), say or write semantically the same thing, but differently -- and you need to find all the others using only one of the different versions, without finding things that are not the same or are irrelevant.

Basically it's trying to smush all the different versions together into a single sort of cluster, where the identity of the cluster is any of the versions.
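To make the clustering idea concrete, here is a minimal sketch that uses classic American Soundex as the cluster key. This is an assumption for illustration only: it is not the ruleset described in this comment, and not necessarily the algorithm the article uses. The function names `soundex` and `cluster` are made up here.

```python
from collections import defaultdict

# Classic American Soundex digit classes (illustrative key function only).
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex key for an ASCII name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    key = first.upper()
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        # Emit a digit only when it differs from the previous coded letter.
        if code and code != prev:
            key += code
        # h and w do not separate equal codes; vowels (and y) do.
        if ch not in "hw":
            prev = code
    return (key + "000")[:4]


def cluster(names):
    """Group names by key; any member can stand in for its whole cluster."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)
    return dict(groups)


print(cluster(["shore", "sure", "Robert", "Rupert", "Park", "Pak"]))
```

Under this key "shore" and "sure" land in one cluster, as do "Robert" and "Rupert", but "Park" and "Pak" do not, because the r is coded. A generic key like this trades precision and recall very coarsely, which is the gap language-specific rules try to close.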

I used to work in this field about 30 years ago, specifically how names can end up being latinized when coming from non-latin languages. We were very focused on trying to collapse variants into a complex ruleset that could be used both to recognize the cluster of names as being the same "thing", and then that ruleset could also produce all the valid variants. It was very much a kind of applied "expert systems" approach that predated ML.

The rulesets were more or less context free grammars and regular expressions that we could use to "decompile" a name token into a kind of limited regular expression (no infinite closures) and then recompile the expression back into a list of other variants. Each variant in turn was supposed to "decompile" back into the same expression so a name could be part of a kind of closed algebra of names all with the same semantic meaning.

For example:

A Korean name like "Park" might turn into a {rule} that would also generate "Pak", "Paek", "Baek", etc.

Any one of those would also generate the same {rule}.

In practice it worked surprisingly well, and the struggle was mostly in identifying the areas where the precision/recall in this scheme caused the names to not form a closed algebra.
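A minimal sketch of the decompile/recompile idea, under heavy assumptions: the single rule below for the "Park" family is invented for illustration and is not the original ruleset, and the names `RULES`, `recompile`, `decompile` and `is_closed` are hypothetical. A rule here is a sequence of slots, each holding a finite set of alternatives, so there are no unbounded closures and every rule expands to a finite list of variants.

```python
from itertools import product

# Hypothetical rule: each slot is a finite set of alternatives.
PARK_RULE = (("P", "B"), ("a", "ae"), ("", "r"), ("k",))

# Rule id -> rule. A real system would hold many hand-built rules.
RULES = {"PARK": PARK_RULE}


def recompile(rule):
    """Expand a rule into every surface variant it licenses."""
    return {"".join(parts) for parts in product(*rule)}


def decompile(name):
    """Return the ids of all rules whose expansion contains the name."""
    return {rid for rid, rule in RULES.items() if name in recompile(rule)}


def is_closed(rule_id):
    """True if every variant of the rule maps back to exactly that rule."""
    return all(decompile(v) == {rule_id} for v in recompile(RULES[rule_id]))


print(sorted(recompile(PARK_RULE)))
print(decompile("Paek"), is_closed("PARK"))
```

The toy rule generates "Park", "Pak", "Paek" and "Baek", but also strings like "Bark", which shows the precision side of the trade-off. Once a second rule generates any of the same strings, `is_closed` returns False for both, which is roughly the failure of the closed algebra described above.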

Building the rules was an ungodly amount of human labor though, with expert linguists involved at every step.

These days I'm sure the problem would be approached in an entirely different way.

replies(1): >>42178485 #
7. thaumasiotes ◴[] No.42177511[source]
> For other languages I have exposure to, IPA seems to make more sense.

> But also, they have cleaner phonetics, cleaner orthography, and less regional variation of phonemes.

The first and last of those are essentially guaranteed to be false.

> Possibly I have a bias in that they're not my native language

The more likely bias is that you just don't know very much about those other languages.

replies(1): >>42177867 #
8. asveikau ◴[] No.42177867{3}[source]
You assume too much. I'm talking about languages I'm fluent in, can read and write, etc.

You'd have to be insane to think that, for example, IPA for Spanish isn't easier than IPA for English vowels. In contrast to English, most Spanish regional pronunciations are about consonants. And the orthography is very regular. If you give me the correct spelling of any word and a short description of where the speaker is from I can give you the IPA with remarkable accuracy, a task that would be very difficult for English.

replies(1): >>42180058 #
9. Funes- ◴[] No.42178237[source]
>Yes, but stay aware that IPA is for pronunciations. A word doesn't have unique pronunciation.

No. IPA encodes sounds based on various aspects of articulation. A word has unique phonemes (enclosed in forward slashes, //), but not necessarily unique sounds (allophones, enclosed in brackets, []).

10. rhdunn ◴[] No.42178485[source]
In English there was analysis by John Wells defining several lexical sets for vowels [1]. Various other linguists have extended that to cover other accents such as Scottish, Irish, and Welsh.

Words in the same lexical sets are the result of splits and mergers. Usually through processes like you describe -- e.g. the Southern English BATH vowel resulted from a lengthening of the TRAP vowel and then changing quality to that of the PALM vowel.

Lexical sets don't cover consonant changes between accents such as rhotic r, the /sh/-/sj/ merger in shore and sure, tapped t, glottalized t, etc.

And on the thing you are talking about, Colin Gorrie has some YouTube videos on doing that comparative linguistics and rule construction. A lot of his videos are doing that for conlangs, but there are videos with real historical accents in several languages.

An example in English is the shift in pronunciation of 's' before 'u' from the /sj/ glide to the /sh/ sibilant so that in accents with that shift <sure> and <shore> are homophones (especially with the CURE-FORCE merger).

There are computer programs that you can use to express these rules and see how the pronunciations change over time. I think Colin uses one in some of his videos and Biblaridion uses one in some of his conlang videos to check phonetic evolution of his conlangs.
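As a rough sketch of what such a sound-change program does, the snippet below applies ordered rewrite rules to broad transcriptions. The two rules and the starting forms are simplified assumptions for illustration (a /sj/ to /ʃ/ change, a CURE-FORCE merger written as /ʊə/ to /ɔː/, and a non-rhotic accent); they are not taken from any particular tool.

```python
import re

# Ordered (pattern, replacement) rules over broad IPA strings.
# Both rules are simplified assumptions, not a full description of any accent.
RULES = [
    (r"sj", "ʃ"),   # /sj/ glide becomes the /ʃ/ sibilant
    (r"ʊə", "ɔː"),  # CURE-FORCE merger
]


def apply_rules(form, rules=RULES):
    """Apply each rule in order and return every intermediate stage."""
    stages = [form]
    for pattern, repl in rules:
        form = re.sub(pattern, repl, form)
        stages.append(form)
    return stages


# Assumed starting forms for a non-rhotic accent.
print("sure: ", " > ".join(apply_rules("sjʊə")))
print("shore:", " > ".join(apply_rules("ʃɔː")))
```

With both rules applied the two words end up with the same form, and dropping the second rule keeps them distinct. Rule order matters in general, which is why these tools let you reorder rules and rerun them.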

[1] https://en.wikipedia.org/wiki/Lexical_set

replies(1): >>42180183 #
11. thaumasiotes ◴[] No.42180058{4}[source]
> You assume too much. I'm talking about languages I'm fluent in, can read and write, etc.

None of that takes anything away from what I said. The more likely bias is that you don't know much about the languages you're referring to. Whether you can read or write them doesn't even speak to these questions.

"Cleaner phonetics" doesn't have a meaning. And the idea that there's less regional variation in Spanish than English is not plausible.

> If you give me the correct spelling of any word and a short description of where the speaker is from I can give you the IPA with remarkable accuracy, a task that would be very difficult for English.

Do you usually notate that [d] and [t] in standard Spanish are generally dental rather than alveolar?

IPA is almost never used with a concern for the phonetic accuracy of the symbols. It's almost always used to indicate phonemes. You can even read John Wells arguing vehemently that correct IPA for English should use the symbols "e" and "r" because those are familiar to people who use the English alphabet.

replies(1): >>42180797 #
12. bane ◴[] No.42180183{3}[source]
Great response!

I forgot to mention that we struggled much more with cognate names that were more linguistically distant than with ones that were closer, e.g. Matthew and Mattieu were likely to be within the same lexical set, but Matityahu may have been a bit too far for us.

It's interesting how some names tend to have more conserved features than others as they transit across larger distances in terms of language families. I worked later in genetics and was able to reapply many of the learnings from names to gene sequences.

I used to test our own software with my given name. It's rather common, but our ruleset would produce some rather wild variations of it. We thought it was an error, but it turned out to be a completely valid name in Finnish!

13. asveikau ◴[] No.42180797{5}[source]
> And the idea that there's less regional variation in Spanish than English is not plausible.

I didn't say there was not variation. I said there was less of a specific type. Specifically there are not phonemic vowel variations as in English. In terms of consonants, some regions have extra consonant phonemes such as /θ/ or /ʎ/. Some have a tendency to omit some consonants in some positions. Every other difference is around which allophones get expressed preferentially.

> Do you usually notate that [d] and [t] in standard Spanish are generally dental rather than alveolar?

Typically I've seen it notated that way between [] but not between //. It's an articulation detail rather than phonemic.

This is honestly a not very intelligent point, you're just doing trivia on Spanish phonetics now. I know these things too. They're not at all relevant to my point.

replies(1): >>42181114 #
14. thaumasiotes ◴[] No.42181114{6}[source]
> I didn't say there was not variation. I said there was less of a specific type. Specifically there are not phonemic vowel variations as in English.

Here are your exact words:

>> But also, they have cleaner phonetics, cleaner orthography, and less regional variation of phonemes.

>> You'd have to be insane to think that, for example, IPA for Spanish isn't easier than IPA for English vowels.

If you don't want to defend what you've said... don't just pretend that you said something completely different.

I might also ask whether you're sure that the different varieties of Spanish actually exhibit less variation in their vowels than the varieties of English do, as opposed to the impact of this variation being muted by the much smaller count of vowel phonemes.

As for this:

> I've always found IPA to be deeply confusing for English, because different accents have different historical vowel mergers, so I am never sure about vowels.

> If you give me the correct spelling of any word and a short description of where the speaker is from I can give you the IPA with remarkable accuracy, a task that would be very difficult for English.

Those two claims directly conflict with each other. The second one is more correct, in its first half. If you know your target dialect, you can produce conventional IPA for any given word. The orthography of English usually makes this easier by preserving information about the historical pronunciation of the word. If what you want is to produce IPA for an English word without knowing the dialect it's going to be pronounced in, that's no more possible in Spanish than it is in English, and you've already noted this fact.

> This is honestly a not very intelligent point, you're just doing trivia on Spanish phonetics now. I know these things too. They're not at all relevant to my point.

What point? What do you think you're complaining about, if not trivia on English phonetics?

Try articulating an actual problem with the use of IPA for English that, in your opinion, doesn't occur in every other language.

> Specifically there are not phonemic vowel variations [in Spanish] as in English. In terms of consonants, some regions have extra consonant phonemes such as /θ/ or /ʎ/. Some have a tendency to omit some consonants in some positions. Every other difference is around which allophones get expressed preferentially.

Going purely from Wikipedia...

> For those areas of southeastern Spain where the deletion of final /s/ is complete, and where the distinction between singular and plural of nouns depends entirely on vowel quality, it has been argued that a set of phonemic splits has occurred, resulting in a system with eight vowel phonemes in place of the standard five.

replies(1): >>42183930 #
15. asveikau ◴[] No.42183930{7}[source]
You are just arguing to argue, man. My words are NOT inconsistent, the inconsistency is you and you reading them in a combative fashion. You don't actually know what you are talking about and you project your own lack of knowledge onto me.

Yes, in Andalucía the vowel that precedes an aspirated /s/ changes to a different allophone of the vowel. It doesn't cease being an allophone of the vowel. If you ask a speaker who aspirates or omits their /s/ they'd say there's an /s/ there. That's why the /s/ can fully re-emerge if there's a vowel after it. It's not a phonemic difference, it's more like the /s/ is difficult for them to articulate in that position and that fact sometimes bleeds into the vowel, similar to /r/ for UK speakers of English. I think most dialects of Spanish do something like this with /s/ in that position, it's just a lot more frequent in Andalucía or the Caribbean and a few others.

I came close to mentioning this exact phenomenon but I didn't want to lengthen my comment on really "in the weeds" shit that isn't very relevant.

There's also the fact that in northern Mexico, I've heard the allophones they select for vowels are pretty different from most of the rest of the Spanish speaking world. I didn't mention it because I already said ... No phonemic difference.