
Phonetic Matching (smoores.dev)
77 points by raybb
asveikau ◴[] No.42172434[source]
The idea that "shore" and "sure" are pronounced "almost identically" would depend pretty heavily on your accent. The vowel is pretty different to me.

Also, the matches for "sorI" and "sorY" seem to misinterpret the words as ending in a pronounced vowel rather than a silent one. If you're using data meant for foreign surnames -- where the rules may differ from English, and silent vowels may be very rare depending on the original language -- then of course you may mispronounce English words like this, saying both "shore" and "sure" as "sore-ee".

I'm sure there are much better ways to transcribe orthography to phonetics; people have probably published libraries that do it. From some googling, it seems like some people call this type of library a phonemic transcriber or IPA transcriber.
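
A minimal sketch of what such a lookup might look like, using a couple of hand-copied CMUdict-style ARPAbet entries (General American). The tiny table is just for illustration; a real transcriber loads the full dictionary and falls back to letter-to-sound rules for unknown words:

    # Tiny dictionary-based grapheme-to-phoneme lookup (ARPAbet).
    ARPABET = {
        "shore": ["SH", "AO1", "R"],
        "sure":  ["SH", "UH1", "R"],
        "sore":  ["S", "AO1", "R"],
    }

    def transcribe(word):
        return ARPABET.get(word.lower())

    # In this accent model the vowels differ (AO vs UH), i.e. "shore"
    # and "sure" are not homophones here.
    print(transcribe("shore"), transcribe("sure"))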

replies(5): >>42172850 #>>42173496 #>>42177414 #>>42179389 #>>42180312 #
woodrowbarlow ◴[] No.42173496[source]
IPA is the tool most widely used by linguistics researchers to encode pronunciation in a standardized way. IPA is criticized for being a bit anglo-centric and falls short for some languages and edge cases, but overall it performs pretty well. (Learned from an ex who studies linguistics.)
replies(4): >>42173671 #>>42174382 #>>42174781 #>>42177483 #
bane ◴[] No.42177483[source]
This is sort of the inverse of the problem IPA is trying to solve. You're correct that IPA is used to encode pronunciation. But phonetic matching is trying to handle the cases where different people, in different accents (maybe different languages), say or write semantically the same thing in different forms -- and you need to find all of the variants starting from just one of them, without also matching things that are unrelated or irrelevant.

Basically it's trying to smush all the different versions together into a single sort of cluster, where the identity of the cluster is any of the versions.
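
A classic, crude way to do that smushing is a phonetic key like Soundex, where every member of a cluster hashes to the same code (this is not the scheme described below, just the simplest illustration of the idea):

    # Simplified Soundex: first letter kept, remaining consonants
    # mapped to digit classes, vowels dropped, repeats collapsed.
    SOUNDEX = {c: d for d, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r"}.items() for c in letters}

    def soundex(name):
        name = name.lower()
        key = name[0].upper()
        prev = SOUNDEX.get(name[0], "")
        for c in name[1:]:
            d = SOUNDEX.get(c, "")
            if d and d != prev:
                key += d
            if c not in "hw":   # h/w don't reset the previous code
                prev = d
        return (key + "000")[:4]

    # "Smith" and "Smyth" collapse to the same cluster key, S530.
    print(soundex("Smith"), soundex("Smyth"))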

I used to work in this field about 30 years ago, specifically on how names end up being Latinized when they come from non-Latin languages. We were very focused on collapsing variants into a complex ruleset that could both recognize a cluster of names as being the same "thing" and produce all the valid variants from it. It was very much a kind of applied "expert systems" approach that predated ML.

The rulesets were more or less context-free grammars and regular expressions that we could use to "decompile" a name token into a kind of limited regular expression (no infinite closures, i.e. no Kleene star) and then recompile that expression back into a list of other variants. Each variant was in turn supposed to "decompile" back into the same expression, so a name could be part of a kind of closed algebra of names all with the same semantic meaning.

For example:

A Korean name like "Park" might turn into a {rule} that would also generate "Pak", "Paek", "Baek", etc.

Any one of those would also generate the same {rule}.

In practice it worked surprisingly well, and the struggle was mostly in identifying the areas where the precision/recall in this scheme caused the names to not form a closed algebra.
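
A toy sketch of the decompile/recompile idea (nothing like the real ruleset, which was much richer; the PARK_RULE slots here are purely illustrative):

    from itertools import product

    # A "rule" is a list of slots; each slot lists the romanizations
    # treated as interchangeable for that part of the name.
    PARK_RULE = [["P", "B"], ["a", "ae"], ["", "r"], ["k"]]

    def expand(rule):
        """Recompile a rule into every variant it generates."""
        return {"".join(parts) for parts in product(*rule)}

    def decompile(name, rules):
        """Find the rule whose closure contains this name, if any."""
        for rule in rules:
            if name in expand(rule):
                return rule
        return None

    variants = expand(PARK_RULE)   # Park, Pak, Paek, Baek, Bark, ...
    # Every generated variant decompiles back to the same rule, which
    # is the "closed algebra" property described above.
    assert all(decompile(v, [PARK_RULE]) is PARK_RULE for v in variants)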

Building the rules was an ungodly amount of human labor, though, with expert linguists involved at every step.

These days I'm sure the problem would be approached in an entirely different way.

replies(1): >>42178485 #
rhdunn ◴[] No.42178485[source]
For English, John Wells published an analysis defining several lexical sets for vowels [1]. Various other linguists have extended them to cover other accents such as Scottish, Irish, and Welsh.

Words in the same lexical set are the result of splits and mergers, usually through processes like the ones you describe -- e.g. the Southern English BATH vowel resulted from a lengthening of the TRAP vowel, which then shifted in quality to that of the PALM vowel.

Lexical sets don't cover consonant changes between accents, such as rhoticity, the /sj/-/sh/ merger in shore and sure, tapped t, glottalized t, etc.
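
A toy illustration of how lexical sets abstract over accent-specific vowel qualities (the accent labels and vowel values here are rough and only for the example):

    # Map words to Wells lexical sets, then sets to a per-accent vowel.
    LEXICAL_SET = {"trap": "TRAP", "bath": "BATH", "palm": "PALM",
                   "cure": "CURE", "force": "FORCE"}

    VOWEL = {
        "GenAm": {"TRAP": "ae", "BATH": "ae", "PALM": "ah",
                  "CURE": "ur", "FORCE": "or"},
        "SSB":   {"TRAP": "a",  "BATH": "ah", "PALM": "ah",
                  "CURE": "or", "FORCE": "or"},  # CURE-FORCE merged
    }

    def vowel_of(word, accent):
        return VOWEL[accent][LEXICAL_SET[word]]

    # BATH patterns with TRAP in General American but with PALM in
    # Southern British English.
    print(vowel_of("bath", "GenAm"), vowel_of("bath", "SSB"))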

And on the thing you are talking about: Colin Gorrie has some YouTube videos on doing that kind of comparative linguistics and rule construction. A lot of his videos do it for conlangs, but there are also videos on real historical accents in several languages.

An example in English is the shift in pronunciation of 's' before 'u' from the /sj/ glide to the /sh/ sibilant so that in accents with that shift <sure> and <shore> are homophones (especially with the CURE-FORCE merger).

There are computer programs you can use to express these rules and see how pronunciations change over time. I think Colin uses one in some of his videos, and Biblaridion uses one in some of his conlang videos to check the phonetic evolution of his conlangs.
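
A minimal sketch of the kind of rule applier those programs implement: ordered rewrite rules over a rough phonemic string (the transcriptions here are ad hoc, not IPA):

    import re

    RULES = [
        (r"sj", "sh"),    # yod-coalescence: /sj/ -> /sh/
        (r"ur\b", "or"),  # CURE-FORCE merger on the final vowel
    ]

    def evolve(form, rules=RULES):
        # Apply each sound change in order to the phonemic form.
        for pattern, replacement in rules:
            form = re.sub(pattern, replacement, form)
        return form

    # With both changes applied, <sure> and <shore> end up identical.
    print(evolve("sjur"), evolve("shor"))   # -> shor shor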

[1] https://en.wikipedia.org/wiki/Lexical_set

replies(1): >>42180183 #
1. bane ◴[] No.42180183[source]
Great response!

I forgot to mention that we struggled much more with cognate names that were linguistically distant than with ones that were closer. E.g. Matthew and Mattieu were likely to be within the same lexical set, but Matityahu may have been a bit too far for us.
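
E.g. with plain Levenshtein distance, just as a rough proxy for that linguistic distance:

    def levenshtein(a, b):
        # Standard dynamic-programming edit distance, one row at a time.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    print(levenshtein("matthew", "mattieu"))    # 2
    print(levenshtein("matthew", "matityahu"))  # 5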

It's interesting how some names tend to have more conserved features than others as they travel across larger distances between language families. I worked later in genetics and was able to reapply much of what I learned from names to gene sequences.

I used to test our own software with my given name. It's rather common, but our ruleset would produce some rather wild variations of it. We thought it was an error, but it turned out to be a completely valid name in Finnish!