One way to start playing around with it is to put some stuff in a database: https://dev.mysql.com/doc/refman/8.4/en/string-functions.htm...
(or this module, https://www.postgresql.org/docs/current/fuzzystrmatch.html if you're stuck with PG)
Also, the matches for "sorI" and "sorY" seem to me to misinterpret the words as ending in a pronounced vowel rather than a silent one. If you're using data meant for foreign surnames, whose rules may differ from English and in which silent vowels might be very rare depending on the original language, then of course you may mispronounce English words like this, saying both "shore" and "sure" as "sore-ee".
I'm sure there are much better ways to transcribe orthography to phonetics; people have probably published libraries that do it. From some googling, it seems this type of library is called a phonemic transcriber or IPA transcriber.
Another example is poetry. A regex can find rhymes in Polish. Same postfix == it rhymes.
In English it's a feat of engineering.
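To illustrate the Polish case: a crude suffix match is already a decent rhyme proxy, since Polish orthography is close to phonetic and stress is (almost always) penultimate. The word list and suffix length below are made up for illustration:

```python
# Toy rhyme finder: in Polish, a shared ending is a reasonable rhyme proxy.
def rhymes(a: str, b: str, n: int = 3) -> bool:
    """Crude check: two distinct words 'rhyme' if their last n letters match."""
    return a != b and a[-n:] == b[-n:]

words = ["kotek", "płotek", "domek", "zamek"]
pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:] if rhymes(a, b)]
print(pairs)  # [('kotek', 'płotek'), ('domek', 'zamek')]
```

The same trick on English text would pair "though" with "rough", which is exactly the feat-of-engineering problem.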
This past Yom Kippur, my wife and I drove two hours to spend the afternoon at my aunt’s house, with my cousins. As the night drew on, conversation roamed from television shows and books to politics and philosophy. The circle grew as we touched on increasingly sensitive and challenging topics, drawing us in.
We didn’t agree, per se. We were engaging in debate as often as we were engaging in conversation. But we all love each other deeply, and the amount of care and restraint that went into how each person expressed their disagreement was palpable.
For an idea of how badly various accents can complicate recognition, see how Baltimoreans pronounce "Aaron Earned An Iron Urn": https://www.youtube.com/watch?v=Oj7a-p4psRA
(Yeah, it's labor-intensive, but probably no more so than, say, writing a dictionary.)
from phonemizer import phonemize

text = "hello world"
variations = [
    phonemize(text, backend="espeak", language="en-us", strip=True),
    phonemize(text, backend="espeak", language="en-gb", strip=True),
    phonemize(text, backend="espeak", language="en-au", strip=True),
]

for i, variation in enumerate(variations, 1):
    print(f"Variation {i}: {variation}")

I mean, espeak isn't the best, but a lot of folks in the ASR/speech world are still using it, right? (NB: If you are on iOS, check out the built-in one: Settings -> Accessibility -> Spoken Content -> Pronunciations. When adding one, it can phonemize your spoken message to IPA. If someone can tell me which SDK/API they use for that, I'd love to know.)
I suppose, if we ignore accents and heteronyms... both of which English is famous for, unfortunately!
For other languages I have exposure to, IPA seems to make more sense. Possibly I have a bias in that they're not my native language, so I can analyze them instead of internalizing them. But also, they have cleaner phonetics, cleaner orthography, and less regional variation of phonemes.
Not sure how it will perform on something long like an audiobook.
My high school English teacher was horrified when she figured out why us boys were laughing when reading her copy of the First Folio: our hick accent meant we were getting some of the jokes she didn't even notice.
Theme rhyming with sixteen in the Cranberries' song "Zombie" is another.
Other languages have this even worse. Try comparing Egyptian Colloquial Arabic vs literary Arabic. I mean... these are different languages. Or, for instance, American Sign Language (ASL) vs. written English: the former is more like Chinese than English.
(we worked on all this in Motorola in 2001 extensively....then they dropped it)
That's not the similarity the author is trying to point out. The idea is that the spelling is a lot more different than the pronunciation is, and that's true. The pronunciations are as similar as it's possible to be, measured by substitution count, without actually being identical. (You could use a measure of phonetic similarity, in which case e.g. fought and thought would be much more similar than fought and caught, but he's not doing that either.)
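A sketch of that substitution-count idea, using plain Levenshtein distance on both the spellings and rough IPA transcriptions (the transcriptions here are my own assumptions, not the author's):

```python
# Levenshtein edit distance: minimum insertions, deletions, and
# substitutions needed to turn string a into string b.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein("shore", "sure"))  # 2 edits on a 4-5 letter spelling
print(levenshtein("ʃɔɹ", "ʃʊɹ"))     # 1 edit on the pronunciation
```

The spellings differ by two edits while the pronunciations differ by a single substitution, which is the gap the author is pointing at.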
The pronunciation of sure comes from (1) the old, dead idea that the letter u should be pronounced /ju/ rather than /u/ (compare cure); and (2) the still vital English reduction of /sj/ to /ʃ/. Shore has to indicate the same sound in a radically different way, since it doesn't have and never had a medial /j/ to transform a bare s.
Basically it's trying to smush all the different versions together into a single sort of cluster, where the identity of the cluster is any of the versions.
I used to work in this field about 30 years ago, specifically how names can end up being latinized when coming from non-latin languages. We were very focused on trying to collapse variants into a complex ruleset that could be used both to recognize the cluster of names as being the same "thing", and then that ruleset could also produce all the valid variants. It was very much a kind of applied "expert systems" approach that predated ML.
The rulesets were more or less context free grammars and regular expressions that we could use to "decompile" a name token into a kind of limited regular expression (no infinite closures) and then recompile the expression back into a list of other variants. Each variant in turn was supposed to "decompile" back into the same expression so a name could be part of a kind of closed algebra of names all with the same semantic meaning.
For example:
A Korean name like "Park" might turn into a {rule} that would also generate "Pak", "Paek", "Baek", etc.
Any one of those would also generate the same {rule}.
In practice it worked surprisingly well, and the struggle was mostly in identifying the areas where the precision/recall in this scheme caused the names to not form a closed algebra.
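As a toy sketch of that decompile/recompile idea: a name collapses to a pattern of alternative segments, and the same pattern regenerates every variant. The segmentation and alternative sets below are invented for illustration and are not the actual rulesets:

```python
import itertools
import re

# Hypothetical {rule} for "Park": each slot lists interchangeable segments.
PATTERN = [("P", "B"), ("a", "ae"), ("rk", "k")]

def variants(pattern):
    """Recompile a pattern into the full set of name variants."""
    return {"".join(parts) for parts in itertools.product(*pattern)}

names = variants(PATTERN)
print(sorted(names))  # includes 'Park', 'Pak', 'Paek', 'Baek', ...

# Closure check: every generated variant "decompiles" back to (matches)
# the same pattern, so the set forms a closed algebra under the rule.
regex = "".join("(?:" + "|".join(alts) + ")" for alts in PATTERN)
assert all(re.fullmatch(regex, name) for name in names)
```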
Building the rules was an ungodly amount of human labor, though, with expert linguists involved at every step.
These days I'm sure the problem would be approached in an entirely different way.
> But also, they have cleaner phonetics, cleaner orthography, and less regional variation of phonemes.
The first and last of those are essentially guaranteed to be false.
> Possibly I have a bias in that they're not my native language
The more likely bias is that you just don't know very much about those other languages.
You'd have to be insane to think that, for example, IPA for Spanish isn't easier than IPA for English vowels. In contrast to English, most Spanish regional pronunciations are about consonants. And the orthography is very regular. If you give me the correct spelling of any word and a short description of where the speaker is from I can give you the IPA with remarkable accuracy, a task that would be very difficult for English.
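To illustrate how far a handful of rules gets you in Spanish, here's a deliberately naive grapheme-to-phoneme sketch for a peninsular accent. The rule list is incomplete and invented for illustration (it ignores stress, "gu", syllabification, and plenty else):

```python
import re

# Ordered rewrite rules: digraphs and context-sensitive cases first.
RULES = [
    (r"ch", "tʃ"),
    (r"ll", "ʝ"),
    (r"qu", "k"),
    (r"c(?=[ei])", "θ"),  # distinción: c before e/i
    (r"z", "θ"),
    (r"c", "k"),
    (r"ñ", "ɲ"),
    (r"g(?=[ei])", "x"),
    (r"j", "x"),
    (r"h", ""),           # silent h
    (r"v", "b"),
]

def to_ipa(word: str) -> str:
    for pattern, repl in RULES:
        word = re.sub(pattern, repl, word)
    return word

print(to_ipa("queso"))  # keso
print(to_ipa("cielo"))  # θielo
```

A comparably short rule list for English vowels is essentially impossible, which is the contrast being argued here.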
No. IPA encodes sounds based on various aspects of articulation. A word has unique phonemes (enclosed in forward slashes, //), but not necessarily unique sounds (allophones, enclosed in brackets, []).
Words in the same lexical sets are the result of splits and mergers. Usually through processes like you describe -- e.g. the Southern English BATH vowel resulted from a lengthening of the TRAP vowel and then changing quality to that of the PALM vowel.
Lexical sets don't cover consonant changes between accents such as rhotic r, the /sh/-/sj/ merger in shore and sure, tapped t, glottalized t, etc.
And on the thing you are talking about, Colin Gorrie has some YouTube videos on doing that comparative linguistics and rule construction. A lot of his videos are doing that for conlangs, but there are videos with real historical accents in several languages.
An example in English is the shift in pronunciation of 's' before 'u' from the /sj/ glide to the /sh/ sibilant so that in accents with that shift <sure> and <shore> are homophones (especially with the CURE-FORCE merger).
There are computer programs that you can use to express these rules and see how the pronunciations change over time. I think Colin uses one in some of his videos and Biblidarion uses one in some of his conlang videos to check phonetic evolution of his conlangs.
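A minimal sketch of what such a rule applier does, using the /sj/ > /ʃ/ shift from the example above (this is not any particular program's rule format, just ordered regex rewrites):

```python
import re

# Sound changes as ordered (pattern, replacement) rewrites over a transcription.
SOUND_CHANGES = [
    (r"sj", "ʃ"),  # the /sj/ > /ʃ/ shift, as in "sure"
]

def apply_changes(ipa: str) -> str:
    for pattern, repl in SOUND_CHANGES:
        ipa = re.sub(pattern, repl, ipa)
    return ipa

print(apply_changes("sjʊɹ"))  # ʃʊɹ
```

Real sound-change appliers add environments (the change applies only before or after certain sounds) and feature classes, but the core loop is this.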
Perhaps "sure" and "shore" was a bad example; it's tricky to come up with these! And you're right that the encodings that happen to overlap for those words are technically "incorrect" pronunciations; again, these Soundex-style encoders are designed for surnames, not general English words. Some Storyteller users are testing out a version of Storyteller using this encoder to see if it makes any improvements (so far it seems like it's not worse, but not necessarily better!), but I won't be surprised if it doesn't end up making it into Storyteller long term.
Mostly I wrote this piece not to advocate for using BMPM to support forced alignment, but as a way to express the emotional journey that I found myself on as I learned more about these systems and where they came from.
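For reference, even classic Soundex, the ancestor of these surname encoders, already collapses "sure" and "shore", because it keeps the first letter and drops vowels plus h, w, and y. A minimal sketch (this is plain Soundex, not BMPM, which is far more elaborate):

```python
def soundex(name: str) -> str:
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h and w don't break a run of identical codes
            prev = code
    return (out + "000")[:4]

print(soundex("shore"), soundex("sure"))  # S600 S600
```

Both encode to S600: the s/sh distinction and all the vowels vanish, which is fine for matching census surnames and dubious for general English.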
None of that takes anything away from what I said. The more likely bias is that you don't know much about the languages you're referring to. Whether you can read or write them doesn't even speak to these questions.
"Cleaner phonetics" doesn't have a meaning. And the idea that there's less regional variation in Spanish than English is not plausible.
> If you give me the correct spelling of any word and a short description of where the speaker is from I can give you the IPA with remarkable accuracy, a task that would be very difficult for English.
Do you usually notate that [d] and [t] in standard Spanish are generally dental rather than alveolar?
IPA is almost never used with a concern for the phonetic accuracy of the symbols. It's almost always used to indicate phonemes. You can even read John Wells arguing vehemently that correct IPA for English should use the symbols "e" and "r" because those are familiar to people who use the English alphabet.
I forgot to mention that we struggled much more with cognate names that were more linguistically distant than with ones that were closer. E.g. Matthew and Mattieu were likely to be within the same lexical set, but Matityahu may have been a bit too far for us.
It's interesting how some names tend to have more conserved features than others as they travel larger distances across language families. I later worked in genetics and was able to reapply many of the learnings from names to gene sequences.
I used to test our own software with my given name. It's rather common, but our ruleset would produce some rather wild variations of it. We thought it was an error, but it turned out to be a completely valid name in Finnish!
Having grown up reading and speaking (Indian) phonetic languages, I think English would've been much nicer and more uniform if the vowels sounded right, especially the long forms.
Extending the long forms using orthogonal vowels probably made it complex, especially with the lack of ii and uu.
Say, for instance, to extend the long form of "o", "a" was used, e.g. boat, goat. The correct spelling could've been boot, with the original boot spelled buut.
By that notion, door is probably the only word that's written and pronounced phonetically correctly, with its two o's.
Curious to know how such a correct phonetic spelling would aid encoding, matching, and compression.
I didn't say there was no variation. I said there was less of a specific type. Specifically there are not phonemic vowel variations as in English. In terms of consonants, some regions have extra consonant phonemes such as /θ/ or /ʎ/. Some have a tendency to omit some consonants in some positions. Every other difference is around which allophones get expressed preferentially.
> Do you usually notate that [d] and [t] in standard Spanish are generally dental rather than alveolar?
Typically I've seen it notated that way between [] but not between //. It's an articulation detail rather than phonemic.
This is honestly a not very intelligent point, you're just doing trivia on Spanish phonetics now. I know these things too. They're not at all relevant to my point.
Here are your exact words:
>> But also, they have cleaner phonetics, cleaner orthography, and less regional variation of phonemes.
>> You'd have to be insane to think that, for example, IPA for Spanish isn't easier than IPA for English vowels.
If you don't want to defend what you've said... don't just pretend that you said something completely different.
I might also ask whether you're sure that the different varieties of Spanish actually exhibit less variation in their vowels than the varieties of English do, as opposed to the impact of this variation being muted by the much smaller count of vowel phonemes.
As for this:
> I've always found IPA to be deeply confusing for English, because different accents have different historical vowel mergers, so I am never sure about vowels.
> If you give me the correct spelling of any word and a short description of where the speaker is from I can give you the IPA with remarkable accuracy, a task that would be very difficult for English.
Those two claims directly conflict with each other. The second one is more correct, in its first half. If you know your target dialect, you can produce conventional IPA for any given word. The orthography of English usually makes this easier by preserving information about the historical pronunciation of the word. If what you want is to produce IPA for an English word without knowing the dialect it's going to be pronounced in, that's no more possible in Spanish than it is in English, and you've already noted this fact.
> This is honestly a not very intelligent point, you're just doing trivia on Spanish phonetics now. I know these things too. They're not at all relevant to my point.
What point? What do you think you're complaining about, if not trivia on English phonetics?
Try articulating an actual problem with the use of IPA for English that, in your opinion, doesn't occur in every other language.
> Specifically there are not phonemic vowel variations [in Spanish] as in English. In terms of consonants, some regions have extra consonant phonemes such as /θ/ or /ʎ/. Some have a tendency to omit some consonants in some positions. Every other difference is around which allophones get expressed preferentially.
Going purely from Wikipedia...
> For those areas of southeastern Spain where the deletion of final /s/ is complete, and where the distinction between singular and plural of nouns depends entirely on vowel quality, it has been argued that a set of phonemic splits has occurred, resulting in a system with eight vowel phonemes in place of the standard five.
Yes, in Andalucía the vowel that precedes an aspirated /s/ changes to a different allophone of the vowel. It doesn't cease being an allophone of the vowel. If you ask a speaker who aspirates or omits their /s/ they'd say there's an /s/ there. That's why the /s/ can fully re-emerge if there's a vowel after it. It's not a phonemic difference, it's more like the /s/ is difficult for them to articulate in that position and that fact sometimes bleeds into the vowel, similar to /r/ for UK speakers of English. I think most dialects of Spanish do something like this with /s/ in that position, it's just a lot more frequent in Andalucía or the Caribbean and a few others.
I came close to mentioning this exact phenomenon but I didn't want to lengthen my comment on really "in the weeds" shit that isn't very relevant.
There's also the fact that in northern Mexico, I've heard the allophones they select for vowels are pretty different from most of the rest of the Spanish speaking world. I didn't mention it because I already said ... No phonemic difference.