
Unparalleled Misalignments

(rickiheicklen.com)
149 points by ChadNauseam | 2 comments
gwern:
An interesting ML exercise (possible class project!) would be to try to automate this. A bigram corpus combined with word embeddings and an NSFW text classifier, maybe? Your usual word embedding might not work because the point is the multiple meanings, so maybe the twist would be that you need a polysemous word embedding with multiple vectors or something like that, so it's not just an off-the-shelf word2vec...
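
A minimal sketch of the bigram-mining step in that idea, using WordNet synset counts as a crude stand-in for a true multi-sense embedding (the corpus, thresholds, and function names are illustrative assumptions, and the NSFW classifier is omitted):

    # Mine bigrams and keep those whose words are highly polysemous
    # (many WordNet senses) -- a rough proxy for "multiple meanings".
    from collections import Counter
    from nltk.corpus import wordnet as wn  # needs nltk.download('wordnet')

    def polysemy(word):
        """Number of WordNet senses the word has."""
        return len(wn.synsets(word))

    def candidate_bigrams(tokens, min_count=3, min_senses=4):
        counts = Counter(zip(tokens, tokens[1:]))
        return [(w1, w2) for (w1, w2), n in counts.items()
                if n >= min_count
                and polysemy(w1) >= min_senses
                and polysemy(w2) >= min_senses]

    tokens = "the hot dog stand near the cold shoulder press".split()
    print(candidate_bigrams(tokens, min_count=1))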
myflash13:
You don’t even need embeddings or ML. A simple search across dictionaries, thesauruses, and the Wikipedia entry list (with disambiguations) should be enough.

- Find all two-word phrases and compound words.

- Search across all pairwise combinations of mutual synonyms, and determine whether the compound synonym is itself a word or phrase (see the sketch below).

https://chatgpt.com/share/6818d11d-f444-800a-96b0-7a932e9213...
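
A minimal sketch of this thesaurus-lookup approach, using WordNet as both the thesaurus and the phrase dictionary (its multi-word entries are stored with underscores); the seed phrase and function names are illustrative assumptions:

    # For a two-word phrase, substitute synonyms for each word and keep
    # the combinations that are themselves dictionary phrases.
    from itertools import product
    from nltk.corpus import wordnet as wn  # needs nltk.download('wordnet')

    def synonyms(word):
        """Single-word lemma names across every sense of `word`."""
        return {lemma.name() for syn in wn.synsets(word)
                for lemma in syn.lemmas() if "_" not in lemma.name()}

    def misalignment_candidates(w1, w2):
        """Synonym pairs of (w1, w2) that are WordNet phrases themselves."""
        for s1, s2 in product(synonyms(w1) - {w1}, synonyms(w2) - {w2}):
            if wn.synsets(f"{s1}_{s2}"):  # the phrase has its own entry
                yield f"{s1} {s2}"

    print(list(misalignment_candidates("hot", "dog")))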

gwern:
That would be a good baseline. Maybe the ML part would be for ranking them? Because I expect you would drown in matches, and the hard part becomes finding the good ones.
myflash13:
I expect the opposite. I would expect an ML/embedding approach to find lots of false positives, because lots of words have close embeddings but are not synonyms. A strict thesaurus lookup should produce fewer matches. As for ranking the good ones, an embedder might help with that, but then we need a definition of "good". I would argue the most conceptually "unrelated" matches are the "best" ones, so yes, an embedder could quickly rank candidates by vector distance, farthest apart first.
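
A minimal sketch of that ranking step, embedding each phrase as the mean of its word vectors and sorting by cosine distance, most unrelated first (the model choice and candidate pairs are illustrative assumptions):

    # Rank candidate pairs by conceptual unrelatedness: 1 - cosine
    # similarity between mean word vectors, highest first.
    import numpy as np
    import gensim.downloader as api

    kv = api.load("glove-wiki-gigaword-50")  # small pretrained vectors

    def phrase_vec(phrase):
        return np.mean([kv[w] for w in phrase.split() if w in kv], axis=0)

    def unrelatedness(a, b):
        u, v = phrase_vec(a), phrase_vec(b)
        return 1 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    pairs = [("hot dog", "warm canine"), ("cold shoulder", "cool blade")]
    for a, b in sorted(pairs, key=lambda p: -unrelatedness(*p)):
        print(f"{unrelatedness(a, b):.3f}  {a} / {b}")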