
Unparalleled Misalignments

(rickiheicklen.com)
149 points by ChadNauseam | 2 comments
gwern:
An interesting ML exercise (possible class project!) would be to try to automate this. A bigram corpus combined with word embeddings and an NSFW text classifier, maybe? Your usual word embedding might not work because the point is the multiple meanings, so maybe the twist would be that you need a polysemous word embedding with multiple vectors or something like that, so it's not just an off-the-shelf word2vec...
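
A minimal sketch of the bigram-mining step in that idea, using WordNet synset counts as a crude stand-in for a true multi-sense embedding (the corpus, thresholds, and function names are illustrative assumptions, and the NSFW classifier is omitted):

    # Mine bigrams and keep those whose words are highly polysemous
    # (many WordNet senses) -- a rough proxy for "multiple meanings".
    from collections import Counter
    from nltk.corpus import wordnet as wn  # needs nltk.download('wordnet')

    def polysemy(word):
        """Number of WordNet senses the word has."""
        return len(wn.synsets(word))

    def candidate_bigrams(tokens, min_count=3, min_senses=4):
        counts = Counter(zip(tokens, tokens[1:]))
        return [(w1, w2) for (w1, w2), n in counts.items()
                if n >= min_count
                and polysemy(w1) >= min_senses
                and polysemy(w2) >= min_senses]

    tokens = "the hot dog stand near the cold shoulder press".split()
    print(candidate_bigrams(tokens, min_count=1))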
myflash13:
You don’t even need embeddings or ML. A simple search across dictionaries, thesauruses, and the Wikipedia entry list (with disambiguations) should be enough.

- Find all two-word phrases and compound words.

- Search across all pairwise combinations of mutual synonyms, and determine whether the compound synonym is itself a word or phrase (see the sketch below).

https://chatgpt.com/share/6818d11d-f444-800a-96b0-7a932e9213...
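
A minimal sketch of this thesaurus-lookup approach, using WordNet as both the thesaurus and the phrase dictionary (its multi-word entries are stored with underscores); the seed phrase and function names are illustrative assumptions:

    # For a two-word phrase, substitute synonyms for each word and keep
    # the combinations that are themselves dictionary phrases.
    from itertools import product
    from nltk.corpus import wordnet as wn  # needs nltk.download('wordnet')

    def synonyms(word):
        """Single-word lemma names across every sense of `word`."""
        return {lemma.name() for syn in wn.synsets(word)
                for lemma in syn.lemmas() if "_" not in lemma.name()}

    def misalignment_candidates(w1, w2):
        """Synonym pairs of (w1, w2) that are WordNet phrases themselves."""
        for s1, s2 in product(synonyms(w1) - {w1}, synonyms(w2) - {w2}):
            if wn.synsets(f"{s1}_{s2}"):  # the phrase has its own entry
                yield f"{s1} {s2}"

    print(list(misalignment_candidates("hot", "dog")))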

gwern:
That would be a good baseline. Maybe the ML part would be for ranking them? Because I expect you would drown in matches, and the hard part becomes finding the good ones.
myflash13:
I expect the opposite. I would expect an ML/embedding approach to find lots of false positives, because lots of words have close embeddings but are not synonyms. A strict thesaurus lookup should produce fewer matches. As for ranking the good ones, an embedder might help with that, but then we need a definition of "good". I would argue the most conceptually "unrelated" matches are the "best" ones, so yes, an embedder could quickly rank candidates by vector distance, farthest apart first.
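
A minimal sketch of that ranking step, embedding each phrase as the mean of its word vectors and sorting by cosine distance, most unrelated first (the model choice and candidate pairs are illustrative assumptions):

    # Rank candidate pairs by conceptual unrelatedness: 1 - cosine
    # similarity between mean word vectors, highest first.
    import numpy as np
    import gensim.downloader as api

    kv = api.load("glove-wiki-gigaword-50")  # small pretrained vectors

    def phrase_vec(phrase):
        return np.mean([kv[w] for w in phrase.split() if w in kv], axis=0)

    def unrelatedness(a, b):
        u, v = phrase_vec(a), phrase_vec(b)
        return 1 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    pairs = [("hot dog", "warm canine"), ("cold shoulder", "cool blade")]
    for a, b in sorted(pairs, key=lambda p: -unrelatedness(*p)):
        print(f"{unrelatedness(a, b):.3f}  {a} / {b}")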