Show HN: Hacker News em dash user leaderboard pre-ChatGPT

1. maaaaattttt ◴[30 Aug 25 10:17 UTC] No.45073453[source]▶

I think this whole em dash topic should lead to some deeper (though not very deep) conversations:

* If it was not widely used before, where/how did (Chat)GPT pick it up?

    * If it was widely used, then it shouldn't be a topic at all. But, there seems to be informal agreement that it wasn’t widely used.
    
    * Or could GPT have inferred that, even though it's not widely used, using it is the better way to go? That then makes one wonder about the whole probability-of-next-token idea. Maybe this line of thinking falls short of what might really be going on internally.

* If it had picked up something that is widely used but in the wrong way, it should make us pause (again) about the future feedback loops these LLMs, which aren't going away, are already creating. Not just in terms of grammar and spelling, but also in terms of ways of thinking and seeing the world.

(edit: formatting)

replies(3): >>45073476 #>>45073485 #>>45073747 #

2. msgodel ◴[30 Aug 25 10:22 UTC] No.45073476[source]▶

>>45073453 (TP) #

It's used a lot in formal writing (academic papers, books, etc.), which is probably a large portion of ChatGPT's training data. If the RLHF was done by professional writers, then it was probably additionally biased toward using em dashes.

People are more casual on the web. It's sort of like how people can often tell when it's me in IM without my name because I properly use periods while that's unusual in that medium. ChatGPT is so correct it feels robotic.

replies(1): >>45073736 #

3. throwaway89201 ◴[30 Aug 25 10:24 UTC] No.45073485[source]▶

>>45073453 (TP) #

The training sets of most LLMs contain a copious amount of content from Libgen (or now: Anna's Archive), where em dashes are frequently used in literary writing.

replies(1): >>45078141 #

4. maaaaattttt ◴[30 Aug 25 11:23 UTC] No.45073736[source]▶

>>45073476 #

It’s the most likely explanation, I believe. I have no idea about the content distribution of the training data, but I would have assumed Twitter and Reddit content would completely dwarf the literary content. Somewhat good if that’s indeed not the case!

5. Hilift ◴[30 Aug 25 11:26 UTC] No.45073747[source]▶

>>45073453 (TP) #

It isn't about wide use. It is about a character that almost no one enters explicitly. Nearly all usages are copy-paste, or inadvertent/unintended conversion by an application such as Microsoft Word that converts regular quotes to smart quotes, etc. In that respect, we see that an AI is performing identically to a real human. An AI does not, and most likely would not, see a purpose in adding an em or en dash to any text, unless it was an article about em or en dashes, or it knew the person it was speaking with uses en or em dashes.

6. nullc ◴[30 Aug 25 21:24 UTC] No.45078141[source]▶

>>45073485 #

Who the hell knows how the initial biases of LLMs broke.

My IRC name (gmaxwell) is a token in the GPT-3 tokenizer.