358 points tkgally | 2 comments

The use of the em dash (—) now raises suspicions that a text might have been AI-generated. Inspired by a suggestion from dang [1], I created a leaderboard of HN users according to how many of their posts before November 30, 2022—that is, before the release of ChatGPT—contained em dashes. Dang himself comes in number 2—by a very slim margin.

Credit to Claude Code for showing me how to search the HN database through Google BigQuery and for writing the HTML for the leaderboard.

[1] https://news.ycombinator.com/item?id=45053933
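The counting step behind such a leaderboard is simple once the posts are exported. Below is a minimal stdlib-only sketch of the logic, assuming hypothetical `(author, text, timestamp)` rows stand in for the BigQuery results; the function name `em_dash_leaderboard` and the sample data are illustrative, not from the original.

```python
from collections import Counter
from datetime import datetime, timezone

# Cutoff: the release of ChatGPT (November 30, 2022)
CUTOFF = datetime(2022, 11, 30, tzinfo=timezone.utc)

def em_dash_leaderboard(posts):
    """posts: iterable of (author, text, timestamp) tuples.

    Returns authors ranked by how many of their pre-cutoff posts
    contain at least one em dash.
    """
    counts = Counter()
    for author, text, ts in posts:
        if ts < CUTOFF and "\u2014" in text:  # "\u2014" is the em dash
            counts[author] += 1
    return counts.most_common()

# Hypothetical sample rows, for illustration only
sample = [
    ("alice", "Well\u2014actually, it depends.",
     datetime(2021, 5, 1, tzinfo=timezone.utc)),
    ("bob", "No dashes here.",
     datetime(2021, 6, 1, tzinfo=timezone.utc)),
    ("alice", "Another aside\u2014again.",       # after the cutoff, excluded
     datetime(2023, 1, 1, tzinfo=timezone.utc)),
]
print(em_dash_leaderboard(sample))  # → [('alice', 1)]
```

The date filter is what makes the ranking meaningful: only posts written before ChatGPT existed are counted, so the result cannot be confounded by AI-generated text.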

maaaaattttt No.45073453[source]
I think this whole em dash topic should lead to some deeper (though not very deep) conversations:

* If it was not widely used before, where/how did (Chat)GPT pick it up?

    * If it was widely used, then it shouldn't be a topic at all. But there seems to be informal agreement that it wasn't widely used.

    * Or could GPT have inferred that, even though it's not widely used, it's the better way to go? That makes one wonder about the whole probability-of-next-token idea. Maybe this line of thinking falls too short of what might really be going on internally.

* If it had picked up something that is widely used but in the wrong way, it should make us pause (again) about the future feedback loops these LLMs, which aren't going away, are already creating. Not just in terms of grammar and spelling, but also in terms of ways of thinking and seeing the world.
(edit: formatting)
replies(3): >>45073476 #>>45073485 #>>45073747 #
1. throwaway89201 No.45073485[source]
The training sets of most LLMs contain a copious amount of content from Libgen (or now: Anna's Archive), where em dashes are frequently used in literary writing.
replies(1): >>45078141 #
2. nullc No.45078141[source]
Who the hell knows how the initial biases of LLMs arose.

My IRC name (gmaxwell) is a token in the GPT3 tokenizer.