
    358 points tkgally | 12 comments

    The use of the em dash (—) now raises suspicions that a text might have been AI-generated. Inspired by a suggestion from dang [1], I created a leaderboard of HN users according to how many of their posts before November 30, 2022—that is, before the release of ChatGPT—contained em dashes. Dang himself comes in number 2—by a very slim margin.

    Credit to Claude Code for showing me how to search the HN database through Google BigQuery and for writing the HTML for the leaderboard.

    [1] https://news.ycombinator.com/item?id=45053933

    Symbiote ◴[] No.45072937[source]
    Using the HN public dataset in Google BigQuery [0], which I think fits easily within the free query quota:

      -- Per year: comments containing an em dash, all comments, and the ratio of the two.
      SELECT
        EXTRACT(YEAR FROM timestamp) AS year,
        SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) AS withDash,
        COUNT(*) AS total,
        SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) / COUNT(*) AS fraction
      FROM `bigquery-public-data.hacker_news.full`
      WHERE type = 'comment'
      GROUP BY year
      ORDER BY year;
    
      year with—   total  frac
      2006     0      12 0.000
      2007    13   70858 0.000
      2008   461  247922 0.001
      2009  1497  491034 0.003
      2010  3835  842438 0.005
      2011  4719 1044913 0.005
      2012  5648 1246782 0.005
      2013  7881 1665185 0.005
      2014  8400 1510814 0.006
      2015  9967 1642912 0.006
      2016 12081 2093612 0.006
      2017 14530 2361709 0.006
      2018 19246 2384086 0.008
      2019 23662 2755063 0.009
      2020 27316 3243173 0.008
      2021 32863 3765921 0.009
      2022 34657 4062159 0.009
      2023 36611 4221940 0.009
      2024 32543 3339861 0.010
      2025 30608 2231919 0.014
    
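    So there's definitely been an increase: the fraction held at about 0.009 from 2021 through 2023, then rose to 0.010 in 2024 and 0.014 in 2025.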
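
    As a rough check on the rounded fractions in the table, the ratios can be recomputed locally from the raw counts. This is only a sketch in Python: the `counts` dict and the `fraction` helper are illustrative names that do not appear in the query, and the numbers are copied from the last five rows of the table.

```python
# Counts copied from the table above: year -> (comments with an em dash, total comments).
counts = {
    2021: (32863, 3765921),
    2022: (34657, 4062159),
    2023: (36611, 4221940),
    2024: (32543, 3339861),
    2025: (30608, 2231919),
}


def fraction(year):
    """Share of that year's comments containing at least one em dash."""
    with_dash, total = counts[year]
    return with_dash / total


if __name__ == "__main__":
    # Print four decimal places, one more than the table shows.
    for year in sorted(counts):
        print(year, f"{fraction(year):.4f}")
```

    Printing an extra decimal place makes the year-to-year movement easier to see than the three-digit rounding in the table does.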

    Querying for the users who use "—" most as a proportion of all their comments:

      -- Per user: share of comments containing an em dash, before ChatGPT's release.
      SELECT
        `by`,
        SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) / COUNT(*) AS fraction,
        COUNT(*) AS total,
        MIN(timestamp) AS minTime,
        MAX(timestamp) AS maxTime
      FROM `bigquery-public-data.hacker_news.full`
      WHERE
        type = 'comment' AND
        timestamp < '2022-11-30'
      GROUP BY `by`
      -- Skip accounts with 100 or fewer comments so tiny samples don't dominate.
      HAVING COUNT(*) > 100
      ORDER BY fraction DESC
      LIMIT 250;
    
    zmgsabst uses them the most [1]; westoncb [2] is an older account that uses them fourth-most.

    [0] https://console.cloud.google.com/marketplace/product/y-combi...

    [1] https://news.ycombinator.com/threads?id=zmgsabst

    [2] https://news.ycombinator.com/threads?id=westoncb

    replies(2): >>45072984 #>>45076079 #
    1. LeoPanthera ◴[] No.45072984[source]
    I took a peek at zmgsabst's comments, but they use them with spaces around the dash — like this.

    ChatGPT always uses them without spaces—like this.

    replies(4): >>45073038 #>>45073265 #>>45078138 #>>45078913 #
    2. Symbiote ◴[] No.45073038[source]
    Changing the filter to

      text LIKE '%—%' AND text NOT LIKE '% —%' AND text NOT LIKE '%— %'
    
    puts westoncb in the lead, followed by mucholove, trebbble, _zzaw and lexcorvus.
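
    For anyone who wants to try the same filter on a local string rather than in BigQuery, here is a minimal Python sketch of the three `LIKE` conditions. The function name `has_only_unspaced_em_dashes` is made up for illustration, and the sketch ignores SQL NULL handling.

```python
EM_DASH = "\u2014"  # the em dash character matched by the LIKE patterns


def has_only_unspaced_em_dashes(text):
    """Mirror the SQL filter: the text contains an em dash, and no em dash
    has a space directly before or directly after it."""
    return (
        EM_DASH in text
        and (" " + EM_DASH) not in text
        and (EM_DASH + " ") not in text
    )


if __name__ == "__main__":
    print(has_only_unspaced_em_dashes("without spaces" + EM_DASH + "like this"))
    print(has_only_unspaced_em_dashes("with spaces " + EM_DASH + " like this"))
```

    Like the SQL version, this rejects a comment as soon as any one of its em dashes has a space next to it, even if the others do not.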
    replies(1): >>45075033 #
    3. indigodaddy ◴[] No.45073265[source]
    I always thought the proper usage was no space before but one space after-- like this.
    replies(1): >>45074803 #
    4. wizzwizz4 ◴[] No.45074803[source]
    There's no "proper usage" for any feature of English: it's all by consensus. However, I have seen that in published books from the 1900s.
    5. westoncb ◴[] No.45075033[source]
    I actually tweeted like a month ago that I was the reason LLMs use em dashes so much lol: https://x.com/Westoncb/status/1961802304698671407
    replies(1): >>45078912 #
    6. eMPee584 ◴[] No.45078138[source]
    & it looks awful without spaces — imho
    replies(2): >>45079372 #>>45080292 #
    7. JdeBP ◴[] No.45078912{3}[source]
    There are quite a few &mdash;es on my WWW site and on StackExchange thanks to me; and I vaguely recall that I might even have written one on Wikipedia once. But I am quite happy for you to take the blame for training the LLMs. (-:
    replies(1): >>45083218 #
    8. Rumudiez ◴[] No.45078913[source]
    The rule is spaces on both sides of an en dash – like so – or an em dash without any spaces—like this. It's important to note that the US keyboard layout does not have either of these or the minus glyph, just the hyphen, and it's inadvisable to mix multiple styles.
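
    To see which style a given text follows, each dash can be tallied by type and by the characters next to it. The following Python sketch is one way to do that, not an established tool: `classify_dashes` and its labels ("spaced", "unspaced", "mixed") are hypothetical names, and it treats every en dash the same way, including ones in number ranges.

```python
import re
from collections import Counter

DASH_NAMES = {"\u2013": "en", "\u2014": "em"}


def classify_dashes(text):
    """Count each en/em dash by whether it has a space on both sides
    ("spaced"), on neither side ("unspaced"), or on one side only ("mixed")."""
    counts = Counter()
    for match in re.finditer("[\u2013\u2014]", text):
        i = match.start()
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        if before == " " and after == " ":
            style = "spaced"
        elif before != " " and after != " ":
            style = "unspaced"
        else:
            style = "mixed"
        counts[style + " " + DASH_NAMES[match.group()]] += 1
    return counts


if __name__ == "__main__":
    sample = "an en dash \u2013 like so \u2013 or an em dash\u2014like this"
    print(classify_dashes(sample))
```

    Under the rule as stated, a text that follows it would produce only "spaced en" and "unspaced em" entries.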
    9. JKCalhoun ◴[] No.45079372[source]
    Which is what I do (add a space before and after). I didn't know you weren't supposed to put the spaces until someone pointed it out to me — suggested I was not an LLM because I added the spaces.

    Makes me wonder whether, if kerning were done correctly, the em dash would look like it had spaces before and after even when it did not.

    replies(1): >>45080989 #
    10. colanderman ◴[] No.45080292[source]
    The common guidance I've seen is en dash with spaces, em dash without.
    11. card_zero ◴[] No.45080989{3}[source]
    Not at all, no. Here are a few historical examples:

    1903 edition of The Wizard of Oz — https://archive.org/details/newwizardofoz00baum/page/2/mode/...

    A page from Life magazine, 1894 — https://archive.org/details/sim_life_1894-08-23_24_608/page/...

    The Illustrated London News, 1843 — https://archive.org/details/illustrated-london-news-v002-184...

    The em dash pretty much just joins the two glyphs together. It's supposed to look that way.

    12. westoncb ◴[] No.45083218{4}[source]
    lol no problem. In reality though there's kind of a funny story behind it because I suspect the way I ended up using them so much is similar to how ChatGPT did. When I got into writing I studied grammar, then decided to read a bunch of classics and analyze their usage of punctuation in general until I had a good understanding of every bit of it. Then, in order to practice, I'd apply what I learned to anything I was writing at the time whether journal notes, conversations on AIM/IRC etc. That latter step meant I was translating a lot of casual/natural speech into a form that also had a high level of 'correctness'. And if you faithfully translate natural speech into 'correct'ly punctuated sentences, you end up using a lot of em dashes. Because ChatGPT/LLMs are tuned for natural/authentic style, as well as for a high degree of 'correctness,' you get today's state of affairs. Just a theory.