    358 points | tkgally | 14 comments

    The use of the em dash (—) now raises suspicions that a text might have been AI-generated. Inspired by a suggestion from dang [1], I created a leaderboard of HN users according to how many of their posts before November 30, 2022—that is, before the release of ChatGPT—contained em dashes. Dang himself comes in number 2—by a very slim margin.

    Credit to Claude Code for showing me how to search the HN database through Google BigQuery and for writing the HTML for the leaderboard.

    [1] https://news.ycombinator.com/item?id=45053933

    Symbiote ◴[] No.45072937[source]
    Using the HN public dataset in Google BigQuery [0], which I think fits easily within the free query quota:

      -- Per year: comments containing at least one em dash, total comments, and the ratio.
      SELECT
        EXTRACT(YEAR FROM timestamp) AS year,
        COUNTIF(text LIKE '%—%') AS withDash,
        COUNT(*) AS total,
        COUNTIF(text LIKE '%—%') / COUNT(*) AS fraction
      FROM `bigquery-public-data.hacker_news.full`
      WHERE type = 'comment'
      GROUP BY year
      ORDER BY year;
    
      year with—   total  frac
      2006     0      12 0.000
      2007    13   70858 0.000
      2008   461  247922 0.001
      2009  1497  491034 0.003
      2010  3835  842438 0.005
      2011  4719 1044913 0.005
      2012  5648 1246782 0.005
      2013  7881 1665185 0.005
      2014  8400 1510814 0.006
      2015  9967 1642912 0.006
      2016 12081 2093612 0.006
      2017 14530 2361709 0.006
      2018 19246 2384086 0.008
      2019 23662 2755063 0.009
      2020 27316 3243173 0.008
      2021 32863 3765921 0.009
      2022 34657 4062159 0.009
      2023 36611 4221940 0.009
      2024 32543 3339861 0.010
      2025 30608 2231919 0.014
    
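    So there's definitely been an increase: the fraction hovered around 0.9% from 2019 through 2023, then rose to 1.0% in 2024 and 1.4% in 2025.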
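
    For anyone without BigQuery access, the yearly aggregation above could be approximated locally. This is only a minimal Python sketch, assuming the comments are already available as (year, text) pairs; the function name `dash_fraction_by_year` and the sample rows are made up for illustration and are not from the HN dataset.

```python
from collections import defaultdict

EM_DASH = "\u2014"  # the character the SQL LIKE pattern searches for


def dash_fraction_by_year(comments):
    """Return {year: (with_dash, total, fraction)} for (year, text) pairs."""
    counts = defaultdict(lambda: [0, 0])  # year -> [with_dash, total]
    for year, text in comments:
        counts[year][1] += 1
        # A missing text counts toward the total but never as a match,
        # mirroring how the query treats NULL.
        if text is not None and EM_DASH in text:
            counts[year][0] += 1
    return {
        year: (with_dash, total, with_dash / total)
        for year, (with_dash, total) in sorted(counts.items())
    }


# Hypothetical sample rows, not taken from the HN dataset.
sample = [
    (2021, "plain comment"),
    (2021, "a pause \u2014 then more"),
    (2025, "dash \u2014 here"),
    (2025, None),
]
result = dash_fraction_by_year(sample)
```

    Each comment is counted once no matter how many em dashes it contains, which matches the LIKE test in the query.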

    Querying for the users who use "—" most as a proportion of all their comments:

      -- Per user: share of pre-ChatGPT comments containing an em dash.
      SELECT
        `by`,
        COUNTIF(text LIKE '%—%') / COUNT(*) AS fraction,
        COUNT(*) AS total,
        MIN(timestamp) AS minTime,
        MAX(timestamp) AS maxTime
      FROM `bigquery-public-data.hacker_news.full`
      WHERE
        type = 'comment' AND
        timestamp < TIMESTAMP '2022-11-30'  -- before the release of ChatGPT
      GROUP BY `by`
      HAVING COUNT(*) > 100  -- skip accounts with too few comments to rank
      ORDER BY fraction DESC
      LIMIT 250;
    
    zmgsabst uses them the most [1]; westoncb [2], an older account, uses them fourth-most.
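
    The per-user ranking can be sketched the same way outside BigQuery. This is a rough Python equivalent under stated assumptions: comments arrive as (user, timestamp, text) tuples, and the function name `top_dash_users`, its parameters, and the sample users are hypothetical.

```python
from collections import Counter
from datetime import datetime

EM_DASH = "\u2014"
CUTOFF = datetime(2022, 11, 30)  # same cutoff date as the query


def top_dash_users(comments, min_comments=100, limit=250):
    """Rank users by the share of their pre-cutoff comments with an em dash.

    `comments` is an iterable of (user, timestamp, text) tuples.
    """
    totals, with_dash = Counter(), Counter()
    for user, ts, text in comments:
        if ts >= CUTOFF:
            continue  # keep only comments before the cutoff
        totals[user] += 1
        if text and EM_DASH in text:
            with_dash[user] += 1
    ranked = [
        (user, with_dash[user] / n, n)
        for user, n in totals.items()
        if n > min_comments  # same role as HAVING COUNT(*) > 100
    ]
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked[:limit]


# Hypothetical sample rows; the threshold is lowered so the tiny sample ranks.
sample = [
    ("alice", datetime(2020, 1, 1), "a \u2014 b"),
    ("alice", datetime(2021, 1, 1), "plain"),
    ("bob", datetime(2020, 5, 5), "x \u2014 y"),
    ("bob", datetime(2021, 5, 5), "z \u2014 w"),
    ("bob", datetime(2023, 1, 1), "after the cutoff"),
    ("carol", datetime(2019, 1, 1), "only one \u2014"),
]
ranking = top_dash_users(sample, min_comments=1)
```

    The minimum-comment threshold matters: without it, an account with a single comment containing an em dash would top the list at 100%.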

    [0] https://console.cloud.google.com/marketplace/product/y-combi...

    [1] https://news.ycombinator.com/threads?id=zmgsabst

    [2] https://news.ycombinator.com/threads?id=westoncb

    replies(2): >>45072984 #>>45076079 #
    1. hithereagain ◴[] No.45076079[source]
    Older people, say folks in their forties or older, grew up with the em dash.
    replies(2): >>45078876 #>>45079288 #
    2. JdeBP ◴[] No.45078876[source]
    That's backwards. People in that age bracket grew up with computers where the em dash was not in the character set at all, and typewriters and terminals only had a minus key.

    The people who grew up with the em dash are the younger HTML generation of 30 years ago where &mdash; was at least a reasonably convenient character entity even if they were using computers with the various 8-bit character sets that did not contain it.
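
    To illustrate the point about character sets, here is a small Python sketch. It resolves the &mdash; entity and checks whether the resulting character can be encoded in a few older encodings; the choice of ASCII, Latin-1, and code page 437 as examples is mine, and the helper name `encodable` is hypothetical.

```python
import html

# The HTML character reference resolves to U+2014 EM DASH.
em_dash = html.unescape("&mdash;")


def encodable(ch, encoding):
    """Return True if `ch` can be represented in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


in_ascii = encodable(em_dash, "ascii")
in_latin1 = encodable(em_dash, "latin-1")
in_cp437 = encodable(em_dash, "cp437")  # the original IBM PC code page
```

    All three checks come back False, while the entity itself still resolves to the em dash, which is why the entity was the convenient route.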

    replies(3): >>45079195 #>>45079364 #>>45080348 #
    3. jml78 ◴[] No.45079195[source]
    Correct, I am 46, grew up with BBS. Early internet. I will be honest, never knew the name of em dash until it became a GPT thing.
    replies(2): >>45080609 #>>45080973 #
    4. jnwatson ◴[] No.45079288[source]
    Older people that grew up with "desktop publishing" and "The Mac is not a Typewriter" grew up with the em dash.
    replies(1): >>45079351 #
    5. JKCalhoun ◴[] No.45079351[source]
    Correct. And my typewriter dad will do two dashes --.
    replies(1): >>45079392 #
    6. JKCalhoun ◴[] No.45079364[source]
    True, but when desktop publishing arrived on the Mac, I embraced it.
    replies(1): >>45080949 #
    7. patrickmay ◴[] No.45079392{3}[source]
    Son?
    8. reaperducer ◴[] No.45080348[source]
    > That's backwards. People in that age bracket grew up with computers where the em dash was not in the character set at all, and typewriters and terminals only had a minus key.

    I guess you weren't there. We did em-dashes on typewriters. We just turned the platen knob down one click, typed _, and turned it back.

    replies(2): >>45080533 #>>45080544 #
    9. ted_dunning ◴[] No.45080533{3}[source]
    None of us at our house did that.
    replies(1): >>45082398 #
    10. npsomaratna ◴[] No.45080544{3}[source]
    Anecdotally, what I've seen is that folks who learned typing in the 80s and earlier use two dashes '--' instead of the em-dash (although modern word processors seem to replace this combination with the em-dash). Something else I've noticed is their tendency to use two blank spaces between sentences.
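
    The automatic replacement mentioned above can be imitated with a one-line substitution. This is a deliberately simplified Python sketch, not the actual rule used by any particular word processor; the function name `autoformat_dashes` and the "only between word characters" condition are assumptions.

```python
import re

# Simplified rule: two hyphens directly between word characters
# are treated as a typed em dash.
DOUBLE_HYPHEN = re.compile(r"(?<=\w)--(?=\w)")


def autoformat_dashes(text):
    """Replace word--word with word, em dash, word."""
    return DOUBLE_HYPHEN.sub("\u2014", text)


converted = autoformat_dashes("typewriter habit--two hyphens")
untouched = autoformat_dashes("ls --all")  # command-line flag is left alone
```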

    I'm a self-taught typist, with all the quirks that come with that (I can type programming stuff very accurately at 100+ WPM; I can type normal stuff at a high WPM as well, but the error rate goes up).

    11. YVoyiatzis ◴[] No.45080609{3}[source]
    # Dash Usage Guide

    *Hyphen (-)* = word-joiner

    *En dash (–)* = “to/between”

    *Em dash (—)* = pause, punch, drama
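
    The three characters in the guide are distinct Unicode code points, which is easy to check. A short Python sketch, assuming the plain keyboard hyphen is what the guide means by "hyphen":

```python
import unicodedata

# Escapes are used so the three look-alike characters cannot be confused.
DASHES = {
    "hyphen": "-",
    "en dash": "\u2013",
    "em dash": "\u2014",
}

# Official Unicode character names and code points for each dash.
names = {label: unicodedata.name(ch) for label, ch in DASHES.items()}
code_points = {label: f"U+{ord(ch):04X}" for label, ch in DASHES.items()}
```

    Note that Unicode calls the keyboard character HYPHEN-MINUS, since it doubles as the minus sign.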

    12. DonHopkins ◴[] No.45080949{3}[source]
    {—}
    13. JdeBP ◴[] No.45080973{3}[source]

        ... meaning that you have read some posts on this page a certain way.  (-:
        --- IM2000
         * Origin: Some WWW site named Hacker News (2:257/609.3)
    14. reaperducer ◴[] No.45082398{4}[source]
    That doesn't mean it didn't happen. Your house is not the only house.

    Moreover, your home is not representative of the millions of typewriters in businesses around the world.