358 points tkgally | 2 comments

The use of the em dash (—) now raises suspicions that a text might have been AI-generated. Inspired by a suggestion from dang [1], I created a leaderboard of HN users according to how many of their posts before November 30, 2022—that is, before the release of ChatGPT—contained em dashes. Dang himself comes in number 2—by a very slim margin.

Credit to Claude Code for showing me how to search the HN database through Google BigQuery and for writing the HTML for the leaderboard.

[1] https://news.ycombinator.com/item?id=45053933

Symbiote No.45072937
Using the HN public dataset in Google BigQuery [0], which I think fits easily within the free query quota:

  SELECT 
    EXTRACT(YEAR FROM timestamp) AS year, 
    SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) AS withDash, 
    COUNT(*) AS total, 
    SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) / COUNT(*) AS fraction
  FROM `bigquery-public-data.hacker_news.full` 
    WHERE type = 'comment' 
  GROUP BY year 
  ORDER BY year;

  year with—   total  frac
  2006     0      12 0.000
  2007    13   70858 0.000
  2008   461  247922 0.001
  2009  1497  491034 0.003
  2010  3835  842438 0.005
  2011  4719 1044913 0.005
  2012  5648 1246782 0.005
  2013  7881 1665185 0.005
  2014  8400 1510814 0.006
  2015  9967 1642912 0.006
  2016 12081 2093612 0.006
  2017 14530 2361709 0.006
  2018 19246 2384086 0.008
  2019 23662 2755063 0.009
  2020 27316 3243173 0.008
  2021 32863 3765921 0.009
  2022 34657 4062159 0.009
  2023 36611 4221940 0.009
  2024 32543 3339861 0.010
  2025 30608 2231919 0.014
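So there's definitely been an increase: the fraction hovered around 0.008 to 0.009 from 2018 through 2023, then rose to 0.010 in 2024 and 0.014 in 2025.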
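
As a rough cross-check on the table above, here is a minimal Python sketch that recomputes the fraction for the last few years and compares 2025 with 2022. The counts are copied from the table; the names `counts` and `fraction` are only for illustration and are not part of the query. The 2025 row presumably covers only part of the year, so the comparison is between rates, not raw counts.

```python
# Counts copied from the table above:
# year -> (comments containing an em dash, total comments)
counts = {
    2021: (32863, 3765921),
    2022: (34657, 4062159),
    2023: (36611, 4221940),
    2024: (32543, 3339861),
    2025: (30608, 2231919),
}


def fraction(year):
    """Share of that year's comments that contain at least one em dash."""
    with_dash, total = counts[year]
    return with_dash / total


for year in sorted(counts):
    print(year, f"{fraction(year):.4f}")

# How much higher the 2025 rate is than the 2022 rate
ratio = fraction(2025) / fraction(2022)
print(f"2025 vs 2022: {ratio:.2f}x")
```

On these numbers the 2025 rate comes out at roughly 1.6 times the 2022 rate.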

Querying for the users who use "—" most as a proportion of all their comments:

  SELECT
    `by`,
    SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) / COUNT(*) AS fraction,
    COUNT(*) AS total,
    MIN(timestamp) AS minTime,
    MAX(timestamp) AS maxTime
  FROM `bigquery-public-data.hacker_news.full` 
  WHERE 
    type = 'comment' AND 
    timestamp < '2022-11-30'  -- only comments posted before ChatGPT's release
  GROUP BY `by`
  HAVING COUNT(*) > 100       -- skip accounts with 100 or fewer comments
  ORDER BY fraction DESC
  LIMIT 250;
zmgsabst uses them the most [1]; westoncb [2] is an older account that uses them fourth-most.

[0] https://console.cloud.google.com/marketplace/product/y-combi...

[1] https://news.ycombinator.com/threads?id=zmgsabst

[2] https://news.ycombinator.com/threads?id=westoncb

LeoPanthera No.45072984
I took a peek at zmgsabst's comments, but they use them with spaces around the dash — like this.

ChatGPT always uses them without spaces—like this.
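
To make that distinction checkable, here is a minimal Python sketch that labels a comment by how its em dashes are spaced. The function name `dash_style` and the label strings are hypothetical, chosen for this example; it only looks at plain text and says nothing about who or what wrote it.

```python
import re

EM_DASH = "\u2014"
# Em dash with whitespace on both sides, as in "word \u2014 word"
SPACED = re.compile(r"\s\u2014\s")
# Em dash squeezed directly between two non-space characters
UNSPACED = re.compile(r"\S\u2014\S")


def dash_style(text):
    """Return 'spaced', 'unspaced', 'mixed', 'other', or 'none'."""
    spaced = bool(SPACED.search(text))
    unspaced = bool(UNSPACED.search(text))
    if spaced and unspaced:
        return "mixed"
    if spaced:
        return "spaced"
    if unspaced:
        return "unspaced"
    # An em dash is present but has a space on only one side,
    # or sits at the very start or end of the text.
    if EM_DASH in text:
        return "other"
    return "none"


samples = [
    "with spaces around the dash \u2014 like this.",
    "without spaces\u2014like this.",
]
for sample in samples:
    print(dash_style(sample), "|", sample)
```

Spacing alone is a weak signal, since plenty of people write the unspaced form too.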

eMPee584 No.45078138
& it looks awful without spaces — imho
1. JKCalhoun No.45079372
Which is what I do (add a space before and after). I didn't know you weren't supposed to put the spaces until someone pointed it out to me — suggested I was not an LLM because I added the spaces.

Makes me wonder whether, if kerning is done correctly, the em dash would look as though it had spaces before and after even when there were none.

2. card_zero No.45080989
Not at all, no. Here are a few historical examples:

1903 edition of The Wizard of Oz — https://archive.org/details/newwizardofoz00baum/page/2/mode/...

A page from Life magazine, 1894 — https://archive.org/details/sim_life_1894-08-23_24_608/page/...

The Illustrated London News, 1843 — https://archive.org/details/illustrated-london-news-v002-184...

The em dash pretty much just joins the two glyphs together. It's supposed to look that way.