
358 points | tkgally | 1 comment

The use of the em dash (—) now raises suspicions that a text might have been AI-generated. Inspired by a suggestion from dang [1], I created a leaderboard of HN users according to how many of their posts before November 30, 2022—that is, before the release of ChatGPT—contained em dashes. Dang himself comes in at number 2—by a very slim margin.

Credit to Claude Code for showing me how to search the HN database through Google BigQuery and for writing the HTML for the leaderboard.

[1] https://news.ycombinator.com/item?id=45053933
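
A minimal sketch of what the underlying query might look like against the public bigquery-public-data.hacker_news.full dataset (the post doesn't include its exact query; restricting to comments and the LIKE pattern are assumptions borrowed from the query quoted below):

  -- Hypothetical sketch: per-user counts of em-dash-containing comments before the ChatGPT release
  SELECT
    `by` AS user,
    COUNTIF(text LIKE '%—%') AS em_dash_comments,
    COUNT(*) AS total_comments
  FROM `bigquery-public-data.hacker_news.full`
  WHERE type = 'comment'
    AND timestamp < '2022-11-30'
  GROUP BY user
  ORDER BY em_dash_comments DESC
  LIMIT 100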

dns_snek:
Slightly tweaked: a leaderboard of em-dash-containing comments after the ChatGPT release, limited to users who used them in fewer than 1% of their comments before the release and who posted at least 200 comments both before and after it. Data is current as of August 28th.

Of course, this doesn't mean they're using ChatGPT either; they could have switched devices or simply started using em dashes because they felt like it.

  #   user           before_chatgpt after_chatgpt  
  1   fao_           9/1777 (1 %)   36/225 (16 %)
  2   tlogan         1/962 (0 %)    59/399 (15 %)
  3   whynotminot    1/250 (0 %)    36/356 (10 %)
  4   unclebucknasty 13/2566 (1 %)  38/378 (10 %)
  5   iLemming       0/793 (0 %)    61/628 (10 %)
  6   nostrebored    10/1045 (1 %)  32/331 (10 %)
  7   freeone3000    0/2128 (0 %)   74/791 (9 %) 
  8   pdabbadabba    6/932 (1 %)    20/225 (9 %) 
  9   thebooktocome  4/632 (1 %)    18/208 (9 %) 
  10  tnecniv        0/671 (0 %)    34/446 (8 %) 
  11  dkersten       39/5092 (1 %)  24/318 (8 %) 
  12  stared         8/1565 (1 %)   29/392 (7 %) 
  13  ETH_start      3/385 (1 %)    75/1029 (7 %)
  14  tcbawo         2/792 (0 %)    15/218 (7 %) 
  15  jbm            2/406 (0 %)    22/350 (6 %) 
Query [2]:

  -- Per-user em-dash and total comment counts, split before/after the ChatGPT release date
  WITH by_user AS (
    SELECT
      `by` AS user,
      COUNTIF(text LIKE '%—%') AS match_count,
      COUNT(*) AS total_count,
      (timestamp >= '2022-11-30') AS after_chatgpt
    FROM `bigquery-public-data.hacker_news.full` 
    WHERE type = 'comment'
    GROUP BY user, after_chatgpt
  ),
  -- One row per user; keep only users with at least 200 comments in each period
  combined AS (
    SELECT
      user,
      MAX(IF(NOT after_chatgpt, match_count, 0)) AS match_before_chatgpt,
      MAX(IF(NOT after_chatgpt, total_count, 0)) AS total_before_chatgpt,
      MAX(IF(after_chatgpt, match_count, 0)) AS match_after_chatgpt,
      MAX(IF(after_chatgpt, total_count, 0)) AS total_after_chatgpt,
    FROM by_user
    GROUP BY user
    HAVING total_before_chatgpt >= 200 AND total_after_chatgpt >= 200
  ),
  -- Em-dash fraction of comments in each period
  with_fractions AS (
    SELECT
      *,
      SAFE_DIVIDE(match_before_chatgpt, total_before_chatgpt)  AS fraction_before_chatgpt,
      SAFE_DIVIDE(match_after_chatgpt, total_after_chatgpt) AS fraction_after_chatgpt
    FROM combined
  )
  -- Users under 1% before ChatGPT, ranked by post-ChatGPT em-dash fraction
  SELECT
    user,
    FORMAT('%d/%d (%.0f %%)', match_before_chatgpt, total_before_chatgpt, ROUND(fraction_before_chatgpt*100)) AS before_chatgpt,
    FORMAT('%d/%d (%.0f %%)', match_after_chatgpt, total_after_chatgpt, ROUND(fraction_after_chatgpt*100)) AS after_chatgpt
  FROM with_fractions
  WHERE fraction_before_chatgpt < 0.01
  ORDER BY fraction_after_chatgpt DESC
  LIMIT 15
[1] https://news.ycombinator.com/item?id=45072937

[2] https://console.cloud.google.com/marketplace/product/y-combi...

stavros:
I think for this one you should rank by absolute, rather than relative, increase. First place went from 9 to 36, whereas second went from 1 to 59; the number of em-dash comments they wrote before ChatGPT shouldn't be relevant, I think.

It does need some normalization for people who post very few comments, but it feels more fair this way.
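
For illustration, the final SELECT in dns_snek's query could be swapped for an absolute-increase ranking along these lines (a sketch reusing the with_fractions CTE above; how best to normalize for users with few comments is left open):

  -- Sketch: rank by absolute increase in em-dash comment count instead of post-ChatGPT fraction
  SELECT
    user,
    match_after_chatgpt - match_before_chatgpt AS em_dash_increase,
    FORMAT('%d/%d', match_before_chatgpt, total_before_chatgpt) AS before_chatgpt,
    FORMAT('%d/%d', match_after_chatgpt, total_after_chatgpt) AS after_chatgpt
  FROM with_fractions
  WHERE fraction_before_chatgpt < 0.01
  ORDER BY em_dash_increase DESC
  LIMIT 15

The existing HAVING total_before_chatgpt >= 200 AND total_after_chatgpt >= 200 filter already provides some of the normalization stavros asks for, since users with very few comments never reach the leaderboard.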