←back to thread

358 points tkgally | 2 comments | | HN request time: 0.611s | source

The use of the em dash (—) now raises suspicions that a text might have been AI-generated. Inspired by a suggestion from dang [1], I created a leaderboard of HN users according to how many of their posts before November 30, 2022—that is, before the release of ChatGPT—contained em dashes. Dang himself comes in number 2—by a very slim margin.

Credit to Claude Code for showing me how to search the HN database through Google BigQuery and for writing the HTML for the leaderboard.

[1] https://news.ycombinator.com/item?id=45053933

1. latexr ◴[] No.45071981[source]
I’d be interested in seeing how the data changes if instead of the total raw number of posts with em-dashes you instead check for their percentage considering the total number of posts. I guess the folks who registered later would be bumped up the list?
replies(1): >>45075006 #
2. svat ◴[] No.45075006[source]
Try it here (you may have to create a Google Cloud project, but you don't have to enable billing or start the free trial):

https://console.cloud.google.com/bigquery?p=bigquery-public-...

Click on the `+` (white over blue background) in the tab bar at the top that says "SQL query" on popup, and type the following (I use the GoogleSQL pipe syntax (https://cloud.google.com/bigquery/docs/reference/standard-sq... / https://news.ycombinator.com/item?id=41347188) below, but you can also use standard SQL if you prefer):

    FROM `bigquery-public-data.hacker_news.full` 
    |> WHERE type = 'comment' AND timestamp < '2022-11-30'
    |> AGGREGATE COUNT(*) AS total, COUNTIF(text LIKE '%—%') AS with_em GROUP BY `by`
    |> EXTEND with_em / total AS fraction_with_em
    |> ORDER BY fraction_with_em DESC
    |> WHERE total > 100 AND fraction_with_em > 0.1
(I'm in place 47 of the 516 results, with 0.29 of my comments (258 of 875) having an em dash in them.)

Edit: As you also asked about timestamps:

    FROM `bigquery-public-data.hacker_news.full`
    |> WHERE type = 'comment' AND timestamp < '2022-11-30'
    |> EXTEND text LIKE '%—%' AS has_em
    |> AGGREGATE
        COUNT(*) AS total,
        COUNTIF(has_em) AS with_em,
        MIN(timestamp) AS first_comment_timestamp,
        MIN(IF(has_em, timestamp, NULL)) AS first_em_timestamp,
        TIMESTAMP_SECONDS(CAST(AVG(time) AS INT64)) AS avg_comment_timestamp,
        TIMESTAMP_SECONDS(CAST(AVG(IF(has_em, time, NULL)) AS INT64)) AS avg_em_timestamp,
      GROUP BY `by`
    |> EXTEND with_em / total AS fraction_with_em
    |> ORDER BY fraction_with_em DESC
    |> WHERE total > 100 AND fraction_with_em > 0.1
for most people the average timestamp is just the midpoint of when they started posting (with em dashes) and the cutoff date of 2022-11-30, and the top-place user zmgsabst stands out for having started only in late January 2022.