
223 points | benkaiser | 1 comment
ks2048 No.42538257
This is interesting. I'm curious about how much (and what) these LLMs memorize verbatim.

Does anyone know of more thorough papers on this topic? For example, this could be tested on every verse in the Bible, and on lots of other text that is certainly in the training data: books from Project Gutenberg, Wikipedia articles, etc.

Edit: this (and its references) looks like a good place to start: https://arxiv.org/abs/2407.17817v1
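
The test proposed above can be sketched in a few lines: prompt a model with the opening words of a known passage and score how much of the true continuation it reproduces word for word. Below is a minimal, hedged sketch of the scoring side; `query_model` is a hypothetical stand-in for whatever LLM API you use, and the word-level prefix metric is just one simple choice (papers on memorization often use token-level or substring metrics instead).

```python
def verbatim_prefix_score(reference: str, generated: str) -> float:
    """Fraction of the reference text reproduced verbatim from the start,
    compared word by word (stops at the first mismatch)."""
    ref_words = reference.split()
    gen_words = generated.split()
    matched = 0
    for r, g in zip(ref_words, gen_words):
        if r != g:
            break
        matched += 1
    return matched / len(ref_words) if ref_words else 0.0


def memorization_probe(passages, query_model, prompt_words=10):
    """For each named passage, prompt the model with its first `prompt_words`
    words and score the continuation against the true remainder.

    `query_model` is assumed to be a callable: prompt string -> completion
    string (e.g. a thin wrapper around your LLM API of choice)."""
    scores = {}
    for name, text in passages.items():
        words = text.split()
        prompt = " ".join(words[:prompt_words])
        truth = " ".join(words[prompt_words:])
        scores[name] = verbatim_prefix_score(truth, query_model(prompt))
    return scores
```

Run over every Bible verse (or Gutenberg paragraph), the resulting score distribution gives a rough, per-passage picture of verbatim recall; a score of 1.0 means the model reproduced the remainder exactly.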

1. int_19h No.42542876
For one anecdotal data point, GPT-4 knows the "Navy SEAL copypasta" verbatim. It can reproduce it complete with all the original typos and misspellings, and it can recognize it from the first sentence alone.