
Using LLMs at Oxide

(rfd.shared.oxide.computer)
702 points by steveklabnik | 1 comment | source
cobertos ◴[] No.46179465[source]
> LLMs are especially good at evaluating documents to assess the degree that an LLM assisted their creation!)

That's a bold claim. Do they have data to back it up? I'd only have confidence saying this after testing it against multiple LLMs' outputs, and does it really work for, e.g., the em dash leaderboard of HN, or for people who tell an LLM not to use these 10 LLM-y writing cliches? I'd need to see their reasoning for why they think this before I believe it.

replies(3): >>46180279 #>>46180384 #>>46182998 #
bcantrill ◴[] No.46182998[source]
I am really surprised that people are surprised by this; honestly, the reference in the RFD was so casual because this is probably the way I use LLMs the most (so it comes very much from my own personal experience). I will add a footnote to the RFD to explain this, but just for everyone's benefit here: at Oxide, we have a very writing-intensive hiring process.[0] Unsurprisingly, over the last six months we have seen an explosion of LLM-authored materials (especially for our technical positions). We have told applicants to be careful about doing this[1], but they do it anyway. We have also seen this coupled with outright fraud (though less frequently). Speaking personally, I spend a lot of time reviewing candidate materials, and my ear has become very sensitive to LLM-generated writing. So while I generally only engage an LLM to aid in detection when I already have a suspicion, they have proven adept at it. (I also elaborated on this a little in our podcast episode with Ben Shindel on using LLMs to explore the fraud of Aidan Toner-Rodgers.[2])

I wasn't trying to assert that LLMs can find all LLM-generated content (which feels tautologically impossible?), just that they are useful for the kind of LLM-generated content that we seek to detect.

[0] https://rfd.shared.oxide.computer/rfd/0003

[1] https://oxide.computer/careers

[2] https://oxide-and-friends.transistor.fm/episodes/ai-material...

replies(2): >>46184724 #>>46187430 #
1. cobertos ◴[] No.46187430[source]
I still don't quite get this reasoning. A statistical model for detecting a category (is this hiring material LLM-generated or not, is this email spam or not, etc.) is best measured by its false positive and false negative rates. But it doesn't sound like anyone measures these; the model just gets applied after a couple of "huh, that worked" moments and we move on. There's a big difference between a detector that is right 70% of the time and one that is right 99% of the time, and I'm not sure we can say which this is.
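
To be concrete about what "measuring" would mean, here's a rough sketch, assuming you had a labeled set of application materials (some known human-written, some known LLM-generated) and some detect() call standing in for whatever verdict the LLM reviewer gives. All names here are made up for illustration; this is not how Oxide does it.

    # Rough sketch (Python): measure a detector's error rates on a labeled set.
    # detect(doc) is a hypothetical stand-in for whatever "is this LLM-written?"
    # verdict the reviewer (human or LLM) produces; the labeled corpora are
    # assumed to exist.

    def false_positive_rate(detect, human_written_docs):
        # Fraction of genuinely human-written documents wrongly flagged as LLM-generated.
        flagged = sum(1 for doc in human_written_docs if detect(doc))
        return flagged / len(human_written_docs)

    def false_negative_rate(detect, llm_written_docs):
        # Fraction of known LLM-generated documents that slip through undetected.
        missed = sum(1 for doc in llm_written_docs if not detect(doc))
        return missed / len(llm_written_docs)

    # A detector with a 1% false positive rate and a 30% false negative rate
    # behaves very differently in a hiring pipeline than the reverse, and the
    # two are indistinguishable without numbers like these.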

Maybe if LLMs were aligned for this specific task it'd make more sense? But they're not. Their alignment tunes them to provide statistically helpful responses across a wide variety of tasks; they prefer positive responses to negative ones, and they are not tuned directly as a detection tool for arbitrary categorization. And maybe they do work well, but maybe it's only a specific version of a specific model against other specific models' hiring-material outputs? There are too many confounding factors here to reach that conclusion without studying it rigorously, which is why it felt... not carefully considered.

Maybe you have considered this more than I know. It sounds like you work a lot with this data. But the off-handedness set off my skepticism.