Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting; they are explicitly pedagogical. For anything requiring insight, it's either:
1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)
2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)
I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I hadn't known the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...
I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!
This effectively makes LLMs useless for education. (It also sours the next generation on LLMs in general; these things are extremely lame to the proverbial "kids these days".)
Some teachers try to collect the phones beforehand, but then students simply hand in older phones and keep their active ones with them.
You could try to verify that the phones they hand in are working by calling them, but that would take an enormous amount of time and is impractical for simple exams.
We really have no idea how much AI is ruining education right now.
Any of the following could work, though the specific tradeoffs & implementation details do vary:
- have <n> teachers walking around the room to watch for cheaters
- mount a few cameras to various points in the room and give the teacher a dashboard so that they can watch from all angles
- record from above and use AI to flag potential cheaters for manual review
- disable Wi-Fi + activate cell jammers during exam time (with a land-line in the room in case of emergencies?)
- build dedicated examination rooms lined with metal mesh to disrupt cell reception
So unlike "beating LLMs" (where it's an open question as to whether it's even possible, and a moving target to boot), this just seems like a question of funding, and therefore political will, barring serious advances in wearable technology.