
469 points ghuntley | 1 comments
faangguyindia No.45001426
Anyone can build a coding agent that works a) on a fresh codebase and b) with an unlimited token budget.

Now build it for an old codebase, and let's see how precisely it edits or removes features without breaking the whole codebase.

Let's see how many tokens it consumes per bug fix or feature addition.

replies(4): >>45001529 #>>45001567 #>>45001784 #>>45001830 #
1. NitpickLawyer No.45001830
There's "SWE-rebench", a benchmark that tracks model release dates, so you can see how a model did on "real-world" bugs that were submitted on GitHub after the model was released. (Obviously, this works best for open models.)

There are a few models that solve 30-50% of (new) tasks pulled from real-world repos. So ... yeah.