At this level, it is very contextual - depending on your tools, prompts, language, libraries, and the whole code base. For example, for one project, I am generating ggplot2 code in R; Claude 3.5 gives way better results than the newer Claude 3.7.
Compare and contrast https://aider.chat/docs/leaderboards/, https://web.lmarena.ai/leaderboard, https://livebench.ai/#/.