A Research Preview of Codex

Is anyone using any of these tools to write non boilerplate code?

I'm very interested.

In my experience ChatGPT and Gemini are absolutely terrible at these types of things. They are constantly wrong. I know I'm not saying anything new, but I'm waiting to personally experience an LLM that does something useful with any of the code I give it.

These tools aren't useless. They're great as search engines and pointing me in the right direction. They write dumb bash scripts that save me time here and there. That's it.

And it's hilarious to me how these people present these tools. It generates a bunch of code, and then you spend all your time auditing and fixing what is expected to be wrong.

That's not the type of code I'm putting in my company's code base, and I could probably write the damn code more correctly in less time than it takes to review for expected errors.

What am I missing?

you might be missing small things to create more guardrails like effective prompting and maintaining what's been done using files, carefully controlling context, committing often in-between changes, but largely, you're not missing anything. i use AI constantly, but always for subtasks of a larger complicated thing that my brain has thought through. and often use higher cost models to help me abstractly think through complex things/point me in the right directions.

personally, i've always operated in a codebase in a way that i _need_ to understand how things work for me to be productive and make the right decisions. I operate the same way with AI. every change is carefully reviewed, if it's dumb, i make it redo it and explain why it's dumb. and if it gets caught in a loop, i reset the context and try to reframe the problem. overall, i'm definitely more productive, but if you truly want to be hands off--you're in for a very bad time. i've been there.

lastly, some codebases don't work well with AI. I was working on a problem that was a bit more novel/out there and no model could solve it. Just yapped endlessly about these complex, very potentially smart sounding solutions that did absolutely nothing. went all the way to o1-pro. the craziest part to me was the fact that across claude, deepseek and openai, they used the same specific vernacular for this particular problem which really highlights how a lot of these models are just a mish-mash of the same underlying architecture/internet data. some of these models use responses from other models for their training data, which to me is like incest. you won't get good genetical results