    511 points meetpateltech | 24 comments
    1. prhn No.44006680
    Is anyone using any of these tools to write non-boilerplate code?

    I'm very interested.

    In my experience ChatGPT and Gemini are absolutely terrible at these types of things. They are constantly wrong. I know I'm not saying anything new, but I'm waiting to personally experience an LLM that does something useful with any of the code I give it.

    These tools aren't useless. They're great as search engines and at pointing me in the right direction. They write dumb bash scripts that save me time here and there. That's it.

    And it's hilarious to me how these people present these tools: it generates a bunch of code, and then you spend all your time auditing and fixing what is expected to be wrong.

    That's not the type of code I'm putting in my company's code base, and I could probably write the damn code more correctly in less time than it takes to review it for expected errors.

    What am I missing?

    replies(16): >>44006706 #>>44006751 #>>44006766 #>>44006808 #>>44006858 #>>44006868 #>>44006872 #>>44007014 #>>44007038 #>>44007115 #>>44007288 #>>44007383 #>>44007699 #>>44009108 #>>44012169 #>>44014213 #
    2. icapybara No.44006706
    It’s probably what you’re asking of it. You can’t just say “write me an app”; you have to break a big problem into small problems for it.
    3. spariev No.44006751
    I think it all depends on your platform and use cases. In my experience, AI tools work best with Python and JS/TypeScript and with simple use cases (web apps, basic data science, etc.). Also, I've found they can be a great help with refactorings and with cases where you need to do something similar to existing code, but with a twist or change.
    4. volkk No.44006766
    You might be missing small things that create more guardrails: effective prompting, keeping a record of what's been done in files, carefully controlling context, and committing often between changes. But largely, you're not missing anything. I use AI constantly, but always for subtasks of a larger complicated thing that my brain has thought through, and I often use higher-cost models to help me think through complex things abstractly or point me in the right direction.

    Personally, I've always operated in a codebase in a way where I _need_ to understand how things work to be productive and make the right decisions. I operate the same way with AI: every change is carefully reviewed; if it's dumb, I make it redo the work and explain why it's dumb; and if it gets caught in a loop, I reset the context and try to reframe the problem. Overall, I'm definitely more productive, but if you truly want to be hands-off, you're in for a very bad time. I've been there.

    Lastly, some codebases don't work well with AI. I was working on a problem that was a bit more novel/out there, and no model could solve it; they just yapped endlessly about complex, potentially smart-sounding solutions that did absolutely nothing. I went all the way to o1-pro. The craziest part to me was that across Claude, DeepSeek, and OpenAI, the models used the same specific vernacular for this particular problem, which really highlights how a lot of these models are a mish-mash of the same underlying architecture/internet data. Some of these models use responses from other models as training data, which to me is like incest: you won't get good genetic results.

    6. Workaccount2 No.44006858
    >What am I missing?

    That you are trying to use LLMs to create the giant, sprawling, feature-packed software packages that define the modern software landscape. What's being missed is that any one user might only use 5% of the code base on any given day. Software is written to accommodate every need every user could have in one package, and then each user uses only the small slice that fits their specific needs.

    I have now created 5 hyper-narrow programs that are used daily by my company to do work. I am not a programmer, and my company is not a tech company located in a tech bubble. We are a tiny company that does old-school manufacturing.

    To give a quick general example: Betty uses Excel to manage payroll. A list of employees, a list of wages, a list of hours worked (which she copies from the time-clock software's .csv export and imports into Excel).

    Excel is a few-million-LOC program that costs ~$10/mo. Betty needs maybe 2k LOC to do what she uses Excel for: something an LLM can write easily, such as a Python GUI wrapper on a SQLite DB (a sketch follows below). And she would be blown away by how fast it is, and by how it is written specifically for her use.
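
    To make that concrete, here is a minimal sketch of such a tool, using only the Python standard library (tkinter + sqlite3). The file name and table layout are hypothetical, invented for the example; they are not anything Betty actually uses:

        import sqlite3
        import tkinter as tk
        from tkinter import ttk

        # Hypothetical schema: one row per employee with wage and hours worked.
        conn = sqlite3.connect("payroll.db")
        conn.execute("""CREATE TABLE IF NOT EXISTS employees
                        (name TEXT, wage REAL, hours REAL)""")
        conn.commit()

        def refresh(tree):
            # Recompute the pay column straight from the DB on every refresh.
            tree.delete(*tree.get_children())
            for name, wage, hours in conn.execute(
                    "SELECT name, wage, hours FROM employees"):
                tree.insert("", "end",
                            values=(name, wage, hours, round(wage * hours, 2)))

        root = tk.Tk()
        root.title("Payroll")
        cols = ("name", "wage", "hours", "pay")
        tree = ttk.Treeview(root, columns=cols, show="headings")
        for col in cols:
            tree.heading(col, text=col.title())
        tree.pack(fill="both", expand=True)
        refresh(tree)
        root.mainloop()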

    How software is written and how it is used will change to accommodate LLMs. We didn't design cars to drive on horse paths; we put down pavement.

    replies(2): >>44007080 #>>44009987 #
    7. Cu3PO42 No.44006868
    Occasionally. I find that there is a certain category of task that I can hand over to an LLM and get a result that takes me significantly less time to clean up than it would have taken me to write from scratch.

    A recent example from a C# project I was working in. The project used builder classes that were constructed according to specified rules, but all of these builders were written by hand. I wanted to automatically generate these builders, and not using AI, just good old meta-programming.

    Now, I knew enough to know that I needed a C# source generator, but I had absolutely no experience writing them. Could I have figured this out in an hour or two? Probably. Did I write a prompt in less than five minutes and get a source generator that worked correctly on the first try? Also yes. I then spent some time cleaning up that code and understanding the API it uses to hook into everything, was done in half an hour, and still learnt something from it.

    You can make the argument that this source generator is in itself "boilerplate", because it doesn't contain any special sauce, but I still saved significant time in this instance.

    8. uludag No.44006872
    I feel things get even worse when you use a more niche language. I get extremely disappointed any time I try to get one of these to do anything useful in Clojure. Even as a search engine, especially when asking about libraries, these tools completely fail to meet expectations.

    I can't even fathom how frustrating such tools would be with poorly written confusing Clojure code using some niche dependency.

    That being said, I can imagine a whole class of problems these could succeed at very well and provide value on. Then again, the types of problems I feel these systems could get right 99% of the time are problems that a skilled developer could fix in minutes.

    9. sottol No.44007014
    I tried using Gemini 2.5 Pro for a side-side-project; it seemed like a good project to explore LLMs and how they'd fit into my workflow. 2-3 weeks later, it's around 7k LOC of Python auto-generating about 35k LOC of C from a JSON spec.

    This project is not your typical webdev project, so maybe it's an interesting case study. It takes a C-API spec in JSON, loads and processes it in Python, and generates a C library that turns a UI marked up in YAML/JSON into C-API calls that render that UI. [1]

    The result is pretty hacky code (by my design; I can't/won't use FFI) that's 90% written by Gemini 2.5 Pro Pre/Exp, but it mostly worked. It's around 7k lines of Python that generate a 30-40k-LOC C library from a JSON LVGL API spec to render an LVGL UI from YAML/JSON markup.
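
    As a toy illustration of that spec-to-C pipeline (the spec fields and API names below are invented for the example, not taken from the real LVGL spec):

        import json

        # A miniature, invented "C-API spec" in the JSON shape described above.
        SPEC = json.loads("""
        [
          {"name": "lv_label_create", "ret": "lv_obj_t *",
           "args": ["lv_obj_t *parent"]},
          {"name": "lv_label_set_text", "ret": "void",
           "args": ["lv_obj_t *obj", "const char *text"]}
        ]
        """)

        def emit_wrapper(fn):
            # Emit a thin C wrapper that forwards its arguments to the API call.
            params = ", ".join(fn["args"])
            names = ", ".join(arg.split()[-1].lstrip("*") for arg in fn["args"])
            ret = "" if fn["ret"] == "void" else "return "
            return (f'{fn["ret"]} ui_{fn["name"]}({params}) {{\n'
                    f'    {ret}{fn["name"]}({names});\n'
                    f'}}\n')

        print("\n".join(emit_wrapper(fn) for fn in SPEC))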

    I probably spent 2-3 weeks on this. I might have been able to do something similar by hand in maybe 2x the time, but this was about 20% of the mental overhead/exhaustion it would have taken me otherwise. On the other hand, I would have had a much better understanding of the tradeoffs, and maybe a slightly cleaner architecture, if I had written it myself. But there's also a chance I would have gotten lost in some of the complexity and never finished (especially since it's a side project that probably no one else will ever see).

    What worked well:

    * It mostly works(!), unlike previous attempts with Gemini 1.5, where I had to spend as much time or more fixing the code as it would have taken me to write it myself. Even adding complicated features after the fact usually works pretty well, with minor fixing on my end.

    * Lowers mental "load" - you don't have to think so much about how to tackle features, refactors, ...

    Other stuff:

    * I really did not like Cursor or Windsurf. I half-use VSCode for embedded hobby projects, but I don't want to have yet another "thing" on top of that. Aider works, but it would probably require some more work to get used to the automatic features. I really need to get used to the tooling, which is not an insignificant time investment. It doesn't vibe with how I work, yet.

    * You can generate a *significant* amount of code in a short time. It doesn't feel like it's "your" code though, it's like joining a startup - a mountain of code, someone else's architecture, their coding style, comment style, ... and,

    * There's this "fog of code", where you can sort of bumble around the codebase but don't really 100% understand it. I still have mid-to-low confidence in the changes I make by hand, even a week after the codebase has largely stabilized. Again, it's like getting familiar with someone else's code.

    * Code quality is OK but not great (and partially my fault). It probably depends on how you got to the current code, i.e. how clean your "path" was. Since it's easier to "evolve" the whole project (I changed direction once or twice when I sort of hit a wall), it's also easier to end up with a messy-ish codebase. Maybe the way to go is to explore first, then codify all the requirements and start afresh from a clean slate instead of trying to evolve the codebase. But that's also not an insignificant amount of work, and more mental load (because then you really need to understand the whole codebase, or trust that an LLM can sufficiently distill it).

    * I got much better results with very precise prompts. Maybe I'm using it wrong, i.e. I usually (think I) know what I want and just instruct the LLM instead of having an exploratory chat, but the more explicit I am, the closer the output is to what I'd like to see. I've tried a few times to discuss proposed changes in order to generate a spec to implement in another session, but it takes time and was not super successful. Another thing to practice.

    * A bit of a later realization: modular code and short, self-contained modules are really important, though this might depend on your workflow.

    To summarize:

    * It works.

    * It lowers initial mental burden.

    * But to get really good results, you still have to put a lot of effort into it.

    * At least right now, it seems you will still have to put in the mental effort at some point. Normally that effort is "front-loaded": you do the design and think hard about it up front. With the AI doing all the initial work, it instead becomes harder to cope with the codebase once it reaches a certain complexity. Eventually you will have to understand it, even if just to instruct the LLM to make the exact changes you want.

    [1] https://github.com/thingsapart/lvgl_ui_preview

    10. asadm No.44007038
    Yes, think of it as a search engine that auto-applies the Stack Overflow fix to your code.

    But I have also done larger tasks (writing device drivers) using Gemini.

    11. kridsdale3 No.44007080
    The Romans put down paved roads to make their horse paths more reliable.

    But yes, I hope we get away from the giant conglomeration of everything, ESPECIALLY the reality of people doing 90% of their business inside a Google Chrome window, and move towards the UNIX philosophy of tiny single-purpose programs.

    12. browningstreet No.44007115
    I've built a number of personal data-oriented and single-purpose tools in Replit. I've constrained my ambitions to what I think it can do, but I've added use cases beyond my initial concept.

    In short, the tools work. I've built things 10x faster than doing it from scratch. I also have a sense of what else I'll be able to build in a year, and I enjoy not having to add cycles to communicate with external contributors: I think, then I do, even if there's a bit of wrestling. Wrangling with a coding agent feels a bit like "compile, test, fix, re-compile", and re-compiling generally got faster with each subsequent generation of compiler releases.

    My company is building internal business functions using AI right now. It works too. We're not putting that stuff in front of our customers yet, but I can see that it'll come. We may put agents into the product that let them build things for themselves.

    I get the grumpiness & resistance, but I don't see how it's buying you anything. The puck isn't underfoot.

    13. IXCoach No.44007288
    Hey there!

    Lots missing here, but I had the same issues; it takes iteration and practice. I use Claude Code in terminal windows, plus a text expander to save explicit reminders that I have to inject regularly, because Anthropic obscures access to system prompts.

    For example, I have instructions 3 to 8 paragraphs long that I place regularly about not assuming, checking deterministically, etc., and for most things I have the agents write a report following a specific instruction set.

    I pop the instructions into the text expander so I just type -docs when telling it to go figure something out and give me the path to the report when done.

    They come back with a path, and I copy it and search for it in VS Code.

    It opens as an .md file, and I use preview mode; it's similar to a Google Doc.

    And I'll review it. Always, things will be wrong: tons of assumptions, failures to check deterministically, etc. But I see that in the doc and have it fix things, correct misunderstandings, and update the doc until it's perfect.

    From there I'll say "add a plan in a table with a status for each task based on this" (another text expander snippet with instructions).

    And WHEN that's 100% right, I'll say "implement and update as you go." The "update as you go" forces it to recognize and remember the scope of the task.

    The greatest point of failure in the system is misalignment; the ethics teams got that right. It compounds FAST if allowed: you let agents assume things, they state assumptions as facts, that becomes what other agents read, and you get true chaos unchecked.

    I literally started rebuilding Claude Code from scratch because they block us from accessing system prompts, and I NEED these agents to stop lying to me about things that are not done or are merely assumed. That highlights the true chaos possible when this is applied to system-critical operations in governance or at scale.

    I also built my own Codex-like tool for managing agent tasks and making this simpler, but getting the agents to use it without getting confused is still a gap.

    Let me know if you have any other questions. As of today I am performing the work of 20 engineers: in 4 weeks, by myself, I rewrote 2 years of back-end code that had required the full-time work of a team of 2 engineers... so I am, I guess, quite good at it.

    I need to push my edges further into this latest tech; I have not tried the Codex CLI or the new tool yet.

    replies(1): >>44007336 #
    14. IXCoach No.44007336
    It's a total of about 30 snippets, averaging 6 paragraphs long, that I have to inject. For each role switch it goes through, I have to re-inject them.

    It's a pain, but it works.

    Even with TDD it will hallucinate the mocks without management, and hallucinate the requirements. Each layer has to be checked atomically, but the text expander snippets, done right, can get it close to 75% right.

    My main project faces 5,000 users, so I can't let the agents run freely there; with isolated projects in separate repos I can let them run more freely, then review everything in GitKraken before committing.

    replies(1): >>44008428 #
    15. arkmm No.44007383
    I think most code these days is boilerplate, though the composition of boilerplate snippets can become something unique and differentiated.
    16. evilduck No.44007699
    It may depend on what you consider boilerplate. I use them quite a bit for scripting outside of direct product-code development. Essentially, AI coding tools have shifted this chart's decision-making math for me: https://xkcd.com/1205/ The cost of automating a manual task is now significantly lower, so I end up automating more of them.
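
    That chart boils down to simple arithmetic: seconds saved per run × runs per day × days in the horizon is the total time you can justify spending on the automation. A quick sketch with made-up numbers:

        # Back-of-the-envelope version of the xkcd 1205 trade-off.
        # All numbers below are made up for illustration.
        def automation_budget_hours(seconds_saved, runs_per_day,
                                    horizon_days=5 * 365):
            # Total time you can justify spending to automate the task away.
            return seconds_saved * runs_per_day * horizon_days / 3600

        # A 30-second chore done 5 times a day justifies ~76 hours of effort
        # over five years; if an LLM cuts the scripting time from 10 hours
        # to 1, far more chores clear the bar.
        print(round(automation_budget_hours(30, 5), 1))  # -> 76.0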
    17. Rudybega No.44008428{3}
    You could just use something like Roo Code with custom modes rather than manually injecting the snippets. The orchestrator mode can decide on the appropriate modes to use for subtasks.

    You can customize the system prompts, baseline prompts, and models used for every single mode, and have as many or as few modes as you want.

    18. lispisok No.44009108
    A lot of people are deeply invested in these things being better than they really are, from OpenAI and Google spending hundreds of billions of dollars EACH developing LLMs, to VC-backed startups promising that their "AI agent" can replace entire teams of white-collar employees. That's why your experience matches mine and that of every other developer I personally know, yet you see comments everywhere making much grander claims.
    replies(2): >>44009789 #>>44009997 #
    19. triMichael No.44009789
    I agree, but I'd add that it's not just the tech giants who want them to be better than they are; it's also non-programmers.

    IMO LLMs are actually pretty good at writing small scripts. First, it's much more common for a small script to be in the LLM's training data, and second, it's much easier to find and fix a bug. So the LLM really does allow a non-programmer to write correct code with minimal effort (for some simple task), and then they are blown away, thinking writing software is a solved problem. However, these people have no idea of the difference between a hundred-line script, where an error is easily found and isn't a big deal, and a million-line codebase, where an error can be invisible and shut everything down.

    Worst of all is when the two sides, tech giants and non-programmers, meet. They may sound like opposites, but they really aren't. In particular, there are plenty of non-programmers at the C-level and HR levels of tech companies. These people are particularly vulnerable to being wowed by LLMs seemingly doing complex tasks that, in their minds, are the same tasks their employees are doing. As a result, they stop hiring new people and tell their current people to "just use LLMs", leading to the current hiring crisis.

    20. alfalfasprout No.44009987
    > I have now created 5 hyper-narrow programs that are used daily by my company to do work. I am not a programmer, and my company is not a tech company located in a tech bubble. We are a tiny company that does old-school manufacturing.

    OK, great.

    > That you are trying to use LLMs to create the giant, sprawling, feature-packed software packages that define the modern software landscape. What's being missed is that any one user might only use 5% of the code base on any given day. Software is written to accommodate every need every user could have in one package, and then each user uses only the small slice that fits their specific needs.

    With all due respect, the fact that you made a few small programs to help with your tasks is wonderful, but this last statement alone rather disqualifies you from making an assessment of software engineering in general.

    There's a great number of reasons why codebases get large. Complex problems inherently come with complexity and scale in both code and integrations. You can choose to move the complexity around but never fully get rid of it.

    replies(1): >>44010510 #
    21. alfalfasprout No.44009997
    TBH, this website has attracted an increasingly non-technical audience in the last few years. And the field in general has attracted a lot of less experienced folks who don't understand the implications of what they're doing. I don't mean that as a diss, just as a reflection of reality.

    Indeed, even Codex (and I've been using it prior to this release) is not remotely at the level of even a junior engineer outside of a narrow set of tasks.

    22. mupuff1234 No.44010510{3}
    But how much of the software industry is truly solving inherently complex problems?

    At a very conservative guess I'd say no more than 10% (and my actual guess would be <1%).

    23. kypro No.44012169
    Firstly, LLM chat interfaces != agentic coding platforms.

    ChatGPT is good for asking questions about languages, SDKs, and APIs, or for generating boilerplate, but it's useless if you want to hand an AI a ticket and have it raise PRs for you.

    This is where you need agentic solutions like Codex, which will be far more useful because they actually have access to your codebase and a dev environment where they can test and debug changes.

    They still do really dumb things, but a lot of this can be avoided if you prompt well and give it the right types of problems to solve.

    In my experience, at the moment there's a sweet spot with these agentic coding platforms that makes them useful for semi-complicated tasks: assuming you prompt well, they can generate 90% of the code you need; then you just need to spend the extra 10% fixing it up before it's ready for prod.

    For tasks that are too simple (a few lines), it's a waste of time: you spend longer prompting and going back and forth with the agent than it would take to just make the change yourself.

    And coding agents really struggle with very complicated tasks, especially tasks that require some thought about architecture and performance. That's less because they can't do them, and more because for certain problems simply meeting the ACs is far less important than how the ACs are met. Ideally you want to get the architecture right first; once that's in place, you can break down the remaining work for the AI to pick up.

    24. elyase No.44014213
    What you're missing is how to use the tools properly. With solid documentation, good project-management practices, a well-organized code structure, and tests, any junior engineer should be able to read up on your codebase, write linted code following your codebase's style, verify it via tests, and write you a report of what was done, challenges faced, etc. State-of-the-art coding agents will do that at superhuman speed.

    If you haven't set things up properly (important info lives only in people’s heads and meetings, tasks don't have clear acceptance criteria, ...), then you aren't ready for Junior Developers yet. You need to wait until your Coding Agents are at Senior level.