858 points cryptophreak | 17 comments
wiremine ◴[] No.42936346[source]
I'm going to take a contrarian view and say it's actually a good UI, but it's all about how you approach it.

I just finished a small project where I used o3-mini and o3-mini-high to generate most of the code. I averaged around 200 lines of code an hour, including the business logic and unit tests. The total was around 2,200 lines. So, not a big project, but not a throwaway script either. The code was perfectly fine for what we needed. This is the third time I've done this, and each time I get faster and better at it.

1. I find a "pair programming" mentality is key. I focus on the high-level code, and let the model focus on the lower level code. I code review all the code, and provide feedback. Blindly accepting the code is a terrible approach.

2. Generating unit tests is critical. After I like the gist of some code, I ask for some smoke tests. Again, peer review the code and adjust as needed.

3. Be liberal with starting a new chat: the models can get easily confused with longer context windows. If you start to see things go sideways, start over.

4. Give it code examples. Don't prompt with English only.
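
A sketch of what point 4 can look like in practice (the helper, endpoint, and requested function name below are all made up for illustration): paste real code as a style anchor, then ask for the new piece relative to it.

  # Hypothetical prompt material, not from any real codebase:
  # "Here is our existing helper. Write fetch_user_orders() in the same
  # style (same error handling, same return shape), plus smoke tests."
  import requests

  def fetch_user_profile(user_id: int) -> dict:
      """Existing code pasted into the prompt as a concrete anchor."""
      resp = requests.get(f"https://api.example.com/users/{user_id}", timeout=5)
      resp.raise_for_status()
      return resp.json()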

FWIW, o3-mini was the best model I've seen so far; Sonnet 3.5 New is a close second.

replies(27): >>42936382 #>>42936605 #>>42936709 #>>42936731 #>>42936768 #>>42936787 #>>42936868 #>>42937019 #>>42937109 #>>42937172 #>>42937188 #>>42937209 #>>42937341 #>>42937346 #>>42937397 #>>42937402 #>>42937520 #>>42938042 #>>42938163 #>>42939246 #>>42940381 #>>42941403 #>>42942698 #>>42942765 #>>42946138 #>>42946146 #>>42947001 #
1. godelski ◴[] No.42937346[source]

  > I focus on the high-level code, and let the model focus on the lower level code.
Tbh the reason I don't use LLM assistants is that they suck at the "low level". They are okay at mid level and better at high level. I find their actual coding very mediocre and fraught with errors.

I've yet to see any model understand nuance or detail.

This is especially apparent in image models. Sure, they can do hands now, but they still don't get 3D space or temporal movement. They're great for scrolling through Twitter, but the longer you look, the more surreal they get. This even includes the new ByteDance model also on the front page.

Coding models are similar: they ignore the context of the codebase and the results feel like patchwork. They feel like what you'd be annoyed at a junior dev for writing, because not only do you have to go through 10 PRs to make it pass the test cases, the lack of context builds a lot of tech debt. They'll write unit tests that technically work but don't capture the actual issues, and that could usually be highly condensed while giving greater coverage. It feels very gluey, like copy-pasting from Stack Overflow while hyper-focused on the immediate outcome instead of understanding the goal. It is too "solution" oriented, doesn't grasp the underlying heuristics, and is more frustrating than dealing with the human equivalent who says something "works" as evidenced by the output. That's like claiming a math proof is correct by looking at just the last line.

Ironically, I think this is partly why the chat interface sucks too. A lot of our job is inferring what our managers are even asking us to make, and you can't even know the answer until you're partway in.

replies(3): >>42937928 #>>42938207 #>>42938298 #
2. lucasmullens ◴[] No.42937928[source]
> But with coding models they ignore context of the codebase and the results feel more like patchwork.

Have you tried Cursor? It has a great feature that grabs context from the codebase; I use it all the time.

replies(3): >>42938048 #>>42939067 #>>42939425 #
3. pc86 ◴[] No.42938048[source]
I can't get the prompt because I'm on my work computer, but I have about a three-quarter-page instruction set in Cursor's settings. It asks clarifying questions a LOT now, and it's pretty liberal about adding commented pseudo-code for anything it isn't sure about. You can still trip it up if you try, but it's a lot better than stock. This is with Sonnet 3.5 agent chats (Composer, I think it's called?).

I actually cancelled my Anthropic subscription when I started using Cursor, because I only ever used Claude for code generation anyway; now I just do it within the IDE.

replies(2): >>42947006 #>>43012127 #
4. wiremine ◴[] No.42938207[source]
> Tbh the reason I don't use LLM assistants is that they suck at the "low level". They are okay at mid level and better at high level. I find their actual coding very mediocre and fraught with errors.

That's interesting. I've found assistants like Copilot fairly good at low-level code, assuming you direct them well.

replies(1): >>42940634 #
5. yarekt ◴[] No.42938298[source]
> A lot of our job is to do a lot of inference in figuring out what our managers are even asking us to make

This is why I think LLMs can't really replace developers. 80% of my job is already trying to figure out what's actually needed, despite being given lots of text detail, maybe even a spec or prototype code.

Building the wrong thing fast is about as useful as not building anything at all. (And before someone says "at least you now know what not to do": for any problem there are an infinite number of wrong solutions but only a handful that yield success, so why waste time trying the wrong ones?)

replies(2): >>42939668 #>>42940612 #
6. godelski ◴[] No.42939067[source]
I have not. But I also can't get the general model to work well on even toy problems.

Here's a simple example with GPT-4o: https://0x0.st/8K3z.png

It probably isn't obvious on a quick read, but there are mistakes here. Maybe the most obvious is that the way `replacements` is built, the entries need to be applied in a carefully chosen order; that could be fixed by sorting. But is this even the right data structure? Not to mention that the algorithm itself is quite... odd.
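
To make the ordering bug concrete (the screenshot's exact task isn't reproduced here, so the dictionary below is an assumed stand-in): if shorter keys are substituted first, they corrupt the longer keys that contain them, and sorting longest-first fixes it.

  # Assumed stand-in for the kind of bug in the screenshot:
  replacements = {"1": "one", "10": "ten"}

  def naive(text):
      for old, new in replacements.items():  # "1" is applied before "10"
          text = text.replace(old, new)
      return text

  def fixed(text):
      # Longest-first ordering so "10" is handled before the "1" inside it.
      for old in sorted(replacements, key=len, reverse=True):
          text = text.replace(old, replacements[old])
      return text

  print(naive("10 items"))  # "one0 items" -- corrupted
  print(fixed("10 items"))  # "ten items"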

To give a more complicated example, I passed it the prompt from this famous code golf problem[0]. Here are the results; I'll save you the time: the output is wrong. https://0x0.st/8K3M.txt (note: I started command lines with "$" and added some notes for you)

Just for the heck of it, here's the same thing but with o1-preview

Initial problem: https://0x0.st/8K3t.txt

Codegolf one: https://0x0.st/8K3y.txt

As you can see, o1 is a bit better on the initial problem but still fails at the code golf one. It really isn't beating the baseline naive solution: it does 170 MiB/s compared to the baseline's 160 MiB/s (with -O3). This is something I'd hope it could do really well on, given that the problem is rather famous and so many occurrences of it should show up in the training data. There are tons of variations out there, and parallel fizzbuzz is common in classes on parallelization because it teaches important concepts like keeping the output in the right order.
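
For reference, a minimal sketch of that ordering concern (a Python stand-in; the actual code-golf entries and the baseline are C compiled with -O3): chunks are rendered concurrently but must be written out strictly in order.

  # Parallel fizzbuzz sketch: workers format disjoint chunks concurrently;
  # pool.map() yields results in submission order, keeping stdout sequential.
  import sys
  from concurrent.futures import ProcessPoolExecutor

  CHUNK = 100_000

  def render(start):
      out = []
      for n in range(start, start + CHUNK):
          if n % 15 == 0:
              out.append("FizzBuzz")
          elif n % 3 == 0:
              out.append("Fizz")
          elif n % 5 == 0:
              out.append("Buzz")
          else:
              out.append(str(n))
      return "\n".join(out) + "\n"

  if __name__ == "__main__":
      with ProcessPoolExecutor() as pool:
          for chunk in pool.map(render, range(1, 10 * CHUNK, CHUNK)):
              sys.stdout.write(chunk)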

But hey, at least o1 has the correct output... It's just that that's not all that matters.

I stand by this: evaluating code based on output alone is akin to evaluating a mathematical proof based only on its result. I hope these examples make it clear why that matters, and why checking output is insufficient.
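
A contrived sketch of that point (not from the thread): a function whose outputs pass a smoke test while the logic inside is wrong.

  def is_prime(n):
      # Wrong: only rules out multiples of 2 and 3, so 25 passes as prime.
      return n in (2, 3) or (n > 1 and n % 2 != 0 and n % 3 != 0)

  assert is_prime(7)       # passes
  assert not is_prime(9)   # passes
  # assert not is_prime(25)  # would fail: good-looking output, wrong code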

[0] https://codegolf.stackexchange.com/questions/215216/high-thr...

Edit: I want to add an important factor here. The LLM might get you a "result" faster, but you are much more likely to miss the learning that comes from struggling, and that learning is what makes you much faster (and more flexible) not just next time but in every situation that is even partly similar. It's totally fine to glue shit together when you don't care and just need something, but there's a lot of missed value if you ever need to revisit any of it.

I do have concerns that people will plateau at junior level. I hope it doesn't cause seniors to revert to juniors, which I've seen happen even without LLMs: if you stop working on these types of problems, you lose the skills. There's already an issue where we rush to get output, and it has clear effects on the stagnation of devs. We have far more programmers than ever, but I'm not confident we have significantly more wizards (the percentage of wizards is decreasing). Fewer people write programs just for fun, and "for fun" is one of our greatest learning tools as humans. Play is a common trait in animals, and it exists for a reason.

7. troupo ◴[] No.42939425[source]
> It has a great feature that grabs context from the codebase, I use it all the time.

If only this feature worked consistently, or reliably even half of the time.

It will casually forget or ignore any and all context and files in your codebase at random times, and you never know what set of files and docs it's working with at any point in time.

8. TeMPOraL ◴[] No.42939668[source]
> For any problem there are infinite number of wrong solutions, but only a handful of ones that yield success, why waste time trying all the wrong ones?

Devil's advocate: because unless you're working in a heavily dysfunctional organization, or are doing a live coding interview, you're not playing "guess the password" with your management. Most of the time, they have even less of a clue what the right solution looks like! "Building the wrong thing" lets them diff something concrete against what they imagined it would be, forcing them to clarify their expectations and give you more accurate directions (which, being a diff against a concrete thing, are less likely to then be misunderstood by you!). And the faster you can build that wrong thing, the less money and time is burned to buy that extra clarity.

replies(3): >>42943092 #>>42943114 #>>43068049 #
9. godelski ◴[] No.42940612[source]

  > Building the wrong thing fast is about as useful as not building anything at all.
SAY IT LOUDER

Fully agree. Plus, you may be faster in the short term, but you won't be in the long run. The effects of both good code and bad code compound. "Tech debt" is just a fancy term for "compounding shit". And it is true that all code is shit, but it isn't binary; there is a big difference between stepping in shit and being waist-deep in it.

I can predict some of the responses

  Premature optimization is the root of all evil
There's a grave misunderstanding of this adage[0]: many interpret it as "don't worry about efficiency, worry about output." But the context is that you shouldn't optimize without first profiling the code, not that you shouldn't optimize at all![1] I also find it funny revisiting this quote, because it reads like it was written by a stranger in a strange land where programmers are overly concerned with optimizing their code. These days I hear very little about optimization (except when I work with HPC people), other than people saying not to optimize. Explains why everything is so sluggish...

[0] https://softwareengineering.stackexchange.com/a/80092

[1] Understanding the limitations of big-O analysis really helps in understanding why this point matters. When n is small, you can have worse big-O and still be faster, because the constants we drop are often not a rounding error. https://csweb.wooster.edu/dbyrnes/cs200/htmlNotes/qsort3.htm
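
As a minimal illustration of footnote [1] (the exact crossover point is machine-dependent): a pure-Python O(n^2) insertion sort beating an O(n log n) merge sort on tiny inputs.

  # Dropped constants matter: for small n, O(n^2) insertion sort often
  # beats O(n log n) merge sort, despite the "worse" asymptotics.
  import random, timeit

  def insertion_sort(a):
      a = a[:]
      for i in range(1, len(a)):
          key, j = a[i], i - 1
          while j >= 0 and a[j] > key:
              a[j + 1] = a[j]
              j -= 1
          a[j + 1] = key
      return a

  def merge_sort(a):
      if len(a) <= 1:
          return a
      mid = len(a) // 2
      left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
      out, i, j = [], 0, 0
      while i < len(left) and j < len(right):
          if left[i] <= right[j]:
              out.append(left[i]); i += 1
          else:
              out.append(right[j]); j += 1
      return out + left[i:] + right[j:]

  data = [random.random() for _ in range(10)]
  print("insertion:", timeit.timeit(lambda: insertion_sort(data), number=100_000))
  print("merge:    ", timeit.timeit(lambda: merge_sort(data), number=100_000))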

10. godelski ◴[] No.42940634[source]
I have a response to a sibling comment showing where GPT-4o and o1-preview do not yield good results.

  > assuming you direct it well.
But hey, I admit I might not be good at this. Honestly, though, I've found greater value in reading the docs than in trying to prompt-engineer my way through, and I've given a fair amount of time to trying to get good at prompting. I just can't get it to work.

I do think coding with an LLM _feels_ faster, but when I've timed myself, it doesn't seem to be. It just seems to be less effort (and I don't mind the effort, especially given the compounding rewards).

11. bcoates ◴[] No.42943092{3}[source]
It's just incredibly inefficient if there's any alternative.

Doing 4 sprints over 2 months to make a prototype, in order to save three 60-minute meetings over a week in which you'd do a few requirements analysis/proposal review cycles.

replies(2): >>42946073 #>>42946781 #
12. godelski ◴[] No.42943114{3}[source]
I don't think you're disagreeing; in fact, I think you're agreeing. Ironically, the fact that one of us must be wrong about that demonstrates the difficulty of chat-based communication. I believe yarekt would agree with me that

  > you can't even know the answer until you're part way in.
Which it seems you do too. But for clarity, there's a big difference between building the /wrong/ thing and /not the right/ thing. The underlying point of my comment is that not only is communication difficult, the overall goals are ambiguous, and a lot of time should be dedicated to getting this right. Yes, that involves finding out which things are wrong, and that is the sentiment behind the original meaning of "fail fast", but I think that term has come to mean something a bit different now. Moreover, I believe a lot of people simply aren't looking at details.

It is really hard to figure out what the right thing is. We humans don't do this just through chat. We experiment, discuss, argue, draw, and rely on tons of inference and shared understanding and context. You're right that a dysfunctional organization (not uncommon) is worse, but these things are still quite common in highly functional organizations. Your explicit statement that management has even less of an idea of what the right solution looks like is exactly what we're pushing back against: figuring that out is a large part of a developer's job. I would argue that the main reason we have a better idea is our technical skill, depth of knowledge, and experience. A compression machine (LLM) will get some of this, but there's a big gap when trying to get to the end point; Pareto is a bitch. We all know there is a huge difference between a demonstration prototype and an actual product, and that the effort and resources required differ exponentially. ML systems specifically struggle with detail and nuance, and that is exactly where those resource differences live.

I'll give an example for clarity. Consider the iPad: the existence of third-party note-taking apps can be interpreted as nothing short of Apple's failure. I mean, for the love of god, you have the Pencil and you can't pull up Notes and interact with it like a piece of paper? That's how the damned thing is advertised! A third-party note-taking app should be read by Apple as exposing their weak points. But you can't even zoom in the Notes app?! Sure, you can turn on the accessibility setting and zoom with a triple tap (significantly diverging from the pinch gesture used literally everywhere else), but if you do (assuming full screen) you are just zooming in on a portion of the actual screen, not zooming within the note. So you get stupid results like losing access to your pen's settings, which matters here because the likely reason someone zooms is to adjust details, and you'll certainly want to adjust the eraser size.

What I'm trying to say is that there's a lot of low-hanging fruit that would be incredibly obvious if they actually used the application, i.e. dog-fooding. Instead, Apple is dedicating time to handwriting recognition and equation solving, which in practice (at least in my experience) create a more jarring experience and cause more editing, though it is cool when it works. I'd say that here, Apple is not building the right thing. They are completely out of touch with the actual goals and experiences of their users. It's not that they didn't build a perfect app; it's that they failed to build basic functionality.

But of course, Apple probably doesn't care, because they significantly prioritize profits over building a quality product. These are orthogonal aspects that can be simultaneously optimized; one should not need to pick one over the other, and in truth our economics should ensure alignment: quality should beget profits, and the metrics shouldn't be "hackable".

Apple is far from alone here, though. I'd say this "low-hanging infuriating bullshit" is actually quite common; in fact, I think it is becoming more common. I have argued here before for the need for more "grumpy developers." If you're not grumpy, you should be concerned. Our job is to find problems, break them down into something addressable, and resolve them. The "grumpiness" here is dissatisfaction with the state of things, and since nothing is perfect, there is always reason to be "grumpy." A good developer should be able to identify and fix problems without being asked. I do think there's a worrying decline in "grumpy" types, and I have no doubt it's connected to the rapid rise of vaporware and shitty products.

Also, I notice you play Devil's advocate a lot. While it can be useful, I think it can be overused: it should drive home the key limitations of an argument, especially when they are uncomfortable. Though in our case, I'm the one making the argument that diverges from the norm.

13. TeMPOraL ◴[] No.42946073{4}[source]
Yeah, that would be stupid. I was thinking an order of magnitude less effort. If you can make a prototype in a day, it might deliver way more value than three 60-minute meetings. If you can make it in a week, where the proper implementation would take more than a month, that could still be a huge win.

I see this not as opposed to requirements analysis/review but as part of it: working in the abstract, with imagination and prose and diagrams, it's too easy to make invalid assumptions without anyone realizing it.

14. andreasmetsala ◴[] No.42946781{4}[source]
> Doing 4 sprints over 2 months to make a prototype

That’s a lot of effort for a prototype that you should be throwing away even if it does the right thing!

Are you sure you’re not gold plating your prototypes?

15. slig ◴[] No.42947006{3}[source]
I'm very interested in your prompt. Could you be so kind as to paste it somewhere and link it in your comment, please?
16. justneedaname ◴[] No.43012127{3}[source]
Also interested to see this
17. yarekt ◴[] No.43068049{3}[source]
Sure, but you're always going to get better results by actively looking for the good solution than by building something and hoping it's right (building a prototype is one tool in your toolbox).

We as developers are tasked with figuring out what the problem is, especially in cases where the client is wrong about their own assumptions.