466 points 0x63_Problems | 105 comments
1. perrygeo ◴[] No.42138092[source]
> Companies with relatively young, high-quality codebases benefit the most from generative AI tools, while companies with gnarly, legacy codebases will struggle to adopt them. In other words, the penalty for having a ‘high-debt’ codebase is now larger than ever.

This mirrors my experience using LLMs on personal projects. They can provide good advice only to the extent that your project stays within the bounds of well-known patterns. As soon as your codebase gets a little bit "weird" (i.e. trying to do anything novel and interesting), the model chokes, starts hallucinating, and makes your job considerably harder.

Put another way, LLMs make the easy stuff easier, but royally screw up the hard stuff. The gap does appear to be widening, not shrinking. They work best where we need them the least.

replies(24): >>42138267 #>>42138350 #>>42138403 #>>42138537 #>>42138558 #>>42138582 #>>42138674 #>>42138683 #>>42138690 #>>42138884 #>>42139109 #>>42139189 #>>42140096 #>>42140476 #>>42140626 #>>42140809 #>>42140878 #>>42141658 #>>42141716 #>>42142239 #>>42142373 #>>42143688 #>>42143791 #>>42151146 #
2. RangerScience ◴[] No.42138267[source]
Eh, it’s been kinda nice to just hit tab-to-complete on things like formulaic (but comprehensive) test suites, etc.

I never wanted the LLM to take over the (fun) part - thinking through the hard/unusual parts of the problem - but you’re also not wrong that they’re needed the least for the boilerplate. It’s still nice :)

replies(2): >>42138489 #>>42138729 #
3. dcchambers ◴[] No.42138350[source]
Like most of us it appears LLMs really only want to work on greenfield projects.
replies(2): >>42138525 #>>42138978 #
4. ◴[] No.42138403[source]
5. hyccupi ◴[] No.42138489[source]
> It’s still nice :)

This is the thing about the kind of free advertising so many on this site provide for these llm corpos.

I’ve seen so many comparisons between “ai” and “stack overflow” that mirror this sentiment of “it’s still nice :)”.

Who’s laying off and replacing thousands of working staff for “still nice :)” or because of “stack overflow”?

Who’s hiring former alphabet agency heads to their board for “still nice :)”?

Who’s forcing these services into everything for “still nice :)”?

Who’s raising billions for “still nice :)”?

So while developers argue tooth and nail for these tools that they seemingly think everyone only sees through their personal lens of a “still nice :)” developer tool, the companies are leveraging that effort to oversell their product beyond the scope of “still nice :)”.

6. hyccupi ◴[] No.42138525[source]
Good joke, but the reality is they falter even more on truly greenfield projects.

See: https://news.ycombinator.com/item?id=42134602

replies(2): >>42138662 #>>42138664 #
7. TOGoS ◴[] No.42138537[source]
> They work best where we need them the least.

Just like most of the web frameworks and ORMs I've been forced to use over the years.

8. comboy ◴[] No.42138558[source]
Same experience, but I think it's going to change. As models get better, their context window keeps growing while mine stays the same.

To be clear, our context window can be really huge if you are living the project. But not if you are new to it or even getting back to it after a few years.

replies(1): >>42138736 #
9. graycat ◴[] No.42138582[source]
I suspected some of that, and your explanation looks more general and good.

Or, for a joke, LLMs plagiarize!

10. MrMcCall ◴[] No.42138662{3}[source]
That is because, by definition, their models are based upon the past. And woe unto thee if that training data was not pristine. Error propagation is a feature; it's a part of the design, unless one is suuuuper careful. As some have said, "Fools rush in."
replies(1): >>42140455 #
11. anthonyskipper ◴[] No.42138664{3}[source]
I agree with this. But the reason is that AI does better the more constrained it is, and existing codebases come with constraints.

That said, if you are using Gen AI without an advanced RAG system feeding it lots of constraints and patterns/templates, I wish you luck.

12. irrational ◴[] No.42138674[source]
I was recently assigned to work on a huge legacy ColdFusion backend service. I was very surprised at how useful AI was with code. It was even better, in my experience, than I've seen with python, java, or typescript. The only explanation I can come up with is there is so much legacy ColdFusion code out there that was used to train Copilot and whatever AI jetbrains uses for code completion that this is one of the languages they are most suited to assist with.
replies(4): >>42139225 #>>42139249 #>>42139393 #>>42139543 #
13. slt2021 ◴[] No.42138683[source]
maybe it's a signal that your software should be restructured into modules that fit well-established patterns.

it's like you are building a website that's not using MVC and complaining that the LLM advice is garbage...

replies(1): >>42139787 #
14. anthonyskipper ◴[] No.42138690[source]
This is only partly true. AI works really well on very legacy codebases like cobol and mainframe, and it's very good at converting that to modern languages and architectures. It's all the stuff from like 2001-2015 that it gets weird on.
replies(1): >>42138720 #
15. dartos ◴[] No.42138720[source]
> AI works really well on very legacy codebases like cobol and mainframe

Any sources? Seems unlikely that LLMs would be good at something with so little training data in the widely available internet.

replies(1): >>42140055 #
16. perrygeo ◴[] No.42138729[source]
True, if you're using LLMs as a completion engine or to generate scaffolding it's still very useful! But we have to acknowledge that's by far the easiest part of programming. IDEs and deterministic dev tools have done that (very well) for decades.

The LLM gains are in efficiency for rote tasks, not solving the other hard problems that make up 98% of the day. The idea that LLMs are going to advance software in any substantial way seems implausible to me - It's an efficiency tool in the same category as other IDE features, an autocomplete search engine on steroids, not even remotely approaching AGI (yet).

replies(1): >>42143100 #
17. MrMcCall ◴[] No.42138736[source]
Here's the secret to grokking a software project: a given codebase is not understandable without understanding how and why it was built; i.e. if you didn't build it, you're not going to understand why it is the way it is.

In theory, the codebase should be, as it is, understandable (and it is, with a great deal of rigorous study). In reality, that's simply not the case, not for any non-trivial software system.

replies(3): >>42139319 #>>42139451 #>>42142633 #
18. jamil7 ◴[] No.42138884[source]
> This mirrors my experience using LLMs on personal projects. They can provide good advice only to the extent that your project stays within the bounds of well-known patterns.

I agree, but I find it's still a great productivity boost for certain tasks. Cutting through the hype and figuring out which tasks are well suited to these tools, and how to prompt optimally, has taken me a long time.

replies(1): >>42139011 #
19. benatkin ◴[] No.42138978[source]
The site also suggests LLMs care a great deal one way or another.

"Unlock a codebase that your engineers and AI love."

https://www.gauge.sh/

I think they do often act opinionated and show some decision-making ability, so AI alignment really is important.

replies(1): >>42140471 #
20. pydry ◴[] No.42139011[source]
I hear people say this a lot but invariably the tasks end up being "things you shouldn't be doing".

E.g. pointing the AI at your code and getting it to write unit tests or writing more boilerplate, faster.

21. cheald ◴[] No.42139109[source]
The niche I've found for LLMs is for implementing individual functions and unit tests. I'll define an interface and a return (or a test name and expectation) and say "this is what I want this to do", and let the LLM take the first crack at it. Limiting the bounds of the problem to be solved does a pretty good job of at least scaffolding something out that I can then take to completion. I almost never end up taking the LLM's autocompletion at face value, but having it written out to review and tweak does save substantial amounts of time.

The other use case is targeted code review/improvement. "Suggest how I could improve this" fills a niche which is currently filled by linters, but can be more flexible and robust. It has its place.

The fundamental problem with LLMs is that they follow patterns, rather than doing any actual reasoning. This is essentially the observation made by the article; AI coding tools do a great job of following examples, but their usefulness is limited to the degree to which the problem to be solved maps to a followable example.

replies(3): >>42140322 #>>42143531 #>>42143847 #
22. cloverich ◴[] No.42139189[source]
For me same experience but opposite conclusion. LLM saves me time by being excellent at yak shaving, letting me focus on the things that truly need my attention.

It would be great if they were good at the hard stuff too, but if I had to pick, the basics are where I want them the most. My brain just really dislikes that stuff, and I find it challenging to stay focused and motivated on those things.

replies(2): >>42140617 #>>42141639 #
23. randomdata ◴[] No.42139225[source]
Perhaps it is the reverse: That ColdFusion training sources are limited, so it is more likely to converge on a homogenization?

While, casually, we usually think of a programming language as being one thing, in reality a programming language generally only specifies a syntax. All of the other features of a language emerge from the people using them. And because of that, two different people can end up speaking two completely different languages even when sharing the same syntax.

This is especially apparent when you witness someone who is familiar with programming in language X, who then starts learning language Y. You'll notice, at least at first, they will still try to write their programs in language X using Y syntax, instead of embracing language Y in all its glory. Now, multiply that by the millions of developers who will touch code in a popular language like Python, Java, or Typescript and things end up all over the place.

So while you might have a lot more code to train on overall, you need a lot more code for the LLM to be able to discern the different dialects that emerge out of the additional variety. Quantity doesn't imply quality.

replies(1): >>42139415 #
24. mdtancsa ◴[] No.42139249[source]
Similar experience with Perl scripts being rewritten into Golang. Crazy good experience with Claude.
25. thfuran ◴[] No.42139319{3}[source]
So your secret to understanding code is: Abandon hope all ye who enter here?
replies(2): >>42139724 #>>42142422 #
26. cpeterso ◴[] No.42139393[source]
But where did these companies get the ColdFusion code for their training data? Since ColdFusion is an old language and used for backend services, how much ColdFusion code is open source and crawlable?
replies(2): >>42140959 #>>42141919 #
27. cpeterso ◴[] No.42139415{3}[source]
I wonder what a language designed as a target for LLM-generated code would look like? What semantics and syntax would help the LLM generate code that is more likely to be correct and maintainable by humans?
replies(1): >>42143160 #
28. gwervc ◴[] No.42139451{3}[source]
Too bad most projects don't document any of those decisions.
29. eqvinox ◴[] No.42139543[source]
That's great, but a sample size of 1, and AI utility is also self-confirmation-biasing. If the AI stops providing useful output, you stop using it. It's like "what you're searching is always in the last place you look". After you recognize AI's limits, most people wouldn't keep trying to ask it to do things they've learned it can't do. But still, there's an area of things it does, and a (ok, fuzzy) boundary of its capabilities.

Basically, for any statement about AI helpfulness, you need to quantify how far it can help you. Depending on your personality, anything else is likely either always a success (if you have a positive outlook) or a failure (if you focus on the negative).

30. marcosdumay ◴[] No.42139724{4}[source]
Oh, you will understand why things were built. It's inevitable.

And all of that understanding will come from people complaining about you fixing a bug.

31. marcosdumay ◴[] No.42139787[source]
No, you shouldn't restructure your software into highly-repetitive noise so that a dumb computer can guess what comes next.
replies(1): >>42141132 #
32. true_religion ◴[] No.42140055{3}[source]
LLMs are good at taking the underlying structure of one medium and repeating it using another medium.
replies(1): >>42146828 #
33. munk-a ◴[] No.42140096[source]
> Put another way, LLMs make the easy stuff easier, but royally screw up the hard stuff.

This is my experience with generation as well - but I still don't trust it for the easy stuff and thus the model ends up being a hindrance in all scenarios. It is much easier for me to comprehend something I'm actively writing so making sure a generative AI isn't hallucinating costs more than me just writing it myself in the first place.

34. MarcelOlsz ◴[] No.42140322[source]
Can't tell you how much I love it for testing, it's basically the only thing I use it for. I now have a test suite that can rebuild my entire app from the ground up locally, and works in the cloud as well. It's a huge motivator actually to write a piece of code with the reward being the ability to send it to the LLM to create some tests and then seeing a nice stream of green checkmarks.
replies(3): >>42140464 #>>42140879 #>>42143641 #
35. Terr_ ◴[] No.42140455{4}[source]
Or, in comic form: https://www.smbc-comics.com/comic/rise-of-the-machines
36. beeflet ◴[] No.42140471{3}[source]
Remember to tip your LLM

https://minimaxir.com/2024/02/chatgpt-tips-analysis/

37. zer8k ◴[] No.42140476[source]
> the model chokes, starts hallucinating, and makes your job considerably harder.

Coincidentally this also happens with developers in unfamiliar territory.

replies(1): >>42142868 #
38. davidsainez ◴[] No.42140617[source]
Yep, I'm building a dev tool that is based on this principle. Let me focus on the hard stuff, and offload the details to an AI in a principled manner. The current crop of AI dev tools seem to fall outside of this sweet spot: either they try to do everything, or act as a smarter code completion. Ideally I will spend more time domain modeling and less time "coding".
39. glouwbug ◴[] No.42140626[source]
Ironically enough I’ve always found LLMs work best when I don’t know what I’m doing
replies(2): >>42141007 #>>42141805 #
40. yieldcrv ◴[] No.42140809[source]
as the context windows get larger and the UX for analyzing multiple files gets better, I’ve found them to be pretty good

But they still fail at devops because so many config scripts are at newer versions than the training set

41. ◴[] No.42140878[source]
42. highfrequency ◴[] No.42140879{3}[source]
> I now have a test suite that can rebuild my entire app from the ground up

What does this mean?

replies(1): >>42141059 #
43. PaulHoule ◴[] No.42140932{4}[source]
I had Codeium add something to a function that added a new data value to an object. Unbidden, it wrote three new tests, good tests. I wrote my own test by cutting and pasting a test it wrote with a modification; it pointed out that I didn't edit the comment, so I told it to do so.

It also screwed up the imports of my tests pretty bad, some imports that worked before got changed for no good reason. It replaced the JetBrains NotNull annotation with a totally different annotation.

It was able to figure out how to update a DAO object when I added a new field. It got the type of the field wrong when updating the object corresponding to a row in that database column, even though it wrote the Liquibase migration and should have known the type; we had chatted plenty about that migration.

It got many things right but I had to fix a lot of mistakes. It is not clear that it really saves time.

replies(2): >>42141053 #>>42141069 #
44. irrational ◴[] No.42140959{3}[source]
That's a good question. I presume there is some way to check GitHub for how much code in each language is available on it.
45. mindok ◴[] No.42141007[source]
Is that because you can’t judge the quality of the output?
replies(1): >>42150194 #
46. imp0cat ◴[] No.42141053{5}[source]
Let's be clear here, Codeium kinda sucks. Yeah, it's free and it works, somewhat. But I wouldn't trust it much.
47. MarcelOlsz ◴[] No.42141059{4}[source]
Sorry, should have been more clear. Firebase is (or was) a PITA when I started the app I'm working on a few years ago. I have a lot of records in my db that I need to validate after normalizing the data. I used to have an admin page that spit out a bunch of json data with some basic filtering and self-rolled testing that I could verify at a glance.

After a few years off from this project, I refactored it all, and part of that refactoring was building a test suite that I can run. When run, it will rebuild, normalize, and verify all the data in my app (scraped data).

When I deploy, it will also run these tests and then email if something breaks, but skip the seeding portion.

I had plans to do this before but the firebase emulator still had a lot of issues a few years ago, and refactoring this project gave me the freedom to finally build a proper testing environment and make my entire app make full use of my local firebase emulator without issue.

I like giving it my test cases in plain English. It still gets them wrong sometimes but 90% of the time they are good to go.

48. MarcelOlsz ◴[] No.42141069{5}[source]
Try using Cursor with the latest claude-3-5-sonnet-20241022.
replies(2): >>42141154 #>>42148341 #
49. MarcelOlsz ◴[] No.42141096{4}[source]
It's containerized and I have a script that takes care of everything from the ground up :) I've tested this on multiple OSes and friends' computers. I'm thankful to old me for writing a readme for current me lol.

>Please tell me this is satire.

No. I started doing TDD. It's fun to think about a piece of functionality, write out some tests, and then slowly make it pass. Removes a lot of cognitive load for me and serves as a micro todo. It's also nice that when you're working and think of something to add, you can just write out a quick test for it and add it to kanban later.

I can't tell you how many times I've worked on projects that are gigantic in complexity and don't do any testing, or use typescript, or both. You're always like 25% paranoid about everything you do and it's the worst.

replies(1): >>42145900 #
50. slt2021 ◴[] No.42141132{3}[source]
I am a proponent of Clean and Simple architecture that follows standard patterns, because it is easier to maintain. There should be no clever tricks or architecture.

All software architecture should be boring and simple, with as few tricks as possible, unless it is absolutely warranted.

replies(2): >>42142588 #>>42144194 #
51. PaulHoule ◴[] No.42141154{6}[source]
Unfortunately I “think different” and use Windows. I use Microsoft Copilot and would say it is qualitatively similar to Codeium in quality; a real quantitative eval would be a lot of work.
replies(1): >>42141165 #
52. MarcelOlsz ◴[] No.42141165{7}[source]
Cursor (cursor.com) is just a vscode wrapper, should work fine with Windows. If you're already in the AI coding space I seriously urge you to at least give it a go.
replies(1): >>42141385 #
53. PaulHoule ◴[] No.42141385{8}[source]
I'll look into it.

I'll add that my experience with the Codeium plugin for IntelliJ is night and day different from the Windsurf editor from Codeium.

The first one "just doesn't work" and struggles to see files that are in my project, the second basically works.

replies(1): >>42141405 #
54. MarcelOlsz ◴[] No.42141405{9}[source]
You can also look into https://www.greptile.com/ to ask codebase questions. There's so many AI coding tools out there now. I've heard good things about https://codebuddy.ca/ as well (for IntelliJ) and https://www.continue.dev/ (also for IntelliJ).

>The first one "just doesn't work"

Haha. You're on a roll.

55. imiric ◴[] No.42141639[source]
> LLM saves me time by being excellent at yak shaving, letting me focus on the things that truly need my attention.

But these tools often don't generate working, let alone bug-free, code. Even for simple things, you still need to review and fix it, or waste time re-prompting them. All this takes time and effort, so I wonder how much time you're actually saving in the long run.

56. archy_ ◴[] No.42141658[source]
I've noticed the same and wonder if this is the natural result of public codebases on average being simpler, since small projects will always outnumber bigger ones (at least if you ignore forks with zero new commits).

If high quality closed off codebases were used in training, would we see an improvement in LLM quality for more complex use cases?

57. yodsanklai ◴[] No.42141716[source]
I use ChatGPT the most when I need to make a small change in a language I'm not fluent in, but I have a clear understanding of the project and what I'm trying to do. Example: "write a function that does this and this in Javascript". It's essentially a replacement of stack overflow.

I never use it for something that really requires knowledge of the code base, so the quality of the code base doesn't really matter. Also, I don't think it has ever provided me something I wouldn't have been able to do myself pretty quickly.

58. hambandit ◴[] No.42141805[source]
I find this perspective both scary and exciting. I'm curious, how do you validate the LLM's output? If you have a way to do this, and it's working, then that's amazing. If you don't, how are you gauging "work best"?
replies(1): >>42150241 #
59. PeterisP ◴[] No.42141919{3}[source]
I'm definitely assuming that they don't limit their training data to what is open source and crawlable.
60. fny ◴[] No.42142239[source]
> They work best where we need them the least.

Au contraire. I hate writing boilerplate. I hate digging through APIs. I hate typing the same damn thing over and over again.

The easy stuff is mind numbing. The hard stuff is fun.

replies(1): >>42142519 #
61. kemiller ◴[] No.42142373[source]
This is true, but I look at it differently. It makes it easier to automate the boring or annoying. Gotta throw up an admin interface? Need to write more unit tests? Need a one-off but complicated SQL query? They tend to excel at these things, and it makes me more likely to do them, while keeping my best attention for the things that really need me.
62. MrMcCall ◴[] No.42142422{4}[source]
Hard work is no secret, it's just avoided by slackers at all costs :-)

What I'm really saying is that our software development software is missing a very important dimension.

63. skydhash ◴[] No.42142519[source]
You write these once (or zero times) by using a scaffolding template, a generator, or snippets.
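To make that concrete, here is a minimal sketch of the "generator" option, with invented entity names and a made-up db.query() style; the boilerplate is written once as a template and stamped out per entity:

  from string import Template

  # Boilerplate written once; each entity gets its accessors stamped out below.
  CRUD_TEMPLATE = Template('''\
  def get_$name(db, ${name}_id):
      return db.query($entity).get(${name}_id)

  def delete_$name(db, ${name}_id):
      db.query($entity).filter($entity.id == ${name}_id).delete()
  ''')

  for entity in ("User", "Invoice", "Report"):
      print(CRUD_TEMPLATE.substitute(entity=entity, name=entity.lower()))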
replies(1): >>42143658 #
64. skydhash ◴[] No.42142588{4}[source]
Simplicity is hard. And difficulty is what almost everyone using LLMs is trying to avoid. More code breeds complexity.

I read somewhere that 1/6 of the time should be allocated to refactoring (every 6th cycle). I wonder how that should be done with LLMs.

replies(1): >>42143084 #
65. xnx ◴[] No.42142633{3}[source]
LLMs might be a good argument for documenting more of the "why" in code comments.
66. perrygeo ◴[] No.42142868[source]
I often think of LLMs as a really smart junior developer - full of answers, half correct, with zero wisdom but 100% confidence

I'd like to think most developers know how to say "I don't know, let's do some research" but in reality, many probably just take a similar approach to the LLM - feign competence and just hack out whatever is needed for today's goal, don't worry about tomorrow.

replies(1): >>42151916 #
67. valenterry ◴[] No.42143084{5}[source]
Exactly that. LLMs generate a lot of simple and dumb code fast. Then you need to refactor it and you can't because LLMs are still very bad at that. They can only refactor locally with a very limited scope, not globally.

Good luck to anyone having to maintain legacy LLM-generated codebases in the future, I won't.

replies(1): >>42146835 #
68. valenterry ◴[] No.42143100{3}[source]
> The idea that LLMs are going to advance software in any substantial way seems implausible to me

I disagree. They won't do that for existing developers. But they will make it so that tech-savvy people will be able to do much more. And they might even make it so that one-off customization per person will become feasible.

Imagine you want to sort Hacker News comments by number of characters, inline in your browser. Tell the AI to add this feature and maybe it will work (just for you). That's one way I can see substantial changes happening in the future.

69. eru ◴[] No.42143160{4}[source]
Perhaps something like Cobol? (Shudder.)
70. acrooks ◴[] No.42143531[source]
Yes this is the same for me. I’ve shifted my programming style so now I just write function signatures and let the AI do the rest for me. It has been a dream and works consistently well.

I’ll also often add hints at the top of the file in the form of comments or sample data to help keep it on the right track.
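As a rough illustration of that style (the domain, sample row, and names here are invented), the top of such a file might look like:

  # Hint for the model: rows come from a CSV export of orders; dates are ISO-8601 strings.
  SAMPLE_ROW = {"order_id": "A-1001", "created_at": "2024-05-02", "total_cents": 1999}

  def parse_created_month(row: dict) -> str:
      """Return the 'YYYY-MM' month of row['created_at']."""
      ...  # left for the AI to fill in

  def monthly_revenue(rows: list[dict]) -> dict[str, int]:
      """Sum total_cents per month, keyed by 'YYYY-MM'."""
      ...  # left for the AI to fill in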

replies(1): >>42145308 #
71. rr808 ◴[] No.42143641{3}[source]
I struggle to get GitHub Copilot to create any unit tests that provide any value. How do you get it to create really useful tests?
replies(2): >>42144840 #>>42144903 #
72. fendy3002 ◴[] No.42143658{3}[source]
And now LLMs write these for me; it's relaxing.
73. LargeWu ◴[] No.42143688[source]
One description of the class of problems LLMs are a good fit for is anything at which you could throw an army of interns. And this seems consistent with that.
74. antonvs ◴[] No.42143791[source]
> They work best where we need them the least.

I disagree, but it’s largely a matter of expectations. I don’t expect them to solve hard problems for me. That’s currently still my job. But when I’m writing new code, even for a legacy system, they can save a lot of time in getting the initial coding done, helping write comments, unit tests, and so on.

It’s not doing difficult work, but it saves a lot of toil.

75. nox101 ◴[] No.42143847[source]
Can you give some examples? What LLM? What code? What tests?

As a test I just asked "ChatGPT 4o with canvas" to "Can you write a set of tests to test glBufferData and all of its edge cases?"

glBufferData is a 32-year-old API, so there are clearly plenty of examples for it to have looked at. There are even multiple public tests for it, including the official tests, which are open source and so easily scannable. It failed.

It wrote 8 tests; 7 of those tests were wrong in that they did something wrong intentionally and then asserted they got no error. It wasn't close to comprehensive. It didn't test that the function actually put data in the buffer, for example, nor did it check the set of valid enums to see that they work. Nor did it check that the target parameter actually works and affects the correct buffer bound to that target.

This is my experience with LLMs for code so far. I do get answers quicker from LLMs sometimes for tech questions vs searching via Google and reading Stack Overflow. But that's only sometimes. As a recent example, I was trying to add TypeScript types to some JavaScript and it failed. I went round and round telling it it failed, but it got stuck in a loop and just kept saying "Oh, sorry. How about this", followed by a repeat of the previous code.

replies(2): >>42144893 #>>42145945 #
76. lmm ◴[] No.42144194{4}[source]
A pattern is a structured way of working around a language deficiency. Good code does not need patterns or architecture, it expresses the essence of the actual business problem and no more. Such software is also significantly easier to maintain if you measure maintainability against how much functionality the software implements rather than how many lines of code it is. Unfortunately the latter is very common, and there is probably a bright future in using LLMs to edit masses of LLM-copy-pasted code as a poor man's substitute for doing it right.
77. BillyTheKing ◴[] No.42144840{4}[source]
Would recommend trying out Anthropic's Sonnet 3.5 for this one - it usually generates decent unit tests for reasonably sized functions.
78. wruza ◴[] No.42144893{3}[source]
Wait, wait. You ought to write tests for javascript react html form validation boilerplate. Not that.

/s aside, it’s what we all experience too. There’s a great divide between programming pre-around-2015 and thereafter. LLMs can only do recent programming, which is a product of tons of money getting loaded into the industry and creating jobs that made no sense ten years ago. Basically, the more repetitive boilerplate patterns configuration options import blocks row-obj-dto-obj conversion typecheck bullshit you write per day, the more LLMs help. I mean, one could abstract all that away using regular programming, but how would they sell their work for $^6 an AI for $^9 then?

Just yesterday, after reading yet another “oh you must try again” comment, I asked 4o how to stop Puppeteer from dumping errors into the console and exit gracefully when I close the headful browser (all logs and code provided). Right away it slid into nonsense. I always finish my chats with what I think about it, uncut, just in case someone uses these for further learning.

79. MarcelOlsz ◴[] No.42144903{4}[source]
I use claude-3-5-sonnet-20241022 with a very explicit .cursorrules file with the cursor editor.
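To give a flavor (this is a made-up example, not the actual file): a .cursorrules file is just plain-text instructions that Cursor includes with its prompts, so a test-focused one might contain something like:

  This project uses TypeScript with Jest for tests.
  When asked to write tests:
  - Put them next to the source file as <name>.test.ts.
  - Cover the happy path plus empty, null, and error inputs.
  - Do not modify source files; only add or edit test files.
  - Prefer explicit assertions over snapshot tests.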
replies(1): >>42145845 #
80. eesmith ◴[] No.42145308{3}[source]
Here's one I wrote the other day which took a long time to get right. I'm curious on how well your AI can do, since I can't imagine it does a good job at it.

  # Given a data set of size `size` >= 0, and a `text` string describing
  # the subset size, return a 2-element tuple containing a text string
  # describing the complement size and the actual size as an integer. The
  # text string can be in one of four forms (after stripping leading and
  # trailing whitespace):
  #
  #  1) the empty string, in which case return ("", 0)
  #  2) a stringified integer, like "123", where 0 <= n <= size, in
  #   which case return (str(size-int(n)), size-int(n))
  #  3) a stringified decimal value like "0.25" where 0 <= x <= 1.0, in
  #   which case compute the complement string as str(1 - x) and
  #   the complement size as size - (int(x * size)). Exponential
  #   notation is not supported, only numbers like "3.0", ".4", and "3.14"
  #  4) a stringified fraction value like "1/3", where 0 <= x <= 1,
  #   in which case compute the complement string and value as #3
  #   but using a fraction instead of a decimal. Note that "1/2" of
  #   51 must return ("1/2", 26), not ("1/2", 25).
  #
  # Otherwise, return ("error", -1)

  def get_complement(text: str, size: int) -> tuple[str, int]:
    ...

For examples:

  get_complement("1/2", 100) == ("1/2", 50)
  get_complement("0.6", 100) == ("0.4", 40)
  get_complement("100", 100) == ("0", 0)
  get_complement("0/1", 100) == ("1/1", 100)
Some of the harder test cases I came up were:

get_complement("0.8158557553804697", 448_525_430): this tests the underlying system uses decimal.Decimal rather than a float, because float64 ends up on a 0.5 boundary and applies round-half-even resulting in a different value than the true decimal calculation, which does not end up with a 0.5. (The value is "365932053.4999999857944710")

get_complement("nan", 100): this is a valid decimal.Decimal but not allowed by the spec.

get_complement("1/0", 100): handle division-by-zero in fractions.Fraction

get_complement("0.", 100): this tests that the string complement is "1." or "1.0" and not "1"

get_complement("0.999999999999999", 100): this tests the complement is "0.000000000000001" and not "1E-15".

get_complement("0.5E0", 100): test that decimal parsing isn't simply done by decimal.Decimal(size) wrapped in an exception handler.

Also, this isn't the full spec. The real code reports parse errors (like recognizing that "1/" is an incomplete fraction), and if the value is out of range it uses the range boundary (so "-0.4" for input is treated as "0.0" and the complement is "1.0"), along with an error flag so the GUI can display the error message appropriately.
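To give a sense of where the formatting subtleties live, here is a rough sketch of just the fraction branch (not the real code, and it collapses all error reporting to ("error", -1); the decimal branch is fussier still, since str() on a Decimal such as 0.000000000000001 falls back to exponent notation and needs explicit formatting):

  from fractions import Fraction

  def fraction_complement(text: str, size: int) -> tuple[str, int]:
      # Sketch of the fraction case only; parse and range errors collapse to ("error", -1).
      num_str, _, den_str = text.partition("/")
      try:
          frac = Fraction(int(num_str), int(den_str))
      except (ValueError, ZeroDivisionError):
          return ("error", -1)
      if not 0 <= frac <= 1:
          return ("error", -1)
      comp = 1 - frac
      # Format by hand so "0/1" complements to "1/1"; str(Fraction(1, 1)) would give "1".
      comp_text = f"{comp.numerator}/{comp.denominator}"
      # int() truncates the subset size, so "1/2" of 51 gives 51 - 25 == 26.
      return (comp_text, size - int(frac * size))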

replies(1): >>42145682 #
81. acrooks ◴[] No.42145682{4}[source]
I suspect certain domains have higher performance than others. My normal use cases involve API calls, database calls, and data transformation, and AI fairly consistently does what I want. But in that space there are very repeatable patterns.

Also, with your example above, I probably would break the function down into smaller parts, for two reasons: 1) you can more easily unit test the components; 2) generally I find AI performs better with more focused problems.

So I would probably first write a signature like this:

  # input examples = "1/2" "100" "0.6" "0.99999" "0.5E0" "nan"
  def string_ratio_to_decimal(text: str) -> number
Pasting that into Claude, without any other context, produces this result: https://claude.site/artifacts/58f1af0e-fe5b-4e72-89ba-aeebad...
replies(1): >>42146546 #
82. ponector ◴[] No.42145845{5}[source]
Can you share your .cursorrules? For me cursor is not much better than autocomplete, but I'm writing mostly e2e tests.
replies(1): >>42146031 #
83. sarchertech ◴[] No.42145900{5}[source]
>It's a huge motivator actually to write a piece of code with the reward being the ability to send it to the LLM to create some tests and then seeing a nice stream of green checkmarks.

Yeah that’s not TDD.

replies(1): >>42146001 #
84. Aeolun ◴[] No.42145945{3}[source]
If you asked me to write tests with such a vague definition I’d also have issues writing them though. It’ll work a lot better if you tell it what you want it to validate I think.
85. MarcelOlsz ◴[] No.42146001{6}[source]
Don't you have a book to get to writing instead of leaving useless comments? Haha.
replies(1): >>42146183 #
86. MarcelOlsz ◴[] No.42146031{6}[source]
You can find a bunch on https://cursor.directory/.
87. sarchertech ◴[] No.42146183{7}[source]
More importantly I have a French cleat wall to finish, a Christmas present to make for my wife, and a toddler and infant to keep from killing themselves.

But I also have a day job and I can’t even begin to imagine how much extra work someone doing “TDD” by writing a function and then fixing it in place with a whole suite of generated tests would cause me.

I’m fine with TDD. I do it myself fairly often. I also go back in and delete the tests that I used to build it that aren’t actually going to be useful a year from now.

replies(1): >>42146498 #
88. MarcelOlsz ◴[] No.42146498{8}[source]
Like I said above, I like the ability to scaffold tests using english and tweaking from there. I'm still not sure what point you're trying to make.
replies(1): >>42148139 #
89. eesmith ◴[] No.42146546{5}[source]
> I probably would break the function down into smaller parts

Sure. Internally I have multiple functions. Though I don't like unit testing below the public API as it inhibits refactoring and gives false coverage feedback, so all my tests go through the main API.

> Pasting that into Claude, without any other context

The context is the important part. Like the context which says "0.5E0" and "nan" are specifically not supported, and how the calculations need to use decimal arithmetic, not IEEE 754 float64.

Also, the hard part is generating the complement with correct formatting, not parsing float-or-fraction, which is a first-year CS assignment.

> # Handle special values

Python and C accept "Infinity" as an alternative to "Inf". The correct way is to defer to the underlying system then check if the returned value is infinite or a NaN. Which is what will happen here because when those string checks fail, and the check for "/" fails, it will correctly process through float().

Yes, this section isn't needed.

> # Handle empty string

My spec says the empty string is not an error.

> numerator, denominator = text.split("/"); num = float(numerator); den = float(denominator)

This allows "1.2/3.4" and "inf/nan", which were not in the input examples and therefore support for them should be interpreted as accidental scope creep.

They were also not part of the test suite, which means the tests cannot distinguish between these two clearly different implementations:

  num = float(numerator)
  den = float(denominator)
and:

  num = int(numerator)
  den = int(denominator)
Here's a version which follows the same style as the linked-to code, but is easier to understand:

    if not isinstance(text, str):
        return None
    
    # Remove whitespace
    text = text.strip()
    
    # Handle empty string
    if not text:
        return None

    # Handle ratio format (e.g., "1/2")
    if "/" in text:
        try:
            numerator, denominator = text.split("/")
            num = int(numerator)
            den = int(denominator)
            if den == 0:
                return float("inf") if num > 0 else float("-inf") if num < 0 else float("nan")
            return num / den
        except ValueError:
            return None

    # Handle regular numbers (inf, nan, scientific notation, etc.)
    try:
        return float(text)
    except ValueError:
        return None
It still doesn't come anywhere near handling the actual problem spec I gave.
90. dartos ◴[] No.42146828{4}[source]
Assuming both mediums are reasonably well represented in the dataset, which brings me back to my comment
91. dartos ◴[] No.42146835{6}[source]
I’ve noticed LLMs quickly turn to pulling in dependencies and making complicated code
replies(1): >>42151443 #
92. sarchertech ◴[] No.42148139{9}[source]
Your original point was that it was great to “write some code then send it to the LLM to create tests.”

That’s not test driven development.

replies(1): >>42151036 #
93. lubujackson ◴[] No.42148341{6}[source]
Seconding Cursor. I have a friend who used Copilot 6 mo. ago and found it vaguely helpful... but I turned him on to Cursor and it's a whole new ballgame.

Cross between actually useful autocomplete, personalized StackOverflow, and error diagnosis (just paste an error message in chat). I know I am just scratching the surface of its usefulness and I pretty much never do changes across multiple files, but I definitely see firm net positives at this point.

94. glouwbug ◴[] No.42150194{3}[source]
Exactly
95. glouwbug ◴[] No.42150241{3}[source]
I gauge what works best by whether I can already do what I am asking it to do, and that comes from years of studying and trial-and-error experience without LLMs. I have no way of verifying what's a hallucination unless I am an expert.
96. MarcelOlsz ◴[] No.42151036{10}[source]
Sure if you want to take the absolute least charitable interpretation of what I said lol.
replies(1): >>42161327 #
97. hunterbrooks ◴[] No.42151146[source]
LLMs get relatively better at read-heavy operations (ex: code review) than write-heavy operations (ex: code generation) as codebases become less idiomatic.

I'm a cofounder at www.ellipsis.dev - we tried to build code generation for a LONG time before we realized that AI Code Review is way more doable with SOTA

98. skydhash ◴[] No.42151443{7}[source]
I'm sure they do great for scripts and other stuff. But the few times I tried, they always go for the most complicated solutions. I prefer my scripts to grow organically. Why automate something if I don't even know how it's done in the first place? (Unless someone else is maintaining the solution)
99. namaria ◴[] No.42151916{3}[source]
Nah LLMs are nothing like really smart junior developers.

Really smart junior developers actually have a shot at learning better and moving on from this stage.

100. sarchertech ◴[] No.42161327{11}[source]
“write a piece of code with the reward being the ability to send it to the LLM to create some tests and then seeing a nice stream of green checkmarks”

You write code, then you send the code to the LLM to create tests for you.

How can this possibly be interpreted to mean the reverse?

That you write tests first by asking the LLM in English to help you without “sending the code” you wrote because you haven’t written it yet. Then you use those tests to help you write the code.

Now if you misspoke then my comment isn’t relevant to your situation, but don’t pretend that I somehow interpreted what you said uncharitably. There’s no other way to interpret it.

replies(1): >>42165498 #
101. MarcelOlsz ◴[] No.42165498{12}[source]
Ok you win.
replies(1): >>42165933 #
102. sarchertech ◴[] No.42165933{13}[source]
Thanks
replies(2): >>42166456 #>>42166489 #
103. ◴[] No.42166456{14}[source]
104. ◴[] No.42166489{14}[source]
105. dang ◴[] No.42167974{4}[source]
"Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith."

https://news.ycombinator.com/newsguidelines.html