    Using LLMs at Oxide

    (rfd.shared.oxide.computer)
    694 points by steveklabnik | 27 comments
    1. mcqueenjordan ◴[] No.46178624[source]
    As usual with Oxide's RFDs, I found myself vigorously head-nodding while reading. More rarely for me, there was one part I found myself disagreeing with:

    > Unlike prose, however (which really should be handed in a polished form to an LLM to maximize the LLM’s efficacy), LLMs can be quite effective writing code de novo.

    Don't the same arguments against using LLMs to write one's prose also apply to code? Was the structure of the code, and the ideas within it, the engineer's? Or was it the LLM's? And so on.

    Before I'm misunderstood as an LLM minimalist, I want to say that I think they're incredibly good at solving blank-page syndrome -- just getting a starting point on the page is useful. But I think the code you actually want to ship is so far from what LLMs write that I think of them more as a crutch for blank-page syndrome than as "good at writing code de novo".

    I'm open to being wrong and want to hear any discussion on the matter. My worry is that this is another one of the "illusion of progress" traps, similar to the one that currently fools people with the prose side of things.

    replies(9): >>46178640 #>>46178642 #>>46178818 #>>46179080 #>>46179150 #>>46179217 #>>46179552 #>>46180049 #>>46180734 #
    2. dcre ◴[] No.46178640[source]
    In my experience, LLMs have been quite capable of producing code I am satisfied with (though of course it depends on the context — I have much lower standards for one-off tools than long-lived apps). They are able to follow conventions already present in a codebase and produce something passable. Whereas with writing prose, I am almost never happy with the feel of what an LLM produces (worth noting that Sonnet and Opus 4.5’s prose may be moving up from disgusting to tolerable). I think of it as prose being higher-dimensional — for a given goal, often the way to express it in code is pretty obvious, and many developers would do essentially the same thing. Not so for prose.
    3. lukasb ◴[] No.46178642[source]
    One difference is that clichéd prose is bad and clichéd code is generally good.
    replies(1): >>46178649 #
    4. joshka ◴[] No.46178649[source]
    Depends on what your prose is for. If it's for documentation, then prose that matches the expected tone and form of other similar docs would count as clichéd from this perspective. I think this is a really good use of LLMs - making docs consistent across a large library / codebase.
    replies(3): >>46178656 #>>46178665 #>>46178668 #
    5. minimaxir ◴[] No.46178656{3}[source]
    I have been testing agentic coding with Claude 4.5 Opus, and the problem is that it's too good at documentation and test cases. It's thorough to the point of going out of scope, so I have to edit its output down to improve the signal-to-noise ratio.
    replies(2): >>46178944 #>>46179605 #
    6. dcre ◴[] No.46178665{3}[source]
    Docs also often don’t have anyone’s name on them, in which case they’re already attributed to an unknown composite author.
    7. danenania ◴[] No.46178668{3}[source]
    A problem I've found with LLMs for docs is that they are like ten times too wordy. They want to document every path and edge case rather than focusing on what really matters.

    It can be addressed with prompting, but you have to fight this constantly.

    replies(2): >>46178826 #>>46181865 #
    8. averynicepen ◴[] No.46178818[source]
    Writing is an expression of an individual, while code is a tool used to solve a problem or achieve a purpose.

    The more examples an LLM's dataset contains of different types of problems being solved in similar ways, the better it gets at solving problems. Generally speaking, if a solution works well, it gets used a lot, so "good solutions" become well represented in the dataset.

    Human expression, however, is diverse by definition. The expression of the human experience is the expression of a data point on a statistical field with standard deviations the size of chasms. An expression of the mean (which is what an LLM does) goes against why we care about human expression in the first place. "Interesting" is a value closely paired with "different".

    We value diversity of thought in expression, but we value efficiency of problem solving for code.

    There is definitely an argument to be made that LLM usage fundamentally restrains an individual from solving unsolved problems. It also leaves open the question of "where do we get more data from?"

    >the code you actually want to ship is so far from what LLMs write

    I think this is a fairly common consensus, and my understanding is that the reason for it is the limited context window.

    replies(2): >>46178903 #>>46182453 #
    9. bigiain ◴[] No.46178826{4}[source]
    I think probably my most common prompt is "Make it shorter. No more than ($x) (words|sentences|paragraphs)."
    replies(1): >>46181980 #
    10. twodave ◴[] No.46178903[source]
    I argue that the intent of an engineer is contained coherently across the code of a project. I have yet to get an LLM to pick up on the deeper idioms present in a codebase that help constrain the overall solution towards these more particular patterns. I’m not talking about syntax or style, either. I’m talking about e.g. semantic connections within an object graph, understanding what sort of things belong in the data layer based on how it is intended to be read/written, etc. Even when I point it at a file and say, “Use the patterns you see there, with these small differences and a different target type,” I find that LLMs struggle. Until they can clear that hurdle without requiring me to restructure my entire engineering org they will remain as fancy code completion suggestions, hobby project accelerators, and not much else.
    11. girvo ◴[] No.46178944{4}[source]
    The “change capture”/straitjacket-style tests LLMs like to output drive me nuts. But humans write those all the time too, so I shouldn't be that surprised either!
    replies(1): >>46180074 #
    12. themk ◴[] No.46179080[source]
    I recently published an internal memo which covered the same point, but I included code. I feel like you still have a "voice" in code, and it provides important cues to the reviewer. I also consider review to be an important learning and collaboration moment, which becomes difficult with LLM code.
    13. AlexCoventry ◴[] No.46179150[source]
    > I think that the code you actually want to ship is so far from what LLMs write

    It depends on the LLM, I think. A lot of people have a bad impression of them as a result of using cheap or outdated LLMs.

    14. mcqueenjordan ◴[] No.46179217[source]
    I guess to follow up slightly more:

    - I think the "if you use another model" rebuttal is becoming like the No True Scotsman of the LLM world. We can get concrete and discuss a specific model if need be.

    - If the use case is "generate this function body for me", I agree that's a pretty good use case. I've specifically seen problematic behavior in the other ways I often see it used: "write this feature for me", or trying to one-shot too much functionality, where the LLM gets to touch data structures, abstractions, interface boundaries, etc.

    - To analogize it to writing: They shouldn't/cannot write the whole book, they shouldn't/cannot write the table of contents, they cannot write a chapter, IMO even a paragraph is too much -- but if you write the first sentence and the last sentence of a paragraph, I think the interpolation can be a pretty reasonable starting point. Bringing it back to code for me means: function bodies are OK. Everything else gets questionable fast IME.

    15. IgorPartola ◴[] No.46179552[source]
    My suspicion is that this is a form of the paradox where you can recognize that the news being reported is wrong when it's on a subject in which you are an expert, but then you move on to the next article on a different subject and your trust resumes.

    Basically if you are a software engineer you can very easily judge quality of code. But if you aren’t a writer then maybe it is hard for you to judge the quality of a piece of prose.

    replies(1): >>46192792 #
    16. diamond559 ◴[] No.46179605{4}[source]
    If the goal is to document the code and it gets sidetracked and focuses on only certain parts, it failed the test. It just further proves LLMs are incapable of grasping meaning and context.
    17. make_it_sure ◴[] No.46180049[source]
    Try Opus 4.5; you'll be surprised. That might have been true for past versions of LLMs, but they've advanced a lot.
    18. mulmboy ◴[] No.46180074{5}[source]
    What do these look like?
    replies(1): >>46180900 #
    19. cheeseface ◴[] No.46180734[source]
    There are cases where I would start the coding process by copy-pasting existing code (e.g. test suites, new screens in the UI), and this is where LLMs work especially well and produce code that is, the majority of the time, production-ready as-is.

    A common prompt I use is approximately "Write tests for file X, look at Y for how to set up mocks."

    This is probably not "de novo", and in terms of writing it is maybe closer to something like updating a case-study PowerPoint with the current customer's data.

    20. pmg101 ◴[] No.46180900{6}[source]

      1. Take every single function, even private ones.
      2. Mock every argument and collaborator.
      3. Call the function.
      4. Assert the mocks were called in the expected way.
    
    These tests help you find inadvertent changes, yes, but they also create constant noise about changes you intend.
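
    For concreteness, a rough sketch of what one of these tends to look like (pytest-style Python; OrderService and its collaborators are invented purely for illustration):

      from unittest.mock import MagicMock

      class OrderService:
          def __init__(self, repo, notifier):
              self.repo = repo
              self.notifier = notifier

          def sync_orders(self, account_id):
              orders = self.repo.fetch_orders(account_id)
              self.notifier.send(f"synced {len(orders)} orders")

      def test_sync_orders_calls_its_collaborators():
          # Steps 1-2: take the function, mock every collaborator.
          repo, notifier = MagicMock(), MagicMock()
          # Step 3: call the function.
          OrderService(repo, notifier).sync_orders(42)
          # Step 4: assert the mocks were called in the expected way.
          repo.fetch_orders.assert_called_once_with(42)
          notifier.send.assert_called_once()

    Rename fetch_orders or batch up the notifications and this test fails, even though the observable behaviour hasn't changed.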
    replies(3): >>46182913 #>>46183812 #>>46185551 #
    21. pxc ◴[] No.46181865{4}[source]
    > A problem I’ve found with LLMs for docs is that they are like ten times too wordy

    This is one of the problems I have with LLM-generated code as well. It's almost always between 5x and 20x (!) as long as it needs to be. Though in the case of code verbosity, it's usually not thoroughness so much as extremely bad style.

    22. pxc ◴[] No.46181980{5}[source]
    I've never been able to get that to work. LLMs can't count; they don't actually know how long their output is.
    23. mac-attack ◴[] No.46182453[source]
    Very well stated.
    24. ornornor ◴[] No.46182913{7}[source]
    Juniors on one of the teams I work with only write this kind of test. It's tiring, and I have to tell them to test the behaviour, not the implementation. And yet every time they do the same thing. Or rather, their AI IDE spits these out.
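
    A behaviour-level version of the hypothetical OrderService test sketched a few comments up (again, invented names; a sketch, not a prescription) would use simple fakes and assert on the observable result rather than on which calls were made:

      class InMemoryRepo:
          def __init__(self, orders_by_account):
              self.orders_by_account = orders_by_account

          def fetch_orders(self, account_id):
              return self.orders_by_account.get(account_id, [])

      class ListNotifier:
          def __init__(self):
              self.sent = []

          def send(self, message):
              self.sent.append(message)

      def test_sync_orders_reports_how_many_were_synced():
          notifier = ListNotifier()
          service = OrderService(InMemoryRepo({42: ["order-1", "order-2"]}), notifier)

          service.sync_orders(42)

          # Survives internal refactors as long as the observable behaviour holds.
          assert notifier.sent == ["synced 2 orders"]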
    25. senbrow ◴[] No.46183812{7}[source]
    These tests also break encapsulation in many cases because they're not testing the interface contract, they're testing the implementation.
    26. girvo ◴[] No.46185551{7}[source]
    You beat me to it, and yep these are exactly it.

    “Mock the world, then test your mocks.” After nearly two decades of doing this professionally, I’m simply not convinced these have any value at all.

    27. knollimar ◴[] No.46192792[source]
    Gell-Mann amnesia