
117 points soraminazuki | 3 comments
AbbeFaria No.45080623
I work at MSFT. There's top-down pressure to use LLMs everywhere. At this point, if you can convince your management to use LLMs anywhere, they'll happily nod along and let you go do it. And management themselves aren't that technical wrt LLMs; they're being fed the same AI hype slop that we are.

Most of these efforts have questionable returns. Most projects involve increasing test coverage or categorising customer incidents for better triage; apart from these low-hanging fruits, not much comes out of it.

People still play the visibility game though. Hey, look at what we did using LLMs. That's so cool, now where's my promotion? Business-outcomes-wise, some low-hanging fruit has been plucked, but otherwise it doesn't live up to the hype.

Personally, I find it helpful in a few scenarios:

1. Much better search interface than traditional search engines. If I want to ramp up on some new technology or product, it gives me a good broad overview and references to dive deeper into. No more 10 blue links.

2. Better autocomplete than before, but it's still not as groundbreaking as the AI hype hucksters make it out to be.

3. If I want to learn some concept (say, how the ext4 filesystem works), it can give a good breakdown of the high-level ideas; then I go study and come back with more questions. This is the only genuine use case that I really like: I can iteratively ask questions to clarify and cement my understanding of a concept. I've used Claude Code and ChatGPT for this and can barely see any difference between the two.

This is my balanced take.

replies(1): >>45080815 #
1. bonzini No.45080815
I have a similar mandate and a similar take, but slightly different use cases.

As for the search engine, my searches are often very narrow (say, recalling a specific message from a mailing list), so I don't use that too much. On the other hand, I found Google's NotebookLM to be really good at recalling concepts from both source code and manuals (e.g. processor manuals in my case).

Code generators are incredible refactoring machines. In one case (not easy to reproduce in general, but it did work) Claude Code converted Python to decently idiomatic Rust in a matter of minutes; it added mypy annotations to 2000 lines of Python code (with about 90% accuracy) in half an hour and, with my assistance, got the entire job done in about an hour. For actual writing and debugging where the logic matters, they're still not there, even for small code bases (again in the 2000-lines ballpark). They're relatively good at writing and debugging test cases, but IMO that's also where there's a risk of copyright taint. Anyhow, it's something I'd use maybe 2-3 times a month.
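
To give a feel for what that annotation pass did, here's a made-up toy example in the same spirit (my own invention, not code from the actual project):

    # before: untyped Python
    def merge(a, b):
        out = dict(a)
        out.update(b)
        return out

    # after the pass: mypy annotations added
    from typing import Mapping

    def merge(a: Mapping[str, int], b: Mapping[str, int]) -> dict[str, int]:
        out = dict(a)
        out.update(b)
        return out

Multiply that kind of mostly mechanical change across 2000 lines and you get the half-hour figure.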

In one case I used it for natural language translation, with pretty good results, though I knew both languages because I needed to check the result. A tip: ask it first to develop a glossary, and then to translate.

For studying they're interesting too, though so far I've mostly tried that outside work. At work, Google Deep Research worked well relative to the time it takes, and it's able to find a variety of sources (including HackerNews comments in one case :)), which is useful for cross-checking.

replies(1): >>45082739 #
2. Neywiny No.45082739
So what does 90% accuracy mean here? Is this like you ran it through a linter or language server and 90% of the annotations came out clean? Or did it just feel that accurate from a quick glance?

I've found incorrect type hints to be one of the biggest issues when trying to use Python type-safely. Mostly (entirely?) with packages that get their own hints wrong, meaning methods aren't shown as existing on the class, or the returned instance isn't the class the hint said it would be.
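
A hypothetical sketch of the shape of the problem (no real package, names are made up):

    class Cursor:
        def execute(self, sql: str) -> None: ...

    class Connection:
        # wrong hint: claims to return Connection, actually returns Cursor
        def cursor(self) -> "Connection":
            return Cursor()  # type: ignore[return-value]

    cur = Connection().cursor()
    cur.execute("SELECT 1")  # runs fine, but mypy reports:
    # error: "Connection" has no attribute "execute"

Once the library's own hint lies, every caller inherits the lie.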

replies(1): >>45086691 #
3. bonzini No.45086691
90% accuracy was just my own hunch, but it's true that out of 2000 lines roughly 200 were changed, and there were roughly 20 mypy errors left after the first fully automatic pass.

Most of these were due to code that used duck typing; it turned out that this was dead code, so it's understandable that the LLM got confused. Once the dead code was removed, the remaining 5 or so issues were annotations that were too loose in either variable or argument declarations. Doing it by hand would have taken at least a couple of hours.
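
To illustrate both failure modes with an invented sketch (the names and code are mine, not the real project):

    # duck-typed dead code: nothing pins down `item`, so an annotator
    # can only guess at a protocol that may exist nowhere in the tree
    def legacy_render(item):          # never called anywhere
        return item.as_html()

    # a too-loose annotation left behind, and the tighter fix
    def shout(text: object) -> str:
        return text.upper()           # mypy: "object" has no attribute "upper"

    def shout_fixed(text: str) -> str:
        return text.upper()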