302 points mastermaq | 42 comments
hbn ◴[] No.44370715[source]
My work gives us Copilot access since We're a Microsoft Shop™

It's too bad Copilot is by far the dumbest competitor in the space

My favorite interaction so far was when I prompted it with:

    ffmpeg command to convert movie.mov into a reasonably sized mp4
Sure, it's not the most precise instruction, but I tend to give it just enough to get the job done, assuming the LLM knows what its purpose is as an LLM, and it always works with the other chatbots.

Copilot's response:

    I implemented and executed the Python code above to convert movie.mov to a reasonably sized movie.mp4 using ffmpeg.
    
    However, the Python code failed since it was not able to find and access movie.mov file.
    Do you want me to try again or is there anything else that I can help you with?
Note that I didn't cut anything out. It didn't actually provide me any "Python code above".
replies(22): >>44370829 #>>44371002 #>>44371022 #>>44371053 #>>44371065 #>>44371287 #>>44371335 #>>44371358 #>>44371628 #>>44371891 #>>44371914 #>>44371978 #>>44372301 #>>44372892 #>>44373260 #>>44373493 #>>44373864 #>>44374419 #>>44374747 #>>44376761 #>>44377612 #>>44379849 #
NordSteve ◴[] No.44371358[source]
I pasted your prompt:

  ffmpeg command to convert movie.mov into a reasonably sized mp4
into the Copilot app just now. Here's the response:

  ffmpeg -i movie.mov -vcodec libx264 -crf 23 -preset medium -acodec aac -b:a 128k movie_converted.mp4
with this explanation

  Here’s the breakdown:
  - -vcodec libx264: Uses the H.264 video codec, great for compression and compatibility.
  - -crf 23: Constant Rate Factor — lower means better quality and larger size; higher means smaller size but lower quality. 23 is a nice middle ground.
  - -preset medium: Controls encoding speed vs. compression efficiency. You can try slow or faster depending on your needs.
  - -acodec aac -b:a 128k: Compresses the audio using AAC at 128 kbps — pretty standard and compact.
  If you're looking for even smaller file sizes, try bumping -crf up to 26 or using -preset slow.
  Want to bake in scaling, remove metadata, or trim the video too? Happy to tailor the command further.
replies(5): >>44371528 #>>44372925 #>>44373007 #>>44374040 #>>44376405 #
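For what it's worth, the command above can be wrapped into a small helper that derives the output name and exposes CRF as a knob. This is just a sketch: the function name and defaults are illustrative, not from any standard tool.

```shell
# Sketch of a wrapper around the ffmpeg command above.
convert_to_mp4() {
  local input="$1"
  local crf="${2:-23}"             # higher CRF = smaller file, lower quality
  local output="${input%.*}.mp4"   # movie.mov -> movie.mp4
  ffmpeg -i "$input" -vcodec libx264 -crf "$crf" -preset medium \
         -acodec aac -b:a 128k "$output"
}

# Usage: convert_to_mp4 movie.mov 26   # bump CRF for a smaller file
```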
1. Aurornis ◴[] No.44373007[source]
I did the same thing for several iterations and all of the responses were equally helpful.

We get these same anecdotes about terrible AI answers frequently in a local Slack I’m in. I think people love to collect them as proof that AI is terrible and useless. Meanwhile other people have no problem hitting the retry button and getting a new answer.

Some of the common causes of bad or weird responses that I’ve learned from having this exact same conversation over and over again:

- Some people use one never-ending singular session with Copilot chat, unaware that past context is influencing the answer to their next question. This is a common way to get something like Python code in response to a command line question if you’re in a Python project or you’ve been asking Python questions.

- They have Copilot set to use a very low quality model because they accidentally changed it, or they picked a model they thought was good but is actually a low-cost model meant for light work.

- They don’t realize that Copilot supports different models and you have to go out of your way to enable the best ones.

AI discussions are weird because there are two completely different worlds of people using the same tools. Some people are so convinced the tool will be bad that they give up at the slightest inconvenience or they even revel in the bad responses as proof that AI is bad. The other world spends some time learning how to use the tools and work with a solution that doesn’t always output the right answer.

We all know AI tools are not as good as the out of control LinkedIn influencer hype, but I’m also tired of the endless claims that the tools are completely useless.

replies(9): >>44373253 #>>44373471 #>>44373558 #>>44373634 #>>44373822 #>>44375598 #>>44375719 #>>44376790 #>>44377124 #
2. JohnMakin ◴[] No.44373253[source]
What responses like this miss, I'm pretty sure, is that this is a nondeterministic machine hidden behind a complete black-box wrapper, and such machines can produce wildly different results based on context and any number of unknown independent variables. So pasting "I did the same thing and it worked fine" is essentially this argument's version of "it worked on my local." Or it boils down to "well sure, but you're just not doing it right," when the "right" way is undefined and also context-specific.
replies(1): >>44373638 #
3. jiggawatts ◴[] No.44373471[source]
Everything is like this.

I saw an IT professional google “My PC crashed” to diagnose a server bluescreen stop error.

4. sunaookami ◴[] No.44373558[source]

  Some people are so convinced the tool will be bad that they give up at the slightest inconvenience or they even revel in the bad responses as proof that AI is bad
AI derangement syndrome
5. muppetman ◴[] No.44373634[source]
The "pick your model" thing is so stupid.

"How dumb do you want your AI to be?"

"Why do I have to select?"

"Because smart costs money"

"So... I can have dumb AI but it's cheaper?"

"Yes"

"How would the average person know which to pick?"

"Oh you can't know."

I hope they can invent an AI that knows which AI model my question should target cheaply.

replies(7): >>44373707 #>>44374320 #>>44374834 #>>44376182 #>>44376980 #>>44377115 #>>44380649 #
6. josephg ◴[] No.44373638[source]
You’re both right. Some problems should be solved with better user education. And some should be solved with better UX. It’s not always clear which is which. It’s too simple to blame everything on user error, and it’s too simple to blame everything on the software.

Cell phones are full of examples. So much of this stuff is obvious now that we've been using them for a while, but it wasn't obvious when they were new. "My call dropped because I went in a tunnel" is user error. "My call cut out randomly and I had to call back" is a bug. And "my call cut out because my phone battery ran out" is somewhere in the middle.

For chatbots, lots of people don't know the rules yet. And we haven't figured out good conventions. It's not obvious that you can't just continue a long conversation forever. Or that you have to (quite consciously) pick which model you use if you want the best results. When my sister first tried ChatGPT, she asked it for YouTube video recommendations that would help when teaching a class. But none of the video links worked - they were all legitimate-looking hallucinations.

We need better UX around this stuff. But also, people do just need to learn how to use chatbots properly. Eventually everyone learns that calls will probably drop when you go into a tunnel. It’s not one or the other. It’s both.

7. reilly3000 ◴[] No.44373707[source]
I think you make a good point. Cursor is doing a basic “auto” model selection feature and it could probably get smarter, but to gauge the complexity of the response you might need to run it first. You could brute force it with telemetry and caching if you can trust the way you measure success.
8. rienbdj ◴[] No.44373822[source]
Reminds me of

I’m Feeling Lucky -> bad result -> Google search is useless

replies(2): >>44374342 #>>44375541 #
9. HappMacDonald ◴[] No.44374320[source]
You bring up the important point that for a company who earns money off of tokens wasted, a confusing selection of models can translate into extra spend to experiment with tweaking them.

Some users may not appreciate that, but many more might be drawn to the "adjust the color balance on the TV" vibes.

10. krige ◴[] No.44374342[source]
Yeah, except Feeling Lucky is the only button you can press, and people blame you if they get unlucky
11. addandsubtract ◴[] No.44374834[source]
> I hope they can invent an AI that knows which AI model my question should target cheaply.

Isn't that the idea of OpenRouter?

replies(1): >>44375337 #
12. oersted ◴[] No.44375337{3}[source]
Not exactly, but yeah. OpenRouter is a unified API, directory and billing system for LLM providers.

I think you are getting confused by the term "Model Routing", which to be fair OpenRouter does support, but it's a secondary feature and it's not their business focus. Actually OpenRouter is more focused on helping you choose the best provider for a specific open model based on their history of price, speed, reliability, privacy...

The model routing is simply provided by NotDiamond.ai, there are a number of other startups in this space.

https://openrouter.ai/docs/features/model-routing

13. darkwater ◴[] No.44375541[source]
1. I would say that nobody did that, so you are making up a straw man

2. The Copilot or ChatGPT or Claude "Ask" buttons should then be renamed to "I'm feeling lucky". And that would be the only button available.

14. msgodel ◴[] No.44375598[source]
This is part of why I really like local models. I always use the same random seed with mine so unless I'm using aider the responses are 100% deterministic. I can actually hit c-r in my shell to reproduce them without having to do anything special.
replies(1): >>44376264 #
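The seed point can be shown with a toy bash sketch: the randomness lives in the sampling step, so reseeding the generator with the same value reproduces the same draw. Here bash's RANDOM is just a stand-in for an LLM's token sampler.

```shell
# Toy illustration of seeded sampling: bash's $RANDOM stands in for an
# LLM's sampler. Same seed in, same "sample" out.
RANDOM=42          # seed the generator
first=$RANDOM      # draw a sample
RANDOM=42          # reseed with the same value
second=$RANDOM     # draw again
[ "$first" = "$second" ] && echo "same seed, same sample"
```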
15. lukan ◴[] No.44375719[source]
"I’m also tired of the endless claims that the tools are completely useless."

Who claimed that here?

I read a claim that Copilot is dumber than Claude and ChatGPT, and my experience tends to confirm it.

"They don’t realize that Copilot supports different models and you have to go out of your way to enable the best ones."

So it's possible that none of us who think that went out of our way to find out, when there were working alternatives, but it would still be on Microsoft for making it hard to make good use of their tool.

replies(1): >>44379180 #
16. prepend ◴[] No.44376182[source]
I imagine that we need a bootstrap ai to help you optimize the right ai for each task.

I don’t think I’d trust the vendor’s ai to optimize when they will likely bias toward revenue. So a good case for a local ai that only has my best interests at heart.

Currently, the guidance from vendors is “try it and see which yields the best results” which is kind of like “buy this book, read it, and see if you like it” and how of course the publisher wants you to take this action because they get their money.

17. muzani ◴[] No.44376264[source]
Some are more deterministic than others, e.g. Gemini Flash.
replies(1): >>44378446 #
18. ryanar ◴[] No.44376790[source]
The memory feature can also be a problem: it injects stuff into the prompt context that you didn't explicitly write, with the intent that it will help. It knows you are a Python programmer, so it responds with a Python script instead of the usual ffmpeg CLI command.
19. setopt ◴[] No.44376980[source]
> I hope they can invent an AI that knows which AI model my question should target cheaply.

It would be great to have a cheap AI that can self-evaluate how confident it is in its reply, and ask its expensive big brother for help automatically when it’s not.

replies(1): >>44377061 #
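A minimal sketch of that escalation idea, assuming the cheap model can self-report a confidence score. The `ask_cheap`, `cheap_confidence`, and `ask_expensive` functions are hypothetical stand-ins for real model calls, and the threshold is arbitrary.

```shell
# Hypothetical confidence-based router. The three helper functions are
# stand-ins for real API calls, not an existing library.
route_query() {
  local prompt="$1"
  local answer confidence
  answer=$(ask_cheap "$prompt")
  confidence=$(cheap_confidence "$prompt")   # self-reported, 0-100
  if [ "$confidence" -lt 70 ]; then          # arbitrary cutoff
    answer=$(ask_expensive "$prompt")        # escalate to the big model
  fi
  printf '%s\n' "$answer"
}
```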
20. taikahessu ◴[] No.44377061{3}[source]
That would actually be the AGI we are waiting for, since we humans, in a surprisingly large portion of cases, don't know how to do that, or can't seem to, either!
replies(1): >>44385297 #
21. runako ◴[] No.44377115[source]
And then the model names & descriptions are virtually useless at providing any guidance.

ChatGPT lets me choose between GPT-4o ("Great for most tasks"), o3 ("Uses advanced reasoning"), o4-mini ("Fastest at advanced reasoning"), and o4-mini-high ("Great at coding and visual reasoning").

Is what I'm doing "most tasks"? How do I know when I want "advanced reasoning"? Great, I want advanced reasoning, so I should choose the faster one with the higher version number, right? etc.

replies(3): >>44377211 #>>44378301 #>>44378654 #
22. otabdeveloper4 ◴[] No.44377124[source]
"Spin the chatroulette again and see if you vibe something better" is not a foundation for a business.

Well, unless your business is selling vibes.

23. dataflow ◴[] No.44377211{3}[source]
Can't you just run a few examples by hand to see how they perform for your tasks, before committing to any for production?
replies(1): >>44377896 #
24. runako ◴[] No.44377896{4}[source]
> before committing to any for production

I'm talking about ChatGPT, which is a Web and desktop app where users run interactive sessions. What does "production" mean in this sense?

replies(2): >>44378495 #>>44379891 #
25. antonkochubey ◴[] No.44378301{3}[source]
Then there's GPT-4.5 which is "Good for writing and exploring ideas" (are the other models bad for this?), and GPT-4.1 which is "Great for quick coding and analysis" (is a model which "uses advanced reasoning" not great for these things?)
26. msgodel ◴[] No.44378446{3}[source]
The non-determinism comes from the sampler not the model.
replies(1): >>44383762 #
27. dataflow ◴[] No.44378495{5}[source]
I think I misunderstood what people were talking about. Somehow I thought it was about their APIs, for specific uses in other apps.
replies(1): >>44379178 #
28. tzs ◴[] No.44378654{3}[source]
Can you describe your task and then ask ChatGPT which model you should use?
replies(1): >>44379154 #
29. runako ◴[] No.44379154{4}[source]
This presents the same problem, since none of the models are indicated to be best at choosing the model to use for a task.
replies(1): >>44379880 #
30. runako ◴[] No.44379178{6}[source]
To their credit, they did get this part correct. "ChatGPT" is the user-facing apps. The models have terrible names that do not include "ChatGPT".

Anthropic, by contrast, uses the same name for the user-facing app and the models. This is confusing, because the user-facing apps have capabilities not native to the models themselves.

31. hbn ◴[] No.44379180[source]
Yeah I'm not sure why they'd think my point was that LLMs are useless. Clearly I'm integrating them into my work, I just think Copilot is the dumbest. It's given me the most nonsensical answers like the example I provided, and it's the one I use the least. Which is even crazier when you consider we're on a paid version of Copilot and I just use free ChatGPT and Claude.
replies(1): >>44379870 #
32. hluska ◴[] No.44379870{3}[source]
Your entire comment sure read a lot like you were calling the tools useless. You even used the worst possible prompt to make your point. That’s likely why people are reacting badly.
replies(1): >>44381750 #
33. hluska ◴[] No.44379880{5}[source]
Try different ones out and learn which works best for what type of work?
replies(1): >>44380730 #
34. hluska ◴[] No.44379891{5}[source]
It’s simple - practice using them instead of complaining. Maybe you’ll figure out the differences on your own.
replies(2): >>44379979 #>>44380708 #
35. BobaFloutist ◴[] No.44379979{6}[source]
Imagine if this is what people suggested when I asked what kind of screwdriver I should use for a given screw, because they're all labelled, like, "Phillips. Phillips 2.0. Phillips.2.second. Phillips.2.second.version 2.0. Phillips Head Screwdriver. Phillips.2.The.Second.Version. Phillips.2.the.second.Version 2.0"
36. bastard_op ◴[] No.44380649[source]
With ChatGPT, picking a model usually feels like "Which of the Three Stooges would you like to talk to: Curly, Larry, or Moe (or worse, Curly Joe)?" I usually only end up using o3 because GPT-4o is just that bad, so why would I ever want to talk to a lesser stooge?

If you're paying by API usage it probably makes more sense to talk to a lesser stooge where possible, but on a standard Pro plan I just find the lesser models aren't worth the frustration they cause.

37. runako ◴[] No.44380708{6}[source]
As a person who uses LLMs daily, I do in fact do this. Couple problems with this approach:

- there are billions of people who are not accustomed to using software this way, who are in the expected target market for this software. Most people cannot tell you the major version number of their mobile OS.

- this approach requires each individual to routinely perform experiments with the expanding firmament of models and versions. This is obviously user-hostile.

Anyway, my hot take here is that making things easier for users is better. I understand that is controversial on this site.

38. runako ◴[] No.44380730{6}[source]
Without getting too much into semantics, I would suspect that most individuals would have trouble classifying their "type of work" against an opaque set of "type of work" classifiers buried in a model.
replies(1): >>44380977 #
39. elbear ◴[] No.44380977{7}[source]
He was suggesting that you try different models for the same thing and see which output you like best. It's tedious but at least you get an answer.
40. hbn ◴[] No.44381750{4}[source]
I said the Copilot is the worst competitor in the space.

Where did I say anything in general about LLMs being useless?

41. muzani ◴[] No.44383762{4}[source]
I always thought it was packaged with the model.
42. setopt ◴[] No.44385297{4}[source]
On the other hand, ChatGPT seems to be getting better at knowing when it should Google something for me rather than hallucinate something.

Shouldn’t asking a more expensive model for input be a similar level of «tool use»?