Although I must say that for certain Docker passthrough cases, the debugging logs just aren't as detailed.
What fundamentally solves the issue is to use an ONNX version of the model.
And like a junior dev it ran into some problems and needed some nudges. Also like a junior dev it consumed energy resources while doing it.
In the end I like that the chunk size of work that we can delegate to LLMs is getting larger.
There are people who can get tasks like this done without getting blocked waiting for external input, which I think is the intended comparison. There's a level of intuition that junior devs and LLMs don't have that senior devs do.
Sometimes looking at the same type of code and the same infra day in and day out makes you rusty. In my olden days, I did something different every week, and I had more free time to experiment.
No more fighting with hardcoded cuda:0 everywhere.
The only pain point is that you’ll often have to manually convert a PyTorch model from Hugging Face to ONNX unless it’s very popular.
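For what it's worth, a minimal sketch of what that usually looks like, assuming the model is covered by the optimum exporter (the model id, paths, and install line here are just placeholders):

    # Assumption: pip install optimum[exporters] onnxruntime transformers
    # Export step (model id is a placeholder, swap in whatever you need):
    #   optimum-cli export onnx --model distilbert-base-uncased-finetuned-sst-2-english onnx_model/

    import onnxruntime as ort
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("onnx_model")
    session = ort.InferenceSession(
        "onnx_model/model.onnx",
        # no hardcoded cuda:0 here: the runtime picks CUDA if present and falls back to CPU
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )

    inputs = tokenizer("the OCR pipeline seems to work", return_tensors="np")
    logits = session.run(None, dict(inputs))[0]
    print(logits)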
It’s just that when I review the code, I would do things differently because the agent doesn’t have experience with our codebase. Although it is getting better at in-context learning from the existing code, it is still seeing all of it for the “first time”.
It’s not a junior dev, it’s just a dev perpetually in their first week at a new job. A pretty skilled one, at that!
and a lot of things translate. How well do you onboard new engineers? Well-written code is easier to read and modify, tests help maintain correctness while showing examples, etc.
If they know what they're doing and it's not an exploratory task where the most efficient way to do it is by trial and error? Quite a few. Not always, but often.
That skill seems to have very little value in today's world though.
Hugging Face has incredible reach but poor UX, and PyTorch installs remain fragile. There’s real space here for a platform that makes this all seamless, maybe even something that auto-updates a local SSD with fresh models to try every day.
I've had the same "problem" and feel like this is the major hazard involved. It is tricky to validate the written work Claude (or any other LLM) produces due to high levels of verbosity, and the temptation to say "well, it works!"
As ever though, it is impressive what we can do with these things.
If I were Simon, I might have asked Claude (as a follow up) to create a minimal ansible playbook, or something of that nature. That might also be more concise and readable than the notes!
There’s also a factor of the young being very confident that they’re right ;)
> Claude declared victory and pointed me to the output/result.mmd file, which contained only whitespace. So OCR had worked but the result had failed to be written correctly to disk.
Given the importance of TDD in this style of continual agentic loop, I was a bit surprised to see that the author only seems to have provided an input but not an actual expected output.
Granted, this is more difficult with OCR since you really don't know how well DeepSeek-OCR might perform, but a simple Jaccard sanity test between a very legible input image and its expected output text would have made it a little more hands-off.
EDIT: After re-reading the article, I guess this was more of a test to see if DeepSeek-OCR would run at all. But I bet you could set up a pretty interesting TDD harness using the aforementioned algorithm, with an LLM in a REPL trying to optimize Tesseract parameters against specific document types, which was ALWAYS such a pain in the past.
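Something like this is what I had in mind; the expected-text path and the 0.8 threshold are made up for illustration:

    # Jaccard sanity check: compare the set of words the OCR produced against
    # the words expected from a very legible test image.

    def jaccard(a: set, b: set) -> float:
        # |intersection| / |union|; 1.0 means identical word sets
        return len(a & b) / len(a | b) if (a or b) else 1.0

    def words(text: str) -> set:
        return set(text.lower().split())

    expected = words(open("tests/legible_page.expected.txt").read())
    actual = words(open("output/result.mmd").read())

    score = jaccard(expected, actual)
    assert score > 0.8, f"OCR output drifted too far from expected (jaccard={score:.2f})"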
NVidia is a hardware company at heart.
They do create and sell amazing hardware.
But like almost all hardware makers I know, they totally suck at software. Their ecosystem is an effing nightmare (drivers, compilers, etc...). It's a pure culture issue, where:
a) software always comes as an afterthought
b) the folks in charge of engineering are largely HW folks. They think like HW engineers, and the resulting software stack looks exactly like a piece of silicon: opaque, static, inflexible and, most important of all, never designed to be understood/looked at/reworked.
I suspect the reason all their software is closed-source is not commercial; they're just ashamed as a company to show the world how shitty their SWE skills are.
EDIT: did you see my comment when it was first posted? The topic was being dumped on at that time and drowning out the signal. I promise I'm not part of a dark pattern.
> From the paper: Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10×), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20×, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs.
Its main purpose is to be a compression algorithm from text to image: throw away the text because it costs too many tokens, keep the image in the context window instead of the text, generate some more text, and when text accumulates again, compress the new text into an image, and so on.
The argument is that pictures store a lot more information than words; "a picture is worth a thousand words", after all. Chinese characters are pictograms, so it doesn't seem that strange to think that, but I don't buy it.
I am doing some experiments with removing text as an input for LLMs and replacing it with its summary, and I have already reduced the context window by 7 times. I am still figuring out the best way to achieve that, but 10 times is not far off. My experiments involve novel writing, not general stuff, but still, it works very well to just replace text with its summary.
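A toy sketch of that loop as I understand it (the rendering uses Pillow; the vision side and the token budget are hand-waved, since the real model's encoder does that part):

    # Toy sketch only: accumulate text, and once it gets too long, freeze it
    # into a rendered image (which the model would then see as vision tokens).
    import textwrap
    from PIL import Image, ImageDraw

    def render_to_image(text: str, width: int = 1024, line_height: int = 16) -> Image.Image:
        lines = textwrap.wrap(text, width=120)
        img = Image.new("RGB", (width, line_height * (len(lines) + 1)), "white")
        draw = ImageDraw.Draw(img)
        for i, line in enumerate(lines):
            draw.text((8, i * line_height), line, fill="black")
        return img

    context_images: list[Image.Image] = []  # older text, kept only as pictures
    context_text = ""                       # recent text, still verbatim

    def add_text(new_text: str, max_words: int = 4000) -> None:
        global context_text
        context_text += new_text
        if len(context_text.split()) > max_words:
            context_images.append(render_to_image(context_text))
            context_text = ""  # the text itself is thrown away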
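In case it's useful to anyone, the core of it is roughly this (summarize() stands in for whatever model call you use, and keep_last is arbitrary):

    # Keep the most recent chapters verbatim and replace everything older
    # with a single running summary. summarize() is a placeholder.
    def compress_history(chapters: list[str], summarize, keep_last: int = 1) -> list[str]:
        if len(chapters) <= keep_last:
            return chapters
        older, recent = chapters[:-keep_last], chapters[-keep_last:]
        return [summarize("\n\n".join(older))] + recent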
If an image is worth so many words, why not use it for programming after all? There we go, visual programming again!
I mean, some really dense languages basically do, like APL using symbols we (non-mathematicians) rarely even see.
That's wildly different from:
> I see a lot of snark in the comments
The first is essentially "People should talk about the actual content more", while the second is "People are mocking the submission". If no other top-level comments actually seem to be mocking, it seems fair that someone reacts to the second sentiment.
> The topic was being dumped on at that time and drowning out the signal.
There were all of 7 comments when you made yours (https://ditzes.com/item/45646559), most of them child comments, none of them mocking the submission or the author.
> within a 10× compression ratio, the model’s decoding precision can reach approximately 97%, which is a very promising result. In the future, it may be possible to achieve nearly 10× lossless contexts compression through text-to-image approaches.
Graphs and charts should be represented as math, i.e. text; that's what they are anyway. Even when they are presented as images, it is much more economical to represent them as math.
The function f(x) = x can be represented by a 10×10-pixel image, a 100×100-pixel image, or one with infinitely many pixels.
A math function is worth infinite pictures.
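A quick back-of-the-envelope comparison for that f(x) = x example: the formula stays a handful of bytes, while a raster of the same line grows quadratically with resolution.

    import numpy as np

    formula = "f(x) = x"
    print(len(formula.encode()), "bytes for the formula, at any resolution")  # 8 bytes

    for n in (10, 100, 1000):
        # a 1-bit n x n raster with just the diagonal y = x drawn in
        img = np.zeros((n, n), dtype=bool)
        idx = np.arange(n)
        img[n - 1 - idx, idx] = True
        print(f"{n}x{n} raster:", np.packbits(img).nbytes, "bytes")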
Also the machine is well north of 100K when you include the RF ADCs and DACs in there that run a radar.
Worst case, I have multiple.