Although I must say that for certain Docker passthrough cases, the debugging logs just aren't as detailed.
What fundamentally solves the issue is to use an ONNX version of the model.
And like a junior dev it ran into some problems and needed some nudges. Also like a junior dev it consumed energy resources while doing it.
In the end I like that the chunk size of work that we can delegate to LLMs is getting larger.
There are people who can get tasks like this done without getting blocked waiting for external input, which I think is the intended comparison. There's a level of intuition that junior devs and LLMs don't have that senior devs do.
Sometimes looking at the same type of code and the same infra day in and day out makes you rusty. In my olden days, I did something different every week, and I had more free time to experiment.
No more fighting with hardcoded cuda:0 everywhere.
The only pain point is that you’ll often have to manually convert a PyTorch model from Hugging Face to ONNX unless it’s very popular.
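For what it's worth, a minimal sketch of what that usually looks like, assuming the model is covered by the optimum exporter (the model id, paths, and install line here are just placeholders):

    # Assumption: pip install optimum[exporters] onnxruntime transformers
    # Export step (model id is a placeholder, swap in whatever you need):
    #   optimum-cli export onnx --model distilbert-base-uncased-finetuned-sst-2-english onnx_model/

    import onnxruntime as ort
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("onnx_model")
    session = ort.InferenceSession(
        "onnx_model/model.onnx",
        # no hardcoded cuda:0 here: the runtime picks CUDA if present and falls back to CPU
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )

    inputs = tokenizer("the OCR pipeline seems to work", return_tensors="np")
    logits = session.run(None, dict(inputs))[0]
    print(logits)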
It’s just that when I review the code, I would do things differently because the agent doesn’t have experience with our codebase. Although it is getting better at in-context learning from the existing code, it is still seeing all of it for the “first time”.
It’s not a junior dev, it’s just a dev perpetually in their first week at a new job. A pretty skilled one, at that!
and a lot of things translate. How well do you onboard new engineers? Well-written code is easier to read and modify, tests help maintain correctness while showing examples, etc.
If they know what they're doing and it's not an exploratory task where the most efficient way to do it is by trial and error? Quite a few. Not always, but often.
That skill seems to have very little value in today's world though.
Hugging Face has incredible reach but poor UX, and PyTorch installs remain fragile. There’s real space here for a platform that makes this all seamless, maybe even something that auto-updates a local SSD with fresh models to try every day.
I've had the same "problem" and feel like this is the major hazard involved. It is tricky to validate the written work Claude (or any other LLM) produces due to high levels of verbosity, and the temptation to say "well, it works!"
As ever though, it is impressive what we can do with these things.
If I were Simon, I might have asked Claude (as a follow up) to create a minimal ansible playbook, or something of that nature. That might also be more concise and readable than the notes!
There’s also a factor of the young being very confident that they’re right ;)
> Claude declared victory and pointed me to the output/result.mmd file, which contained only whitespace. So OCR had worked but the result had failed to be written correctly to disk.
Given the importance of TDD in this style of continual agentic loop, I was a bit surprised to see that the author only seems to have provided an input but not an actual expected output.
Granted, this is more difficult with OCR since you really don't know how well DeepSeek-OCR might perform, but a simple Jaccard sanity test between a very legible input image and its expected output text would have made it a little more hands-off.
EDIT: After re-reading the article, I guess this was more of a test to see if DeepSeek-OCR would run at all. But I bet you could set up a pretty interesting TDD harness using the aforementioned algorithm, with an LLM in a REPL trying to optimize Tesseract parameters against specific document types, which was ALWAYS such a pain in the past.
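Something like this is what I had in mind; the expected-text path and the 0.8 threshold are made up for illustration:

    # Jaccard sanity check: compare the set of words the OCR produced against
    # the words expected from a very legible test image.

    def jaccard(a: set, b: set) -> float:
        # |intersection| / |union|; 1.0 means identical word sets
        return len(a & b) / len(a | b) if (a or b) else 1.0

    def words(text: str) -> set:
        return set(text.lower().split())

    expected = words(open("tests/legible_page.expected.txt").read())
    actual = words(open("output/result.mmd").read())

    score = jaccard(expected, actual)
    assert score > 0.8, f"OCR output drifted too far from expected (jaccard={score:.2f})"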
NVidia is a hardware company at heart.
They do create and sell amazing hardware.
But like almost all hardware makers I know, they totally suck at software. Their ecosystem is an effing nightmare (drivers, compilers, etc...). It's a pure culture issue, where:
a) software always comes as an afterthought
b) the folks in charge of engineering are largely HW folks. They think like HW engineers, and the resulting software stack looks exactly like a piece of silicon: opaque, static, inflexible and, most important of all, never designed to be understood/looked at/reworked.
I suspect the reason all their software is closed-source is not commercial; they're just ashamed as a company to show the world how shitty their SWE skills are.
EDIT: did you see my comment when it was first posted? The topic was being dumped on at that time and drowning out the signal. I promise I'm not part of a dark pattern.
> From the paper: Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10×), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20×, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs.
Its main purpose is to be a compression algorithm from text to image: throw away the text because it costs too many tokens, keep the image in the context window instead of the text, generate some more text, and when text accumulates again, compress the new text into an image, and so on.
The argument is that pictures store a lot more information than words; "a picture is worth a thousand words", after all. Chinese characters are pictograms, so it doesn't seem that strange to think that, but I don't buy it.
I am doing some experiments with removing text as an input for LLMs and replacing it with its summary, and I have already reduced the context window by 7 times. I am still figuring out the best way to achieve that, but 10 times is not far off. My experiments involve novel writing, not general stuff, but still, it works very well to just replace text with its summary.
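A toy sketch of that loop as I understand it (the rendering uses Pillow; the vision side and the token budget are hand-waved, since the real model's encoder does that part):

    # Toy sketch only: accumulate text, and once it gets too long, freeze it
    # into a rendered image (which the model would then see as vision tokens).
    import textwrap
    from PIL import Image, ImageDraw

    def render_to_image(text: str, width: int = 1024, line_height: int = 16) -> Image.Image:
        lines = textwrap.wrap(text, width=120)
        img = Image.new("RGB", (width, line_height * (len(lines) + 1)), "white")
        draw = ImageDraw.Draw(img)
        for i, line in enumerate(lines):
            draw.text((8, i * line_height), line, fill="black")
        return img

    context_images: list[Image.Image] = []  # older text, kept only as pictures
    context_text = ""                       # recent text, still verbatim

    def add_text(new_text: str, max_words: int = 4000) -> None:
        global context_text
        context_text += new_text
        if len(context_text.split()) > max_words:
            context_images.append(render_to_image(context_text))
            context_text = ""  # the text itself is thrown away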
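In case it's useful to anyone, the core of it is roughly this (summarize() stands in for whatever model call you use, and keep_last is arbitrary):

    # Keep the most recent chapters verbatim and replace everything older
    # with a single running summary. summarize() is a placeholder.
    def compress_history(chapters: list[str], summarize, keep_last: int = 1) -> list[str]:
        if len(chapters) <= keep_last:
            return chapters
        older, recent = chapters[:-keep_last], chapters[-keep_last:]
        return [summarize("\n\n".join(older))] + recent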
If an image is worth so many words, why not use it for programming after all? There we go, visual programming again!
I mean, some really dense languages basically do, like APL using symbols we (non-mathematicians) rarely even see.
That's wildly different from:
> I see a lot of snark in the comments
The first is essentially "People should talk about the actual content more", while the second is "People are mocking the submission". If no other top-level comments actually seem to be mocking, it seems fair that someone reacts to the second sentiment.
> The topic was being dumped on at that time and drowning out the signal.
There were all of 7 comments when you made yours (https://ditzes.com/item/45646559), most of them child comments, none of them mocking the submission or the author.
> within a 10× compression ratio, the model’s decoding precision can reach approximately 97%, which is a very promising result. In the future, it may be possible to achieve nearly 10× lossless contexts compression through text-to-image approaches.
Graphs and charts should be represented as math, i.e. text; that's what they are anyway. Even when they are presented as images, it is much more economical to represent them as math.
The function f(x) = x can be represented by a 10×10-pixel image, a 100×100-pixel image, or one with infinitely many pixels.
A math function is worth infinite pictures.
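A quick back-of-the-envelope comparison for that f(x) = x example: the formula stays a handful of bytes, while a raster of the same line grows quadratically with resolution.

    import numpy as np

    formula = "f(x) = x"
    print(len(formula.encode()), "bytes for the formula, at any resolution")  # 8 bytes

    for n in (10, 100, 1000):
        # a 1-bit n x n raster with just the diagonal y = x drawn in
        img = np.zeros((n, n), dtype=bool)
        idx = np.arange(n)
        img[n - 1 - idx, idx] = True
        print(f"{n}x{n} raster:", np.packbits(img).nbytes, "bytes")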
Also the machine is well north of 100K when you include the RF ADCs and DACs in there that run a radar.
Worst case, I have multiple.