I guess the -7B might run on my 16GB AMD card?
https://openrouter.ai/deepseek/deepseek-chat-v3-0324:free
do you think this needs attention?
We haven't hit the wall yet.
I've seen some people claim it should make the models better at text, but I find that a little difficult to believe without data.
To pick just the most popular one, https://lmarena.ai/?leaderboard= has GPT-4-0314 ranked 83rd now.
Are there any good sources I can read up on for estimating the hardware specs required to run 7B, 13B, 32B, etc. models locally?
It’ll happily talk about Bloody Sunday.
Probably a great model, but it worries me that it has such restrictions.
Sure OpenAI also has lots of restrictions, but this feels more like straight up censorship since it’ll happily go on about bad things the governments of the west have done.
[0]: https://twm.me/posts/calculate-vram-requirements-local-llms/
Gemini isn't too special; it's comparable to DeepSeek, or a bit behind, but it is damn fast, so maybe skip Gemini for serious tasks.
Grok and Gemini can be used as deep research models, which I think I like? Grok seems to have just taken the DeepSeek approach and scaled it with their hypermassive GPU cluster, so I suspect Grok can also be replaced.
What I truly believe in is Claude.
I'm not sure why, but Claude really feels good, especially for coding.
For anything else I might use something like DeepSeek or other Chinese models.
I used cerebras.ai and holy moly they are so fast. I used the DeepSeek 70B model and it is still incredibly fast, and my time matters, so I really like the open-source way: it lets companies like Cerebras focus on what they do best.
I am not sure about Nvidia, though. Nvidia seems so tied to Western AI that DeepSeek's improvements impact Nvidia.
I do hope Nvidia cuts GPU prices, though I don't think they have much incentive.
That said none of the recent string of releases has done much yet to "smash a wall", they've just met the larger proprietary models where they already were. I'm hoping R2 or the like really changes that by showing ChatGPT 3->3.5 or 3.5->4 level generational jumps are still possible beyond the current state of the art, not just beyond current models of a given size.
Any kind of media with zero or near-zero copying/distribution costs becomes a deflationary race to the bottom. Someone will eventually release something that's free, and at that point nothing can compete with free unless it's some kind of very specialized offering. Then you run into the problem the OP described: how do you fund free? Answer: ads. Now the customer is the advertiser, not the user/consumer, which is why most media converges on trash.
https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/
I think the 32b models are actually good enough that I might stop paying for ChatGPT plus and Claude.
I get around 20 tok/second on my M3, and I can get 100 tok/second on smaller or quantized models. 80-100 tok/second is the sweet spot for interactive usage; above that you basically can't read as fast as it generates.
I also really like the QwQ reasoning model. I haven't gotten around to trying locally hosted models for agents and RAG; coding agents especially are what I'm interested in. I feel like 20 tok/second is fine if it's just running in the background.
Anyways, would love to know others' experiences; that was mine this weekend. The way it's going I really don't see a point in paying. I think on-device is the near future, and they should just charge a licensing fee like DB providers do for enterprise support and updates.
If you were paying $20/mo for ChatGPT a year ago, the 32B models are basically at that level: slightly slower and slightly lower quality, but useful enough to consider cancelling your subscriptions at this point.
Companies will have to detect and police distilling if they want to keep their moat. Maybe you have to have an enterprise agreement (and arms control waiver) to get GPT-6-large API access.
Also, I have seen that once an open-source LLM is released to the public, even though you can access it on any website hosting it, most people still prefer to use it from the company that created the model.
DeepSeek released its revenue figures and they're crazy good.
And no, they didn't have full racks of H100s.
Also one more thing. Open source has always had an issue of funding.
Also, they are not completely open source, just open weights. Yes, you can fine-tune them, but from my limited knowledge fine-tuning has limitations, so keeping the training data proprietary also supports my earlier idea of consulting for other AI companies.
Yes, it's not a hugely profitable venture; imo it's just decently profitable, but the current hype around AI makes it lucrative for companies.
Also, I think this might be a winner-takes-all market, which increases competition, but in a healthy way.
What DeepSeek did, releasing the open-source model and then going out of its way to release other open-source projects that could themselves have been worth a few million (bycloud said it), helps AI innovation in general.
Haven’t tried this new model locally, but I agree with you that it looks like there is a secondary censorship going on. If I ask it to list the 10 worst catastrophes of recent Chinese history with Thinking enabled then it’ll actually think about the massacre. Gets blocked very quickly, but it doesn’t look like the thinking is particularly censored.
Perplexity released DeepSeek R1 1776 (I think; I forget the exact name). It basically removes the Chinese censorship; yes, you can ask it about Tiananmen Square.
I think the next iteration of these AI model ads will be sneaky, which might make them hard to remove.
Though it's funny that you comment on Chinese censorship while American censorship is apparently fine, lol.
That was the only thing I could think to ask really. Do you have a better example maybe?
Also, China doesn't have access to that many GPUs because of the CHIPS Act.
And I hate it. I hate it when America sounds more communist than China, who open-source their stuff because free markets.
I actually think more countries need to invest in AI, rather than leaving it to companies chasing profit.
This could be the decision that can impact the next century.
Commoditizing the AI/intelligence part means that the main advantage isn't the bits; it's the atoms. Physical dexterity, social skills, and manufacturing skills will gain more of a comparative advantage vs. intelligence work in the future as a result; AI makes the old economy new again in the long term. It also lowers the value of AI investments in that they can no longer command first-mover/monopoly-like pricing for what is a very large capex cost, undermining US investment in what is their advantage. As long as it is strategic, it doesn't necessarily need to be economic on its own.
It's really frustrating that I can't just write off Apple as evil monopolists when they put out hardware like this.
(this isn't idle prognostication hinging on my personal hobby horse. I've got skin in the game: I'm virtually certain I have the only AI client that can reliably do tool calls with open models in an agentic setting. llama.cpp got a massive contribution to make this happen, and the big boys who bother, like ollama, are still using a dated JSON-schema-forcing method that doesn't comport with recent local model releases that can do tool calls. IMHO we're comfortably past the point where products using these models can afford to focus on conversational chatbots; that's cute, but a commodity to give away, per standard 2010s SV thinking)
* OpenAI's can but are a little less...grounded?...situated? i.e. it can't handle "read this file and edit it to do $X". Same-ish for Gemini, though, sometimes I feel like the only person in the world who actually waits for the experimental models to go GA, as per letter of the law, I shouldn't deploy them until then
https://imgur.com/a/censorship-much-CBxXOgt
It's not even nefarious: they don't want the model spewing out content that will get them in trouble in the most general sense. It just so happens most governments have things that will get you in trouble.
The US is very obsessed with voter manipulation these days, so OpenAI and Anthropic's models are extra sensitive if the wording implies they're being used for that.
China doesn't like talking about past or ongoing human rights violations, so their models will be extra sensitive about that.
That will help you quickly calculate the model's VRAM usage as well as the VRAM usage of the context length you want. You can put "Qwen/Qwen2.5-VL-32B-Instruct" in the "Model (unquantized)" field. Funnily enough, the calculator lacks an option for running the model unquantized, presumably because nobody worried about VRAM bothers running >8-bit quants.
Fwiw, Claude Sonnet 3.5 100% had some sort of agentic loop x precise file editing trained into it. Wasn't obvious to me until I added a MCP file server to my client, and still isn't well-understood outside a few.
I'm not sure on-device models will be able to handle it any time soon because it relies on just letting it read the whole effing file.
Separately...
I say I don't understand why no other model is close, but it makes sense. OpenAI has been focused on reasoning, Mistral, I assume is GPU-starved, and Google...well, I used to work there, so I have to stop myself from going on and on. Let's just say I assume that there wouldn't be enough Consensus Built™ to do something "scary" and "experimental" like train that stuff in.
This also isn't going so hot for Sonnet IMHO.
There's vague displeasure and assumptions it "changed" the last week, but, AFAICT the real problem is that the reasoning stuff isn't as "trained in" as, say, OpenAI's.
This'd be a good thing except you see all kinds of whacky behavior.
One of my simple "read file and edit" queries yesterday did about 60 pages worth of thinking, and the thinking contained 130+ separate tool calls that weren't actually called, so it was just wandering around in the wilderness, reacting to hallucinated responses it never actually got.
Which plays into another one of my hobbyhorses, chat is a "hack" on top of an LLM. Great. So is reasoning, especially in the way Anthropic implemented it. At what point are the abstractions too much, so much that it's unreliable? 3.7 Sonnet may be answering that, because when it fails, all that thinking looks like the agentic loop cooked into Sonnet 3.5. So maybe it's altogether too much to have chat, reasoning, and fully reliable agentic loops...
There is additional memory used for context / KV cache. So, if you use a large context window for a model, you will need to factor in several additional gigabytes for that, but it is much harder to provide a rule of thumb for that overhead. Most of the time, the overhead is significantly less than the size of the model, so not 2x or anything. (The size of the context window is related to the amount of text/images that you can have in a conversation before the LLM begins forgetting the earlier parts of the conversation.)
The most important thing for local LLM performance is typically memory bandwidth. This is why GPUs are so much faster for LLM inference than CPUs, since GPU VRAM is many times the speed of CPU RAM. Apple Silicon offers rather decent memory bandwidth, which makes the performance fit somewhere between a typical Intel/AMD CPU and a typical GPU. Apple Silicon is definitely not as fast as a discrete GPU with the same amount of VRAM.
That's about all you need to know to get started. There are obviously nuances and exceptions that apply in certain situations.
A 32B model at 5 bits per parameter will comfortably fit onto a 24GB GPU and provide decent speed, as long as the context window isn't set to a huge value.
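A minimal sketch of that rule of thumb. The per-token KV-cache cost and the fixed overhead constant are rough assumptions that vary by model, not exact figures:

```python
def estimate_vram_gb(params_b: float, bits: int, context_tokens: int = 8192,
                     kv_bytes_per_token: float = 0.25e6) -> float:
    """Rough VRAM estimate: weights + KV cache + fixed runtime overhead.

    params_b           -- model size in billions of parameters
    bits               -- quantization width (e.g. 4, 5, 8, 16)
    kv_bytes_per_token -- assumed KV-cache cost per token (varies by model)
    """
    weights_gb = params_b * 1e9 * bits / 8 / 1e9
    kv_gb = context_tokens * kv_bytes_per_token / 1e9
    overhead_gb = 1.0  # runtime buffers, CUDA context, etc. (assumption)
    return weights_gb + kv_gb + overhead_gb

# A 32B model at 5-bit: 32 * 5/8 = 20 GB of weights, plus cache and
# overhead, comes in just under a 24GB card at a moderate context size.
print(round(estimate_vram_gb(32, 5), 1))
```

With these assumptions the total lands around 23 GB, consistent with the "fits on a 24GB GPU as long as the context window isn't huge" claim above.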
Thanks for the recommendations. I'll give gemma3:12b a try and, if needed, go down to gemma3:4b.
Since you’re a student most of the providers/clouds offer student credits and you can also get loads of credits from hackathons.
As an outsider, it feels like very little progress is being made on the energy issue. I genuinely think AI could be accelerated so much more if energy were cheaper/greener.
Many people "just" use 4x consumer GPUs like the 3090 (24GB each), which scales well. They'd probably buy a mining rig, an EPYC CPU, a mainboard with sufficient PCIe lanes, PCIe risers, a 1600W PSU (might need to limit the GPUs to 300W), and 128GB RAM. Depending on what you pay for the GPUs, that'll be 3.5-4.5k.
I have set up the same thing at work for my colleagues, and they find it better than openai for their tasks.
But, I don’t really see the connection on the flip side. Why should proprietary AI be associated with communism? If anything I guess a communist handling of AI would also be to share the model.
The models are iterative improvements, but I haven't seen night and day differences since GPT3 and 3.5
Typical quantization to 4-bit cuts a 32B model to 16GB of weights plus some runtime data, which makes it possibly usable (if slow) on a 16GB GPU. You can sometimes viably use even smaller quantizations, which reduce memory use further.
https://github.com/huggingface/open-r1
https://ollama.com/huihui_ai/deepseek-r1-abliterated
I was mistaken though, it is more than just a system prompt causing the censorship.
This has clearly been part of a viable business model for a long time. Why should LLM models be any different?
Big companies have so much data they won't be having a human look at mine specifically. Some small place probably has the engineer looking at my logs as user #4.
Also, big companies have security teams whose job is securing the data, and it won't be going over some unencrypted link to Cloudflare because the OP was too lazy to set up HTTPS certs.
- Those are 1-turn scores: at multiturn, 4o is 3x as good as the 3b
- BFCL is generally "turn natural language into an API call"; multiturn then involves making another API call.
- I hope to inspire work towards an open model that can eat the paid models sooner rather than later
- trained quite specifically on an agent loop with tools read_files and edit_file (you'll also probably do at least read_directory and get_shared_directories, search_filenames and search_files_text are good too), bonus points for cli_command
- IMHO, this is much lower hanging-fruit than ex. training an open computer-vision model, so I beseech thee, intrepid ML-understander, to fill this gap and hear your name resound throughout the age
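For concreteness, the tool surface described above might be declared like this. Only the tool names come from the comment; the JSON-schema shapes are my assumption:

```python
# Hypothetical tool definitions for the agent loop described above.
# The names (read_files, edit_file, ...) are from the comment;
# the schemas themselves are illustrative, not any model's actual spec.
TOOLS = [
    {
        "name": "read_files",
        "description": "Return the contents of one or more files.",
        "parameters": {
            "type": "object",
            "properties": {
                "paths": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["paths"],
        },
    },
    {
        "name": "edit_file",
        "description": "Replace a substring of a file with new text.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "old_text": {"type": "string"},
                "new_text": {"type": "string"},
            },
            "required": ["path", "old_text", "new_text"],
        },
    },
]

print([t["name"] for t in TOOLS])
```

read_directory, get_shared_directories, the search tools, and cli_command would be declared the same way; the point is that the whole surface is small enough to train against directly.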
I ran https://huggingface.co/mlx-community/Qwen2.5-VL-32B-Instruct... using uv (so no need to install libraries first) and https://github.com/Blaizzy/mlx-vlm like this:
uv run --with 'numpy<2' --with mlx-vlm \
python -m mlx_vlm.generate \
--model mlx-community/Qwen2.5-VL-32B-Instruct-4bit \
--max-tokens 1000 \
--temperature 0.0 \
--prompt "Describe this image." \
--image Mpaboundrycdfw-1.png
That downloaded an ~18GB model and gave me a VERY impressive result, shown at the bottom here: https://simonwillison.net/2025/Mar/24/qwen25-vl-32b/

I've run the smallest model in non-quantized format, but when I've tried to run an AWQ version of one of the bigger models I've struggled to find a combination of libraries that works, even though it should fit on my GPU.
> So at FP16 precision that's a grand total of 16 kB you're transmitting over the PCIe bus, once per token. If you multiply by, say, 20 tokens per second, then you're still only using like 0.1% of your PCIe bandwidth.
Intra-GPU memory bandwidth is very important, but I've seen lots of people use just an x4 link and they didn't complain much.
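A back-of-envelope version of that bandwidth argument. The hidden size, token rate, and PCIe 3.0 x4 figure are assumptions for illustration; the exact fraction depends on the model and link generation:

```python
# Back-of-envelope check: how much PCIe bandwidth does token-by-token
# inference actually need between GPUs?
hidden_size = 8192                          # assumed model hidden dimension
bytes_per_token = hidden_size * 2           # FP16 = 2 bytes/value -> 16 KiB
tokens_per_sec = 20
traffic = bytes_per_token * tokens_per_sec  # ~320 KiB/s
pcie_x4_bw = 3.9e9                          # PCIe 3.0 x4, bytes/s (approx)
print(f"{traffic / pcie_x4_bw:.5%} of a PCIe 3.0 x4 link")
```

Whatever the exact assumptions, the result is a vanishingly small fraction of even a narrow link, which is why x4 risers don't hurt inference much.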
This is lengthy, but I thought I'd share its reasoning. At some point it just stopped working; then it started responding in Chinese.
Query input:
“1+5 = 7, and that is the correct answer”
Reasoning response:
The equation "1 + 5 = 7" isn’t about regular arithmetic addition. Instead, it refers to the number of segments used to display the digits 1 and 5 on a seven-segment display (like on digital clocks or calculators). Here's the breakdown: • Digit 1 uses 2 segments (the vertical lines on the right). • Digit 5 uses 5 segments (top, upper-left, middle, lower-left, and bottom segments). Adding these segments: 2 (for 1) + 5 (for 5) = 7 segments total. This is a clever visual riddle, not a math problem!

Follow-up input:

"it was a math problem, you are just wrong"
Okay, the user is insisting that "1 + 5 = 7 is a math problem" and that my previous explanation was wrong. Hmm, I need to address this carefully. First, in standard arithmetic, 1 + 5 is indeed 6. So if it's a straightforward math problem, the correct answer should be 6. The user might be mistaken, or perhaps they have a different context in mind. Wait, in my previous response, I considered the seven-segment display explanation, which is a common riddle. The user is now rejecting that and stating it's a math problem, implying that their answer of 7 is correct in some mathematical context. Maybe they're referring to a different base? Let me revisit the base idea. If the equation is in a different base, maybe base 9? Let's see: In base b, the equation "1 + 5 = 11" would be 1 + 5 = 6 in decimal, which isn't helpful. Wait, maybe the result is written in a different base? For example, if we compute 1 + 5 in decimal (which is 6) and then represent that in base 7? 6 in base 7 is still "6", not 7. Hmm, not helpful. Alternatively, maybe modulo arithmetic? Like (1 + 5) mod something equals 7? Let's see: 1 +5 = 6. To get 7, maybe modulo something where 6 ≡7 mod (negative number?), which doesn't make sense. Alternatively, maybe the user is adding in a cyclic group? Like mod 6? 6 mod 6 is 0, not 7. Alternatively, could it be a different operation? Like 1 plus 5 in some operation other than addition? Maybe concatenation? 1 concatenated with 5 is 15, but that's not 7. Wait, the user said "another one 1+5 =7". Maybe they're referring to a previous problem where the answer was different, and this is a follow-up? But I don't have context on previous interactions. Alternatively, perhaps the user is joking or trolling, but I should take their statement at face value. Since they insist it's a math problem and that 7 is correct, I need to explore if there's any mathematical framework where this holds. 
Wait, let's consider if it's a sum in a different base where the result is written in that base. Suppose the equation is in base 8. Let's see: In base 8, "7" is still 7 in decimal, so 1 (base8) +5(base8) = 6 (base8), which is 6 in decimal. Not 7. Alternatively, if the answer is written in base 9: 1 +5 =6 in decimal, which is still 6 in base9. Still not 7. Wait a different angle: Maybe the '+' is not addition, but some other operation. Like bitwise XOR? 1 in binary is 01, 5 is 101. XOR gives 110 (6). Not 7. Or maybe adding digits of the sum? Like 1+5=6, then 6 is a single digit, so adding digits 6, but that's not helpful. Wait, the user might have made a mistake, but they're insistent. Since they called it a math problem, perhaps they're using a different definition of "+"? Like 1 +5 in some algebraic structure where the result is 7? That's possible, but without more context, it's hard to define. Alternatively, maybe the numbers are not in base 10. Suppose the numbers are in base 12. Wait, 1+5 is still 6. Hmm. Wait, maybe the user is considering the numbers as Roman numerals? I+V=VI which is 6. Still not 7. Alternatively, maybe it's a sequence where 1+5 is the 7th term? Unlikely. Alternatively, it could be a trick with letters. Like 1 is the first letter (A), 5 is the fifth letter (E), so AE... but that doesn't lead to 7. Alternatively, adding their positions: 1 +5 =6, which is F. Still not 7. Alternatively, maybe counting letters in spelled-out words? "One" has 3 letters, "five" has 4, total 7? That could be a different riddle. Let's see: 1 (one) +5 (five) = 3 letters +4 letters =7. That's a possible interpretation! If the problem is "1+5" where each number is spelled out in English and you add the number of letters, then 3+4=7. So maybe that's the case here? Alternatively, if it's counting the strokes to write the numbers. For example, writing '1' takes one stroke, '5' takes two strokes (depending on how you write it)? Not sure. 
Alternatively, the user might have meant 1+5 in a non-standard math problem, like in a clock arithmetic? For example, mod 12. 1+5=6 mod12 is still 6. Hmm, I need to consider that maybe the user intended a different approach that I haven't thought of. Since they are insistent that 7 is correct, perhaps they meant a different operation or context. Alternatively, maybe the problem is written vertically: 1 + 5 ---- 7 But that still doesn't make sense unless there's a carryover. But 1+5 is 6, so even with carryover, you can't get 7. Unless adding digits with some twist. Wait, maybe they're adding the numbers' positions in a different sequence? Like 1st prime plus 5th prime? The first prime is 2, fifth prime is 11. 2+11=13, nope. Alternatively, if you consider 1 and 5 as bits in a binary number, like 1 and 5 (binary 101) concatenated? Not helpful. Alternatively, the problem might be in a different numeral system, like in base 1? But that's not standard. Alternatively, maybe it's a simple mistake, but the user is confident. Since I need to respond, perhaps I should outline the possible interpretations where 7 could be the answer, like the spelled-out letters interpretation. Alternatively, the user could have a typo and meant 2+5=7, but wrote 1+5=7. Alternatively, in some contexts like in programming, maybe 1+5=7 due to a specific coding (like 1 is assigned to a=1, so a+f=6? Not sure. Alternatively, it's a lateral thinking puzzle where the answer is 7 because of some visual or language trick. Wait, another angle: If you add 1 and 5 on a calculator upside down, but that might not help. Alternatively, in temperature scales? For example, converting from Fahrenheit to Celsius or vice versa? 1°F to Celsius is negative, 5°C is 41°F. Not adding to 7. Hmm. 
Given the user's assertion that 1+5=7 is correct in a math problem, I might need to explain that typically 1+5=6, but there are alternative contexts where it might be considered 7, like counting letters in their names (1 is "one" (3 letters), 5 is "five" (4 letters), 3+4=7). That's a possible interpretation. Alternatively, in some base where the addition is represented differently. Alternatively, a trick question where the numbers are not in base 10 but in another base where the representation is different. Alternatively, the user might have seen a problem where adding 1 and 5 gives 7 through some function, like a + b = a + b +1, but that's arbitrary.
I could have missed a paper but it seems very unlikely even closed door research has gotten to the stage of maliciously tuning models to surreptitiously backdoor someone's machine in a way that wouldn't be very easy to catch.
Your threat model may vary.
The code that comes with the model should be treated like any other untrusted code.
I say this because I think that the Perplexity model is tuned on additional information, whereas the abliterated models only include information trained into the underlying model, which is interesting to see.
I expect more and more worthwhile models to natively have <16 bit weights as time goes on but for the moment it's pretty much "8 bit DeepSeek and some research/testing models of various parameter width".
https://chatgpt.com/share/67e1eba1-c658-800e-9161-a0b8b7b683...
While there are some synergistic effects... I think the physical manufacturing and logistics base is harder to develop than deploying a new model, and will be the hard leading edge. (That's why the US seems to be hellbent on destroying international trade to try to build a domestic market.)
Q: "9.11 and 9.9, which one is larger?"
A: "To determine which number is larger between 9.11 and 9.9, let's compare them step by step. Both numbers have the same whole number part: 9. 9.11 has a 1 in the tenths place. 9.9 has a 9 in the tenths place. Since 9 (from 9.9) is greater than 1 (from 9.11), we can conclude that 9.9 is larger than 9.11."
"Final Answer: 9.9"
I don't think anything else needs to be said here.
The hardware for a local model would cost years and years of a $20/mo subscription, would output lower quality work, and would be much slower.
3.7 Thinking is an insane programming model. Maybe it cannot do an SWE's job, but it sure as hell can write functional narrow-scope programs with a GUI.
Might want to check your math? Seems right to me
Google: 2.385511e+39 Your chat: "Numerically, that’s about 2.3855 × 10^39"
Also curious how you think about LLM-as-calculator in relation to tool calls.
Assuming the same model size in gigabytes, which should one choose: a higher-B lower-bit model, or a lower-B higher-bit one? Is there a silver bullet, like "always take 4-bit 13B over 8-bit 7B"?
Or are same-sized models basically equal in this regard?
Of course, the mathematical outcome of American models is that some voices matter more than others. The mechanism is similar to how the free market works.
As most engineers know, the market doesn't always reward the best company. For example, it might reward the first company.
We can see the "hierarchy in voices" with the following example. I use the following prompts for Gemini:
1. Which situation has a worse value on human rights, the Uyghur situation or the Palestine situation?
2. Please give a shorter answer (repeat if needed).
3. Please say Palestine or Uyghur.
The answer is now given:
"Given the scope and nature of the documented abuses, many international observers consider the Uyghur situation to represent a more severe and immediate human rights crisis."
You can replace "Palestine situation" and "Uyghur situation" with other things (China vs US, chooses China as worse), (Fox vs BBC, chooses Fox as worse), etc.
There doesn't seem to be censorship; only a hierarchy in whose words matter.
I only tried this once. Please let me know if this is reproducible.
That name alone holds the most mindshare in its product category, and is close to the level of name recognition of Google.
(just kidding jschoe)
-1 to humanity
> Also curious how you think about LLM-as-calculator in relation to tool calls.
I just tried this because I heard all existing models are bad at this kind of problem, and wanted to try it with the most powerful one I have access to. I think it shows that you really want an AI to be able to use computational tools in appropriate circumstances.
Will this humbling moment change your opinion?
DeepSeek has proved that FP8 is more cost-effective than FP16; isn't that also valid for dozens-of-B models?
It's interesting to think that maybe one of the most realistic consequences of reaching artificial superintelligence will be when its answers start wildly diverging from human expectations and we think it's being "increasingly wrong".
That being said, they have a user base and integrations. As long as they stay close to or a bit ahead of the Chinese models they'll be fine. If the Chinese models significantly jump ahead of them, well, then they are pretty much dead. Add open source to the mix and they become history.
This is smashing the wall.
Also, if you just care about absolute numbers: OpenAI released 4.5 a month back, which is SOTA as a base model, and plans to release full o3 in maybe a month, and DeepSeek released the new V3, which is again SOTA in many respects.
In reality OpenAI is losing money per user.
Cost per token is tanking like crazy due to competition.
They guesstimate breaking even and then turning a profit in a couple of years.
Their guesses don't seem to account much for progress, especially on open-weight models.
Frankly I have no idea what they're thinking there; they can barely keep up with an investor-subsidized, unsustainable model.
For the small guys and everyone else, it'll probably be cost-neutral to keep paying OpenAI, Google, etc. directly, rather than paying some cloud provider to host an at-best on-par model at equivalent prices.
I've tried LibreChat before, but the app is terrible at generating titles for chats instead of leaving it as "New Chat". Also it lacks a working Code Interpreter.
For example, ChatGPT etc. self-host models on their own GPUs and can generate 10 tk/s or something.
Now there are Groq and Cerebras, who can do token generation at 4000 tk/s, but they pretty much require an open-source model.
So that is why I feel it's not really abiding by the true capitalist philosophy.
What I love about "open" models in general, and DeepSeek in particular, is how they undermine that market. The DeepSeek drops especially were fun to watch; they were like last-minute plot twists, like dropping an antibiotic into a petri dish full of bacteria. Sorry, try again with a better moat.
"Open" models are in fact the very thing enabling having a functioning market in this space.
But if it uses tools, for example web search to extend its knowledge, it could be used to exfiltrate information. It could do so by visiting specially crafted URLs that leak parts of your prompts (including the contents of documents added to them with RAG).
If given an interpreter, even a sandboxed one, it could try some kind of sabotage or "call home" with locally gathered information, disguised as safe "regular" code.
It's unlikely that a current model runnable on "domestic" hardware has those capabilities, but in the future these concerns will be more relevant.
That aside, QwQ-32 is amazingly smart for its size.
Many benchmarks test "capability" more than "knowledge". There are many use cases where the model gets all the necessary context in the prompt. There, a model with good capability for the use case will do fine (e.g. as good as GPT-4).
That same model might hallucinate when you ask about the plot of a movie while a larger model like GPT-4 might be able to recall better what the movie is about.
To be more precise, it's not that there's no decrease in quality, it's that with the RAM savings you can fit a much better model. E.g. with LLaMA, if you start with 70b and increasingly quantize, you'll still get considerably better performance at 3 bit than LLaMA 33b running at 8bit.
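The memory arithmetic behind that comparison, weights only (ignoring KV cache and runtime overhead):

```python
def weights_gb(params_b: float, bits: int) -> float:
    # billions of parameters * bits per weight / 8 bits per byte -> GB
    return params_b * bits / 8

# 70B at 3-bit is actually smaller in VRAM than 33B at 8-bit,
# yet (per the comment) still performs considerably better.
print(weights_gb(70, 3))  # 26.25 GB
print(weights_gb(33, 8))  # 33.0 GB
```

So for a fixed memory budget, more aggressively quantizing a larger model tends to beat running a smaller model at high precision.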
https://news.ycombinator.com/item?id=43121383
It would have to be from unsupervised tool usage or accepting backdoored code, not traditional remote execution from merely inferencing the weights.
If you are talking about DeepSeek's own hosted API service. It's because they deliberately decided to run the service in heavily overloaded conditions and have very aggressive batching policy to extract more out of their (limited) H800s.
Yes, for some reason (the reason I heard is "our boss doesn't want to run such a business", which sounds absurd, but /shrug) they refuse to scale up serving their own models.
So for them this is a case of insurance and hedging risks, not profit making.
The same applies to humans. Imagine a human who's only life involved reading books in a dark room, vs one who could see images vs one who can actually interact.
You can. Ask your friendly local IRS.
The thing is, a model is in effect a piece of software with almost zero marginal cost. You just need a few, maybe even one, company releasing SOTA models consistently to crash the valuation of every model company, because everyone can acquire that single piece of software at no cost and leave the other model companies behind. The foundational-model scene is basically in an extremely unstable state, ready to collapse into the stable state where model cost goes to zero. You really don't need the state-competition assumption to explain the current state of affairs.
Liang gave up the No. 1 Chinese hedge fund position to create AGI. He has a very good chance to short the entire US stock market and pocket some stupid amount of $ when R2 is released, and he has pretty much unlimited support from local and central Chinese government. Trying to make pennies from hosting models is not going to sustain what he enjoys now.
Technically there is also the 3090, which is more comparable price wise. I don't know about performance, though.
VRAM is supply-limited enough that going bigger isn't as easy as it sounds. AMD can probably sell as much as they can get their hands on, so they may as well sell more GPUs, too.
Local hosting on GPU only really makes sense if you're doing many hours of training/inference daily.
... or I can give you a one-liner that does all of that with uv.
Personally I heavily dislike the experience though, so I might not be the best one to answer.
The daily quotas I spoke about are on ChatGPT and Claude; those are very limited on usage (for free users at least, understandably), while on Qwen I feel like I'm abusing it with how much I use it. It's very versatile in the sense that it has image generation, video generation, a massive context window, and both visual and textual reasoning all in one place.
Alibaba is really doing something amazing here.
Tell it to use code if you want an exact answer. It should do that automatically, of course, and obviously it eventually will, but jeez, that's not a bad Fermi guess for something that wasn't designed to attempt such problems.
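A minimal sketch of what "use code for an exact answer" can look like as a tool: a tiny safe-arithmetic evaluator the model could call instead of guessing. The function name and the set of supported operators are my choice, purely illustrative:

```python
import ast
import operator as op

# Minimal "calculator tool" sketch: evaluate arithmetic exactly instead of
# letting the model estimate. Supports + - * / ** and unary minus only,
# via the ast module, so arbitrary code can't sneak in the way eval() allows.
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul,
       ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg}

def calc(expr: str):
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp):
            return OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

print(calc("2**127"))  # exact big-integer result, no Fermi estimation
```

Routing anything numeric through a tool like this sidesteps the whole LLM-as-calculator problem.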
If I try other models, I basically end up with a very bad version of AI. Even if I'm someone who uses Anthropic APIs a lot, it's absolutely not worth it to try and self host it. The APIs are much better and you get much cheaper results.
Self hosting for AI might be useful for 0.001% of people honestly.
That seems based on a very weird idea of what capitalism and communism are; idealized free markets have very little to do with the real-world economic system for which the name “capitalism” was coined. And dis-integration where “everyone does one thing” has little to do with either capitalism or free markets, though it might be a convenient assumption for 101-level discussions of market competition, where you want to avoid real-world issues like partially-overlapping markets and imperfect substitutes by assuming every good exists in an isolated market, competing only and exactly with the other goods in that same market in a simple way.
Also "many hours of inference daily" may mean you're doing your usual stuff daily while running some processing in the background that takes hours/days or you've put together some reactive automation that runs often all the time.
ps. local training rarely makes sense.
ps. 2. not sure where you got 50x slower from; 4090 is actually faster than A100 for example and 5090 is ~75% faster than 4090
Given as you say the long term cost of AI models is marginally zero, I don't think this is a bad position to be in.
Helen Keller still learned robust generalizations.