I'm not able to get my agentic system to use this model, though; it just says "I don't have the tools to do this". I tried modifying various agent prompts to explicitly say "use foo tool to do bar", without any luck yet. All of the ToolSpecs I use are properly annotated Pydantic objects, and every other model has figured out how to use these tools.
I find that on my M2 Mac that number is a rough approximation to how much memory the model needs (usually plus about 10%) - which matters because I want to know how much RAM I will have left for running other applications.
Anything below 20GB tends not to interfere with the other stuff I'm running too much. This model looks promising!
"Apple Intelligence" isn't it but it would be nice to know without churning through tests whether I should bother keeping around 2-3 models for specific tasks in ollama or if their performance is marginal there's a more stable all-rounder model.
This is still too much; a single 4090 costs $3k.
What a ripoff, considering that a 5090 with 32GB of VRAM also currently costs $3k ;)
(Source: I just received the one I ordered from Newegg a week ago for $2919. I used hotstocks.io to alert me that it was available, but I wasn’t super fast at clicking and still managed to get it. Things have cooled down a lot from the craziness of early February.)
I hope not. Mine was $1700 almost 2 years ago, and the 5090 is out now...
P.S. I am not a lawyer.
[0] - https://github.com/ggml-org/llama.cpp
[1] - https://lmstudio.ai/
I am hopeful that the prices will drop a bit more with Intel's recently announced Arc Pro B60 with 24GB VRAM, which unfortunately has only half the memory bandwidth of the RTX 3090.
Not sure why other hardware makers are so slow to catch up. Apple really was years ahead of the competition with the M1 Ultra with 800 GB/s memory bandwidth.
Also, Mistral has been killing it with their most recent models. I pay for Le Chat Pro; it's really good. Mistral Small is really good too. I'm also building a startup with Mistral integration.
I haven't tried it out yet but every model I've tested from Mistral has been towards the bottom of my benchmarks in a similar place to Llama.
Would be very surprised if the real life performance is anything like they're claiming.
Wouldn't mind some of my taxpayer money flowing towards Apache/MIT-licensed models.
Even if just to maintain a baseline alternative & keep everyone honest. Seems important that we don't have some large megacorps run away with this.
There's context length, but then, how does that relate to input length and output length? Should I just make the numbers match? 32k is 32k? Any pointers?
Just for ollama, see: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-c...
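If I'm reading that FAQ right, the usual route is a Modelfile that overrides the context window (the tag name here is just an example, and I haven't tested this myself):

```
# Modelfile: raise the context window for a custom tag
FROM devstral
PARAMETER num_ctx 32768
```

then `ollama create devstral-32k -f Modelfile`. There's also a per-request `num_ctx` option in the API's `options` object.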
I’m using llama.cpp though, so I can’t confirm these methods.
Interesting. I've never heard this.
To determine how much space a model needs, look at the size of the quantized (lower-precision) model on HuggingFace or wherever it's hosted. Q4_K_M is a good default. As a rough rule of thumb, the file will be a little over half the parameter count, read as gigabytes. For Devstral, that's 14.3GB. You will also need 1-8GB on top of that to store the context.
For example: A 32GB Macbook Air could use Devstral at 14.3+4GB, leaving ~14GB for the system and applications. A 16GB Macbook Air could use Gemma 3 12B at 7.3+2GB, leaving ~7GB for everything else. An 8GB Macbook could use Gemma 3 4B at 2.5GB+1GB, but this is probably not worth doing.
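That rule of thumb is easy to script. A tiny sketch, where the 0.6 GB-per-billion-parameters factor is my own reading of "a little over half" (not an official number), and the context budget is the 1-8GB range above:

```python
def estimate_ram_gb(params_billions: float, context_gb: float = 2.0) -> float:
    """Rough RAM estimate for a Q4_K_M quant: ~0.6 GB per billion
    parameters for the weights, plus headroom for the context."""
    return params_billions * 0.6 + context_gb

# Devstral (~24B params) with a generous 4 GB context budget:
print(round(estimate_ram_gb(24, 4), 1))  # → 18.4
```

On a 32GB machine that leaves roughly 14GB for everything else, matching the numbers above.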
My general impression so far is that they aren't quite up to Claude 3.7 Sonnet, but they're quite good. More than adequate for an "AI pair coding assistant", and suitable for larger architectural work as long as you break things into steps for it.
I’ve been using Cursor and I’m kind of disappointed. I get better results just going back and forth between the editor and ChatGPT
I tried localforge and aider, but they are kinda slow with local models
Try hooking aider up to gemini and see how the speed is. I have noticed that people in the localllama scene do not like to talk about their TPS.
It's kind-of like asking, for which kind of road-trip would you use a Corolla hatchback instead of a Jeep Grand Wagoneer? For me the answer would be "almost all of them", but for others that might not be the case.
However, I've also run into two things: 1) most models don't support tools, and it's sometimes hard to find a version of a model that correctly uses them; 2) even with good TPS, since the agents are usually doing chain-of-thought and running multiple chained prompts, the experience feels slow. This is true even with Cursor using their models/APIs.
But do we need 20 companies copying each other and doing the same thing?
Like, is that really competition? I'd say competition is when you do something slightly different, but I guess it's subjective based on your interpretation of what is a commodity and what is proprietary.
To my view, everyone is outright copying and creating commodity markets:
OpenAI: The OG, the Coke of Modern AI
Claude: The first copycat, The Pepsi of Modern AI
Mistral: Euro OpenAI
DeepSeek: Chinese OpenAI
Grok/xAI: Republican OpenAI
Google/MSFT: OpenAI clone as a SaaS or Office package.
Meta's Llama: Open Source OpenAI
etc...
total duration:       35.016288581s
load duration:        21.790458ms
prompt eval count:    1244 token(s)
prompt eval duration: 1.042544115s
prompt eval rate:     1193.23 tokens/s
eval count:           213 token(s)
eval duration:        33.94778571s
eval rate:            6.27 tokens/s

total duration:       4m44.951335984s
load duration:        20.528603ms
prompt eval count:    1502 token(s)
prompt eval duration: 773.712908ms
prompt eval rate:     1941.29 tokens/s
eval count:           1644 token(s)
eval duration:        4m44.137923862s
eval rate:            5.79 tokens/s
All I'm saying is that, compared to an API call that finishes in about 20% of the time, it feels a bit slow without the recommended graphics card and whatnot.
In terms of benchmarks, it seems unusually well tuned for its model size, but I suspect that's just a case of gaming the measurement by testing against the benchmarks during development. That isn't bad in and of itself; I suspect every LLM vendor in this space marketing to IT folks does the same thing. So it's still objective enough as a rough gauge of "is this usable?" without a heavy time investment in testing.
Qwen3 is a step backwards for me, for example. And GLM4 is my current go-to, despite everyone saying it's "only good at HTML".
The 70b cogito model is also really good for me but doesn't get any attention.
I think it depends on the projects and languages we're using.
Still looking forward to trying this one though :)
Some AIs will be good at coding (perhaps in a particular language or ecosystem), some at analyzing information and churning out a report for you, and some will be better at operating in physical spaces.
There is no single "best" model yet, it seems.
That's on an M4 Max with 64GB of RAM. I wish I had gotten the 128GB model, though — given that I run large docker containers that consume ~24GB of my RAM, things can get tight.
The same page also gives instructions for running the model through vLLM on a GPU, but that path doesn't seem to support quantization, so it may require multiple GPUs (the instructions say "with at least 2 GPUs").
For local LLMs Apple Silicon has really shown the value of shared memory, even if that comes at the cost of raw GPU power. Even if it's half the speed of an array of GPUs, being able to load the mid-sized models at all is a huge plus.
And ollama keeps taking it out of memory every 4 minutes.
LM studio with MLX on Mac is performing perfectly and I can keep it in my ram indefinitely.
Ollama's keep-alive is broken: a new REST API call resets it. I'm surprised it's this glitchy with longer-running calls and a custom context length.
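For reference, these are the knobs that are supposed to control it, per the ollama FAQ (model name is just an example); in my experience the per-request setting still gets clobbered by subsequent calls:

```
# Server-wide default: keep models loaded indefinitely
OLLAMA_KEEP_ALIVE=-1 ollama serve

# Per-request: pin this model in memory
curl http://localhost:11434/api/generate \
  -d '{"model": "devstral", "prompt": "hi", "keep_alive": -1}'
```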
LM studio MLX with full 128k context.
It works well but has a long 1 minute initial prompt processing time.
I wouldn’t buy a laptop for this, I would wait for the new AMD 32gb gpu coming out.
If you want a laptop I even consider my m4 max too slow to use more than just here or there.
It melts if you run this, and the battery drains ASAP. You have to use it docked for full speed, really.
What're you using for this? llama.cpp? Have a 12GB card (rtx 4070) i'd like to try it on.
I believe it's just an HTTP wrapper and terminal wrapper around llama.cpp, with some modifications (or a fork).
That's obviously not true. Ethics often have some nuance and some subjectiveness, but it's not something entirely subjective up to "politics".
Saying this makes it sound like you work at a startup for an AI-powered armed drone, and your view of it is 'eh, ethics is subjective, this is fine' when asked how you feel about responsibility and AI killing people.
Ethics are entirely subjective, as is inherently true of anything that supports "should" statements: to justify any "should" statement, you need another "should" statement; you can never rest a "should" entirely on an "is". (You can, potentially, rest an entire system of "should" on one root "should" axiom, though in practice most systems have more than one root axiom.)
And the process of coming to social consensus on a system of ethics is precisely politics.
You can dislike that this is true, but it is true.
> Saying this makes it sound like you work at a startup for an AI-powered armed drone, and your view of it is 'eh, ethics is subjective, this is fine' when asked how you feel about responsibility and AI killing people.
Understanding that ethics is subjective does not mean that one does not have a strong ethical framework that they adhere to. It just means that one understands the fundamental nature of ethics and the kind of propositions that ethical propositions inherently are.
Understanding that ethics are subjective does not, in other words, imply the belief that all beliefs about ethics (or, a fortiori, matters that are inherently subjective more generally) are of equal moral/ethical merit.
I tested this model with several of my Clojure problems and it is significantly worse than qwen3:30b-a3b-q4_K_M.
I don't know what to make of this. I don't trust benchmarks much anymore.
https://www.reddit.com/r/ollama/comments/1df757o/high_cost_o...
https://github.com/ollama/ollama/issues/8291
Yes.
AFAICT they usually set the default tag to a version around 15GB.
Early reports from reddit say that it also works in cline, while other stronger coding models had issues (they were fine-tuned more towards a step-by-step chat with a user). I think this distinction is important to consider when testing.
I am currently using this model on a Macbook with 16GB of RAM. It is hooked up to a Chrome extension that extracts text from webpages and logs it to a file, then summarizes each page. I want to develop an episodic memory system, like MS Recall, but local, so it does not leak my data to anyone else and costs me nothing.
Gemma 3 4B runs under ollama and is light enough that I don't feel it while browsing. Summarization happens in the background. This page I am on is already logged and summarized.
Good luck to you mate with your life :)
It works, but the tokens-per-second rate is very low. It did complete a TypeScript task example succinctly.
As an AI and vibe coding newbie, how does that work? E.g. how would I use devstral and ollama and instruct it to use tools? Or would I need some other program as well?
Maybe it's specialized to use just a few very specific tools? Is there some documentation on how to actually set it up without requiring some weird external platform?
Most of this is handled very easily by the ollama-python library, so you can integrate tool calling very simply in any script.
That said, this specific model was unable to call the functions and use the results in my "hello world" tests, so it seems it expects a few very specialized tools to be provided, which are defined by that platform they're advertising.
Right now the best tool calling model I've used is still qwen3, it works very reliably, and I can give it any ability I want and it'll use it when expected, even in /no_think mode.
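For the newbie question above: here's a minimal sketch of what that looks like with the ollama Python library. The `count_words` tool is a toy example of mine, and this assumes a local ollama server with a tool-capable model (like qwen3) pulled:

```python
def count_words(text: str) -> int:
    """Toy tool the model can choose to call."""
    return len(text.split())

# JSON-schema style tool spec, in the shape the Ollama chat API expects.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "count_words",
        "description": "Count the words in a piece of text.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}]

def ask(prompt: str) -> None:
    import ollama  # pip install ollama; needs a running ollama server
    resp = ollama.chat(
        model="qwen3",
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
    )
    # Execute whatever tool calls the model asked for.
    for call in resp.message.tool_calls or []:
        if call.function.name == "count_words":
            print(count_words(**call.function.arguments))

# ask("How many words are in 'so long and thanks for all the fish'?")
```

The agent frameworks mostly do this same loop for you: send tool schemas, read back `tool_calls`, run them, and feed the results into a follow-up message.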
People also use 3.5.1 to refer to 3.5(new)/3.6.
The remaining difficulty now is when people refer to 3.5, without specifying (new) or (old). I find most unspecified references to 3.5 these days are actually to 3.6 / 3.5.1 / 3.5(new), which is confusing.
Mistral's positioning as the European alternative doesn't seem to be sticking. Acquisition seems tricky, given how Inflection, Character.ai, and Stability got carved out. The big acquisition bucks are going to product companies (Windsurf).
They could pivot up the stack, but then they'd be starting from scratch with a team that's ill-suited for product development.
The base model offerings from pretraining companies have been surprisingly myopic. Deepmind seems to be the only one going past the obvious "content gen/coding automation" verticals. There's a whole world out there. LLM product companies are fast acquiring pieces of the real money pie and smaller pretraining companies are getting left out.
______
edit: my comment rose to the top. It's early in the morning. Injecting a splash of optimism.
LLMs are hard, and even giants like Meta are struggling to make steady progress. Mistral's models are cheap, competent, open-source-ish, and don't come with AGI-is-imminent baggage. Good enough for me.
To my own question: They have a list of target industries at the top. https://mistral.ai/solutions#industry
Good luck to them.
Is it always wrong to kill people? If you say yes, then you are also saying it's wrong to defend yourself from people who are trying to kill you.
This is what I mean by subjective.
And then since Google is beholden to US laws, if the US government suddenly decides that helping Ukraine to defend itself is wrong, but you personally believe defending Ukraine is right, suddenly you have a problem...
Separately, deploying more-autonomous agents that just look at an issue and run with it seems premature right now. We've only just gotten assisted flows kind-of working, and they still get lost (stuck on not-hard-for-a-human debugging tasks, implementing Potemkin 'fixes', forgetting their tools, making unrelated changes that sometimes break stuff, etc.) in ways that imply that flow isn't fully baked yet.
Maybe the main appeal is asynchrony/potential parallelism? You could tackle that different ways, though. And SWEBench might be a good benchmark still (focus on where you want to be, even if you aren't there yet), but that doesn't mean it represents the most practical way to use these tools day-to-day currently.
Edit: I should point out that I had many other things open at the time. Mail, Safari, Messages, and more. I imagine startup would be quicker otherwise but it does mean you can run with less than 32GB.
Which model is optimized to do that? This is what I want out of LLMs! And also talking high level architecture (without any code) and library discovery, but I guess the general talking models are good for that...
"Take items from `input-ch` and group them into `batch-size` vectors. Put these onto `output-ch`. Once items
start arriving, if `batch-size` items do not arrive within `inactivity-timeout`, put the current incomplete
batch onto `output-ch`. If an anomaly is received, passes it on to `output-ch` and closes all channels. If
`input-ch` is closed, closes `output-ch`.
If `flush-predicate-fn` is provided, it will get called with two parameters: the currently accumulated
batch (guaranteed to have at least one item) and the next item. If the function returns a truthy value, the
batch will get flushed immediately.
If `convert-batch-fn` is provided, it will get called with the currently accumulated batch (guaranteed to
have at least one item) and its return value will be put onto `output-ch`. Anomalies bypass
`convert-batch-fn` and get put directly onto `output-ch` (which gets closed immediately afterwards)."
In other words, not obvious. I ask the model to review the code and tell me if there are improvements that can be made. Big (online) models can do a pretty good job with the floating-point equality function, and suggest something at least in the ballpark for the async code. Small models rarely get everything right, but some of their observations are good.
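For anyone curious what that docstring is describing: the core behaviour (minus anomalies and the optional predicate/convert fns) fits in a few lines of plain Python with queues. `None` stands in for channel close, and all names here are mine, not from the original code:

```python
import queue

def batcher(input_q, output_q, batch_size, inactivity_timeout):
    """Group items from input_q into lists of batch_size and put them on
    output_q. Flush a partial batch after inactivity_timeout seconds of
    silence; a None item means input closed, so flush and close output."""
    batch = []
    while True:
        try:
            # Only start the inactivity clock once a batch has begun.
            item = input_q.get(timeout=inactivity_timeout if batch else None)
        except queue.Empty:
            output_q.put(batch)   # inactivity: flush the partial batch
            batch = []
            continue
        if item is None:          # input channel closed
            if batch:
                output_q.put(batch)
            output_q.put(None)    # close the output channel
            return
        batch.append(item)
        if len(batch) == batch_size:
            output_q.put(batch)
            batch = []
```

The core.async version has to juggle the same three wake-up sources (new item, timeout, closed channel), which is exactly the part the small models tend to fumble.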
Don't they supposedly have to have the item in Amazon's warehouse to sell it?