Most active commenters
  • AnthonyMouse(10)
  • foobiekr(8)
  • yumraj(7)
  • (5)
  • acdha(4)
  • logicchains(4)
  • bugglebeetle(3)
  • iaw(3)
  • easygenes(3)
  • slt2021(3)

255 points tbruckner | 147 comments
1. regularfry ◴[] No.37420089[source]
4-bit quantised model, to be precise.

When does this guy sleep?

replies(5): >>37420405 #>>37420746 #>>37421224 #>>37421721 #>>37423012 #
2. homarp ◴[] No.37420186[source]
https://www.reddit.com/r/LocalLLaMA/comments/16bynin/falcon_... has some more data like sample answers with various level of quantizations

and https://huggingface.co/TheBloke/Falcon-180B-Chat-GGUF if you want to try

3. logicchains ◴[] No.37420200[source]
Pretty amazing that in such a short span of time we went from people being amazed at how powerful GPT-3.5 was on its release to people being able to run something equivalently powerful locally.
replies(1): >>37432698 #
4. sbierwagen ◴[] No.37420216[source]
The screenshot shows a working set size of 147,456 MB, so he's using the Mac Studio with 192 GB of RAM?
replies(1): >>37420327 #
5. pella ◴[] No.37420249[source]
Is this an M2 Ultra with 192 GB of unified memory, or the standard version with 64 GB of unified memory?
replies(1): >>37420741 #
6. ◴[] No.37420327[source]
7. rvz ◴[] No.37420331[source]
Totally makes sense to use C++- or Rust-based AI models for inference instead of the over-bloated networks run on Python with sub-optimal inference and fine-tuning costs.

Minimal-overhead or zero-cost abstractions around deep learning libraries implemented in those languages give some hope that people like ggerganov are not afraid of the 'don't roll your own deep learning library' dogma, and now we can see the results as to why DL on the edge and local AI is the future of efficiency in deep learning.

We'll see, but Python just can't compete on speed at all; hence Modular's Mojo compiler is another one that solves the problem properly with almost 1:1 familiarity with Python.

replies(5): >>37420484 #>>37420605 #>>37420734 #>>37421354 #>>37422072 #
8. ◴[] No.37420405[source]
9. adam_arthur ◴[] No.37420461[source]
Even a linear growth rate of average RAM capacity would obviate the need to run current SOTA LLMs remotely in short order.

Historically average RAM has grown far faster than linear, and there really hasn't been anything pressing manufacturers to push the envelope here in the past few years... until now.

It could be that LLM model sizes keep increasing such that we continue to require cloud consumption, but I suspect the sizes will not increase as quickly as hardware for inference.

Given how useful GPT-4 is already, maybe one more iteration would unlock the vast majority of practical use cases.

I think people will be surprised that consumers ultimately end up benefitting far more from LLMs than the providers. There's not going to be much moat or differentiation to defend margins... more of a race to the bottom on pricing

replies(8): >>37420537 #>>37420948 #>>37421196 #>>37421214 #>>37421497 #>>37421862 #>>37421945 #>>37424918 #
10. ◴[] No.37420484[source]
11. sbierwagen ◴[] No.37420490[source]
M2 Mac Studio with 192GB of RAM is US$5,599 right now.
replies(3): >>37420616 #>>37420693 #>>37427799 #
12. tomohelix ◴[] No.37420537[source]
RAM is easy. The hard part is making a unified-memory SoC like Apple's. From what I know, Apple's performance is almost magic. And whatever Apple is making, they are at peak capacity already and can't make more even if they want to. Nobody else has comparable technology. Apple is in its own league.
replies(1): >>37426757 #
13. randomopining ◴[] No.37420598[source]
Are there any actual use cases for running this stuff on a local computer? Or are most of these models actually suited to run on remote clusters?
replies(3): >>37421417 #>>37421858 #>>37423070 #
14. brucethemoose2 ◴[] No.37420605[source]
The actual inference is not run in Python even in PyTorch, and it's usually not bottlenecked by it.

The problem is CUDA, not Python.

LLMs are uniquely suited to local inference in projects like GGML because they are so RAM-bandwidth heavy (and hence relatively compute-light), and relatively simple. Your kernel doesn't need to be hyper-optimized by 35 Nvidia engineers in 3 stacks before it's fast enough to start saturating the memory bus generating tokens.

And yet it's still an issue... For instance, llama.cpp is having trouble getting prompt ingestion performance in a native implementation comparable to cuBLAS, even though they theoretically have a performance advantage by using the quantization directly.
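
To put rough numbers on that (a back-of-the-envelope sketch; the bandwidth and model-size figures are illustrative assumptions, not measurements):

  # Single-stream token generation is roughly memory-bandwidth bound:
  # every new token streams (approximately) all the weights from RAM once.
  def tokens_per_second(model_bytes, bandwidth_bytes_per_s):
      """Idealized ceiling: ignores KV-cache reads, compute and overhead."""
      return bandwidth_bytes_per_s / model_bytes

  model_bytes = 100e9   # assumed ~100 GB for a 4-bit quant of a 180B model
  bandwidth   = 800e9   # assumed ~800 GB/s of unified memory bandwidth

  print(f"~{tokens_per_second(model_bytes, bandwidth):.0f} tokens/s ceiling")
  # ~8 tokens/s as an upper bound; real numbers land below this. Prompt
  # ingestion is the compute-bound part, which is where cuBLAS still wins.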

15. ◴[] No.37420616{3}[source]
16. piskov ◴[] No.37420725{4}[source]
You do understand that you can connect Thunderbolt external storage (not just a USB3 one)?
replies(1): >>37420784 #
17. survirtual ◴[] No.37420734[source]
Python is generally just the glue language for underlying, highly optimized C++ libs. The improvements aren't just about languages. I would imagine Facebook is less focused on inference, so didn't bother to make a highly optimized LLM inference engine. There also just isn't a business case for CPU-bound LLMs at an enterprise scale, so why code for that? Additionally, llama.cpp can be called from Python, and Python could still do all the glue.

There is no language war. Use whatever tool is necessary to achieve effective results for accomplishing the mission.
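
For example, a minimal sketch assuming the third-party llama-cpp-python bindings (the model path and parameters are illustrative):

  # Python stays the glue; the heavy lifting happens in llama.cpp's C/C++ code.
  from llama_cpp import Llama

  llm = Llama(
      model_path="./falcon-180b-chat.Q4_K_M.gguf",  # placeholder filename
      n_ctx=2048,         # context window
      n_gpu_layers=-1,    # offload as many layers as possible to the GPU/Metal
  )

  out = llm("Q: Why is local inference memory-bandwidth bound?\nA:", max_tokens=128)
  print(out["choices"][0]["text"])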

18. zargon ◴[] No.37420741[source]
4-bit quantized 180B will not fit in 64GB. You'll need over 100 GB for that.
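
Back-of-the-envelope, assuming roughly 4.5 bits per weight for a 4-bit GGUF-style quant (block scales add overhead on top of the nominal 4 bits):

  params = 180e9
  bits_per_weight = 4.5   # assumption: nominal 4-bit quant plus block scales
  print(f"{params * bits_per_weight / 8 / 1e9:.0f} GB")   # ~101 GB, before KV cache and runtime overhead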
19. ◴[] No.37420746[source]
20. diffeomorphism ◴[] No.37420784{5}[source]
That does not really make 1tb of non-upgradable storage in a $5k+ device any less ridiculous though.
replies(1): >>37420864 #
21. yumraj ◴[] No.37420789{4}[source]
It’s not useless.

It seems a Thunderbolt/USB4 external NVMe enclosure can do about 2500-3000 MB/s, which is about half the speed of the internal SSD. So not at all bad. It'll just add a few tens of seconds while loading the model. Totally manageable.

Edit: in fact this is the proper route anyway since it allows you to work with huge model files and intermediate FP16/FP32 files while quantizing. Internal storage, regardless of how much, will run out quickly.

replies(1): >>37420889 #
22. yumraj ◴[] No.37420864{6}[source]
That is true, but a whole separate discussion.

It applies to RAM too. My 32GB Mac Studio seemed pretty good before the LLMs.

23. superkuh ◴[] No.37420889{5}[source]
>Internal storage, regardless of how much, will run out quickly.

This only applies to Macs and Mac-a-likes. Actual desktop PCs have many SATA ports and can store reasonable amounts of data without the crutch of external high latency storage making things iffy. I say this as someone with TBs of llama models on disk and I do quantization myself (sometimes).

BTW my computer cost <$900 w/17TB of storage currently and can run up to a 34B 5-bit LLM. I could spend $250 more to upgrade to 128GB of DDR4 2666 RAM and run the 65B/70B, but 180B is out of range. You do have to spend big money for that.

replies(4): >>37421057 #>>37421079 #>>37421096 #>>37422593 #
24. ls612 ◴[] No.37420948[source]
For me the test is: when will a Siri-LLM be able to run locally on my iPhone at at least GPT-4 levels? 2030? Farther out? Never, because of governments forbidding it? To what extent will improvements be driven by the last gasps of Moore’s Law vs by improving model architectures to be more efficient?
replies(3): >>37420983 #>>37421670 #>>37422133 #
25. bananapub ◴[] No.37420958{4}[source]
> gets price wrong

> is corrected

> pivots to some weird rant about 1TB of extremely high performance flash being "useless"

wouldn't it have saved time to just not post either of these comments?

replies(2): >>37421235 #>>37421807 #
26. YetAnotherNick ◴[] No.37420966[source]
You just need 4 3090s ($4,000) to run it. And 4 3090s are definitely a lot more powerful and versatile than an M2 Mac.
replies(3): >>37421025 #>>37421444 #>>37424271 #
27. adam_arthur ◴[] No.37420983{3}[source]
Given that phones are a few years behind PCs on RAM, likely whenever the average PC can do it, plus a few years. There are phones out there with 24GB of RAM already, it looks like.

Of course battery life would be a concern there, so I think LLM usage on phones will remain in the cloud.

Haven't studied phone RAM capacity growth rates in detail though

replies(2): >>37421363 #>>37425019 #
28. yumraj ◴[] No.37421025{3}[source]
How much would that system cost, if you could easily buy those GPUs?
replies(2): >>37421217 #>>37421964 #
29. andromeduck ◴[] No.37421057{6}[source]
Who TF is still using SATA with SSDs?!
30. yumraj ◴[] No.37421079{6}[source]
We’re talking about 192GB of GPU accessible memory here.

Or are you comparing with CPU inference? In which case apples-oranges.

How much do GPUs with 192GB of RAM cost?

Edit: also I think (unverified) very, very few systems have multiple PCIe 3/4 NVMe slots. There are companies with PCIe cards that can take NVMe drives, but that'll in itself cost, without the NVMe drives, more than your $900 system.

replies(1): >>37421909 #
31. LTL_FTC ◴[] No.37421096{6}[source]
“external USB3 SSD... slowly” So which is it? SATA ports aren't exactly faster than USB3. If you want speed you need PCIe drives, not SATA. Thunderbolt is a great solution. Plus, my network storage sustains 10Gb networking. There are other avenues.
32. tiffanyh ◴[] No.37421142[source]

  system_info: n_threads = 4 / 24
Am I seeing correctly in the video that this ran on only 4 threads?
replies(1): >>37422416 #
33. ramesh31 ◴[] No.37421196[source]
>I think people will be surprised that consumers ultimately end up benefitting far more from LLMs than the providers. There's not going to be much moat or differentiation to defend margins... more of a race to the bottom on pricing

Should be pointed out that this didn't just happen out of thin air. These open models still cost millions of dollars to create. Meta let the genie out of the bottle, but it won't be free forever.

replies(1): >>37421344 #
34. MuffinFlavored ◴[] No.37421214[source]
> Given how useful GPT-4 is already, maybe one more iteration would unlock the vast majority of practical use cases.

Unless I'm misunderstanding, doesn't OpenAI have a very vested interest in keeping their products so good/so complex/so large that consumer hobbyists can't just `git clone` an alternative that's 95% as good running locally?

replies(3): >>37421454 #>>37421498 #>>37421783 #
35. PartiallyTyped ◴[] No.37421217{4}[source]
Lanes will probably be an issue, so a Threadripper Pro or an EPYC CPU; add half a grand at least for the motherboard and it’s starting to look grim.
replies(1): >>37421346 #
36. ramesh31 ◴[] No.37421224[source]
>When does this guy sleep?

I don't think he has since July.

37. oefrha ◴[] No.37421235{5}[source]
Note to other commenters: a simple “sorry, I was wrong” is more graceful and less embarrassing.
38. logicchains ◴[] No.37421344{3}[source]
>These open models still cost millions of dollars to create. Meta let the genie out of the bottle, but it won't be free forever.

This particular model was funded by the UAE government. If they could do it, it should be similarly possible for a western government to create and release one as a public good.

39. thfuran ◴[] No.37421346{5}[source]
And that's before you even get your first power bill.
replies(2): >>37421385 #>>37422182 #
40. PartiallyTyped ◴[] No.37421354[source]
Python is not really the bottleneck in LLM applications. It is for tabular RL, but certainly not for deep RL (I have had discussions with DM folk over this in r/RL, and with the people from Stable Diffusion).

The problem is the bus, CUDA, and the sheer volume of data that needs to be transferred.

PyTorch itself is actually a wrapper around libtorch, which is written in C++.

The compilation step of PyTorch 2.0 provides a sizeable improvement, but not the 2 orders of magnitude you’d expect from Python-to-C++ migrations. The gains from compilation come from the backend more so than from Python itself. See Triton for example.
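
For reference, a minimal sketch of what that step looks like (the toy model is purely illustrative; the gains come from the compiled backend kernels, e.g. Triton, not from bypassing Python):

  import torch
  import torch.nn as nn

  # Toy model, purely for illustration.
  model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))

  # PyTorch 2.x captures the graph and compiles it once; later calls run the
  # generated kernels, so Python overhead is amortized rather than eliminated.
  compiled = torch.compile(model)

  x = torch.randn(8, 4096)
  y = compiled(x)   # first call triggers compilation, subsequent calls are fast
  print(y.shape)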

41. nico ◴[] No.37421363{4}[source]
That’s for LLMs, but at the same time, there are other types of models coming out

Wouldn’t be surprised if we get small models that can run locally on a phone and just retrieve data from the network as needed (without sending your data out), within the next couple of years

42. PartiallyTyped ◴[] No.37421385{6}[source]
hey, at least you will cut down on the heating costs!
43. logicchains ◴[] No.37421417[source]
The use-case is you want to generate pornographic, violence-depicting or politically-incorrect content, and would rather buy a powerful computer than rent a server (or you already own a powerful computer).
replies(2): >>37423085 #>>37440704 #
44. mk_stjames ◴[] No.37421444{3}[source]
The data buffer size shown by Georgi here is 96GB, plus there is the other overhead; it states the recommended max working set size for this context is 147GB, so no, Falcon 180B in Q4 as shown wouldn't fit on 4x 24GB 3090s (96GB VRAM).

But I'm also in the quad-3090 build idea stage as well and bought 2 with the intention to go up to 4 eventually for this purpose. However, since I bought my first 2 a few months back (at about 800 euro each!) the ebay prices have actually gone up... a lot; I purchased a specific model that I thought would be plentiful as I had found a seller with a lot of them from OEM pulls, and they were taking good offers- and suddenly they all got sold. I feel like we are entering another GPU gap like 2020-2021.

Based on the performance of Llama2 70B, I think 96GB of vram and the cuda core count x bandwidth of 4 3090's will hit a golden zone as far as price-performance of a deep learning rig that can do a bit of finetuning on top of just inference.

Unless A6000 prices (or A100 prices) start plummeting.

My only hold out is the thought that maybe nvidia releases a 48gb Titan-type card at a less-than-A6000 price sometime soon, which will shake things up.

replies(2): >>37421913 #>>37452138 #
45. Frannyies ◴[] No.37421454{3}[source]
They have a huge cost incentive to optimize it for runtime.

The magic of OpenAI is their training data and architecture.

There is a real risk that a model gets leaked.

replies(1): >>37421998 #
46. cs702 ◴[] No.37421497[source]
I agree: No one has any technological advantage when it comes to LLMs anymore. Some companies, like OpenAI, may have other advantages, like an ecosystem of developers. But most of the gobs of money that so many companies have burned to train giant proprietary models is unlikely to see any payback.

What I think will happen is that more companies will come to the realization it's in their best interest to open their giant models. The cost of training all those giant models is already a sunk cost. If there's no profit to be made by keeping a model proprietary, why not open it to gain or avoid losing mind-share, and to mess with competitors' plans?

First, it was LLaMA, with up to 65B params, opened against Meta's wishes. Then, it was LLaMA 2, with up to 70B params, opened by Meta on purpose, to mess with Google's and Microsoft/OpenAI's plans. Now, it's Falcon 180B. Like you, I'm wondering, what comes next?

replies(4): >>37421627 #>>37422256 #>>37424763 #>>37429907 #
47. chongli ◴[] No.37421498{3}[source]
What is OpenAI's moat? Loads of people outside the company are working on alternative models. They may have a lead right now but will it last a few years? Will it even last 6 months?
replies(4): >>37421647 #>>37421649 #>>37421665 #>>37422380 #
48. bugglebeetle ◴[] No.37421627{3}[source]
I think it’s the opposite. Models will become more commoditized and closed/invisible as the basis of other service offerings. Apple isn’t going to start offering general API access to the model they’re training, but will bake it into a bunch of stuff and maybe give platform developers limited access. Meta will probably continue to drive the commoditization train because they have a killer ML/AI team, but the same thing will likely happen there once it’s the basis for a service that generates money.
replies(2): >>37422273 #>>37422892 #
49. yumraj ◴[] No.37421647{4}[source]
> What is OpenAI's moat?

There’s none. Which is why Sam Altman has been crying wolf, in hope of regulatory barriers which can provide it the moat.

50. MuffinFlavored ◴[] No.37421649{4}[source]
> What is OpenAI's moat?

From what I understand, if you take the absolute best cutting edge LLM with the most parameters and the most up to date model from GitHub/HuggingFace/whatever, it's very far off from the output you get from GPT-3.5 / GPT-4

aka full of hallucinations, not very useful

I don't know if this is the right way to look at it, but if what George Hotz said about GPT-4 simply being "8 220B parameter models glued together by something called a mixture-of-experts" is true, then from what I understand, OpenAI's moat is:

their access/subsidized cost to GPUs/infrastructure with Microsoft

the 8 220B models they have are really good/I don't think anything open source matches them/nobody can download "all of Reddit/Twitter/Wikipedia/StackOverflow/whatever else they trained on" anymore like they could given how everybody wants to protect/monetize their content now

and then the "router" / "MoE" piece seems to be something missing from open source offerings as well

replies(3): >>37421962 #>>37422981 #>>37426850 #
51. ben_w ◴[] No.37421665{4}[source]
OpenAI's "moat" is basically the same as Adobe's or Microsoft's, give or take a metaphor, for Photoshop or Office.

Although see last week for previous responses: https://news.ycombinator.com/item?id=37333747

52. bugglebeetle ◴[] No.37421670{3}[source]
Apple is already training their own LLM to rival GPT-4, so I doubt it will take that long.
53. esafak ◴[] No.37421721[source]
With his name recognition he could easily raise over $10m in funding for a seed round and sleep well, if he wanted.
replies(1): >>37423055 #
54. m3kw9 ◴[] No.37421727[source]
OpenAI's moat will soon largely be UX. Anyone can do plugins, code, etc., but when it comes to everyday users, the best UX wins once LLMs become commodified. Just look at standalone digital cameras vs mobile phone cams from Apple.
replies(3): >>37422623 #>>37423170 #>>37424079 #
55. reckless ◴[] No.37421783{3}[source]
Indeed they do; however, companies like Meta (altruistically or not) are preventing OpenAI from building 'moats' by releasing models and architecture details in a very public way.
replies(2): >>37422263 #>>37422288 #
56. m3kw9 ◴[] No.37421791[source]
I'd just rather pay $20 a month to get a share of 10,000 H100s running the benchmark LLM instead.
57. acdha ◴[] No.37421807{5}[source]
Yeah, I flagged that because it’s basically indistinguishable from trolling. All it’s going to do is distract from the actual thread topic - nobody is going to learn something useful.
replies(1): >>37422339 #
58. ErneX ◴[] No.37421852{4}[source]
You can plug in an NVMe Thunderbolt caddy instead; it won’t reach a good NVMe SSD's top speeds but it will hover around 2800 MB/s read+write.

Its internal SSD at 1TB or greater capacity is at least twice as fast.

59. acdha ◴[] No.37421858[source]
Here’s a simple one: corporate policy doesn’t allow you to send company data to a cloud service. There are a ton of people with significant budgets in that situation.
replies(1): >>37424659 #
60. visarga ◴[] No.37421862[source]
> I think people will be surprised that consumers ultimately end up benefitting far more from LLMs than the providers.

LLMs make possible great skill sharing: they learn from some people through the web and books, and then assist other people with their particular problems. This level of sharing and customisation is even greater and more accessible than open source.

replies(1): >>37423105 #
61. superkuh ◴[] No.37421909{7}[source]
Yes, CPU inference. For llama.cpp with Apple M1/M2 the GPU inference (via metal) is about 5x faster than CPU for text generation and about the same speed for prompt processing. Not insignificant but not giant either.

You generally can't hook up large storage drives to nvme. Those are all tiny flash storage. I'm not sure why you brought it up.

replies(1): >>37422034 #
62. iaw ◴[] No.37421913{4}[source]
I built an x4 3090 rig a little while ago. There are a few hurdles:

1) Need 2 power supplies to run at peak power and most US outlets can't handle it

2) If you're running on 1 power supply need to limit power consumption and clock speed of the cards to prevent the PSU fuse from popping

3) To get all 4 cards running at once there is an adjustment needed for most MB bios

4) Finding an MB that can handle enough PCI-e lanes can be a challenge

Ultimately I think I get 95% or higher performance on one PSU with appropriate power limiting and 48 lanes (vs a full 64)

replies(1): >>37422385 #
63. gorbypark ◴[] No.37421945[source]
I can't wait for my phone to have something like 512GB-1TB of RAM to run some really interesting models locally :D
replies(1): >>37426671 #
64. easygenes ◴[] No.37421962{5}[source]
Depending on the task, the best open models will outperform GPT-3.5, but would be more expensive to run at comparable speed. GPT-4 is in a league of its own.
65. iaw ◴[] No.37421964{4}[source]
I built one on 4 used $800 3090s. All told, $6-6.5K in used components to build an open-air rig with consumer hardware.

I went with the cheapest Threadripper I could find, but after testing, the performance hit for going from 16 to 8 PCIe lanes was actually not that large and I would be okay going with 8 lanes for each card.

66. slt2021 ◴[] No.37421998{4}[source]
it is not really a moat if one engineer can leave OpenAI with all the secret sauce in his head and replicate it elsewhere (Anthropic?)
replies(2): >>37422647 #>>37423076 #
67. yumraj ◴[] No.37422034{8}[source]
> You generally can't hook up large storage drives to nvme. Those are all tiny flash storage.

What’s your definition of large?

2TB and 4TB NVME are not tiny. You can even buy 8TB NVMEs, though those are more expensive and IMHO not worth it for this use case.

2TB NVMEs are $60-$100 right now.

You can attach several of those via Thunderbolt/USB4 enclosures providing 2500-3000 MB/s

68. neonsunset ◴[] No.37422072[source]
I'm not sure why this is downvoted but wanted to chime in that ML successes are taking place, first and foremost, despite Python's shortcomings, which are many.

The user experience of working with the language is terrible because most tasks it is used for go way beyond the "scripting" scenario that Python was primarily made for (aside from being an easy language to pick up and use).

69. visarga ◴[] No.37422133{3}[source]
> vs by improving model architectures to be more efficient?

Or data quality: you get more from small models if you use high-quality data.

70. easygenes ◴[] No.37422182{6}[source]
For LLM applications, the performance loss when power-limiting a 3090 to 200W is fairly low, and you get peak perf/W.
replies(1): >>37426882 #
71. foobiekr ◴[] No.37422256{3}[source]
The cost isn’t sunk cost at all. These models need to be trained and retrained as data sets increase. Putting aside historical cutoff points, there’s a lot of data and kinds of data not currently used, and the costs even to train the current models are incredible.

I think you guys are missing a massive technical consideration which is cost. Training cost, offering cost. As with everything else in tech, outside of the bubble created by ZIRP over the last decade and a half (and the entire two generations of tech workers who never learned this important lesson thus far in their careers), costs matter and are a primary driver of technology success.

If you attached dollar costs to these models above, if the data was available, you’d quickly discover who (if anyone) has a sustainable business model and who doesn’t.

A sustainable model is what determines long term whether a technology is available and whether that leads to further improvement (and increasing sustainability/financial value).

replies(1): >>37422898 #
72. runjake ◴[] No.37422263{4}[source]
I think it's a safe bet to say it's not altruistic. And, if Meta were to wrestle away OpenAI's moat, they'd eagerly create their own, given the opportunity.
replies(3): >>37422875 #>>37423084 #>>37426473 #
73. foobiekr ◴[] No.37422273{4}[source]
This. We haven’t even entered the get-serious monetization era.

Now that the infinite free money pump has been turned down a bunch, we’re going to see what reality looks like.

replies(1): >>37430155 #
74. doctoboggan ◴[] No.37422279[source]
Georgi is doing so much to democratize LLM access, I am very thankful he is doing it all on apple silicon!
75. foobiekr ◴[] No.37422288{4}[source]
Commoditize your complement strategies can just as likely put a market into a zombie state in the long run.
76. foobiekr ◴[] No.37422380{4}[source]
Adoption and a mass of human feedback collected which is not available in the gleaned data sets.

Here’s another way to think about it. Why does ISA matter in CPUs? There are minor issues around efficiencies of various kinds, but the real advantage of any mainstream ISA is, in part, the availability of tooling (hence this was a correct and heavy early focus for the RISCV effort) but also a lot of ecosystem things you don’t see: for example, Intel and Arm have truly mammoth test and verification suites that represent thousands++ of man years of investment.

OpenAI almost certainly has a massive invisible accumulated value at this point.

The actual models themselves are the output in the same way that a packaged CPU is the output. How you got there matters almost as much or more.

replies(1): >>37426572 #
77. mk_stjames ◴[] No.37422385{5}[source]
Here are my current solutions, mostly:

1) I live in the EU with 240v mains and can pull 3600W from a single circuit and be OK. I'd also likely just limit the cards a bit for power efficiency as well as I do that already with a single 3090, running at 300W vs 350W max TDP makes such a small difference in performance I don't notice; this isn't gaming.

2) Still may run a dual PSU, but with the CPU below and the cards at 1200W i'd max at around 1600W, add overhead, and there are some 2000W PSUs that could do it. Also being on 240v mains makes this easier on the PSU selection as I have seen many can source more wattage when on 230/240 vs 115/120.

3) I'm leaning towards a Supermicro H12SSL-i motherboard which I've seen people run quad A5000's on without any hassle. Proper PCIe x16 slots spaced 2x apart, and I'd likely be watercooling all 4 cards together in a 2-slot spacing or installing MSI turbo heatsinks with blower fans / server rack style for 2 slot cooling.

4) See 3. AMD Epyc 7713 is currently this choice with 3200mhz ddr4 giving the memory bandwidth I need for a specific CPU workload as well.

I'm just currently trying to figure out if it is worth it dealing with the massive import duty I'd get hit with if I bought the H12SSL-i, RAM, and the EPYC from one of the many Chinese Ebay resellers that sell these things at basically wholesale. I get hit with 20% VAT plus another something like 20% duty on imported computer parts. It's close to the point where it would be cheaper for me to book a round trip flight to Taiwan and take parts home in my luggage. People in the USA don't realize how little they are paying for computing equipment, relatively.

So, minus the potential tax implications this can all be done for about 8000 EUR.

I do computational fluid dynamics work and am working on GPU-assisted solvers, which is the reason for the 64-core EPYC as parts of that workflow are still going to need CPU threads to pump the numbers to the GPGPUs. Also this is why I don't just run stuff in cloud like so many people will suggest when talking about spending 8K on a rig vs $2/hr per A100 at Lambda labs. My specific work needs to stay on-prem.

All told it is less than two A6000's at current prices and is a whole machine to boot.

replies(2): >>37422589 #>>37424346 #
78. wmf ◴[] No.37422416[source]
It's using the GPU so I guess not that many CPU threads are needed to feed the GPU.
79. iaw ◴[] No.37422589{6}[source]
> 2) Still may run a dual PSU, but with the CPU below and the cards at 1200W i'd max at around 1600W, add overhead, and there are some 2000W PSUs that could do it. Also being on 240v mains makes this easier on the PSU selection as I have seen many can source more wattage when on 230/240 vs 115/120.

One suggestion: also limit clock speeds/voltages. There are transients when the 3090 loads models that can exceed double their typical draw; 4 starting at once can draw substantially more than I expected.

80. GeekyBear ◴[] No.37422593{6}[source]
> Actual desktop PCs have many SATA ports

How many of those PCs have 10 Gigabit Ethernet by default? You can set up fast networked storage in any size you like and share it with many computers, not just one.

81. smoldesu ◴[] No.37422623[source]
> but when it comes to everyday users, the best UX wins

Is that not why OpenAI is ahead right now? For free, you can have access to powerful AI on anything with a web browser. You don't need to wait for your SSD to load the model, page it into memory and swap your preexisting processes like it would on a local machine. You don't need to worry about the local battery drain, heat, memory constraints or hardware limitations. If you can read Hacker News, you can use AI.

Given the current performance of local models, I bet OpenAI is feeling pretty comfortable from where they're standing. Most people don't have mobile devices with enough RAM to load a 13b, 4-bit Llama quantization. Running a 180B model (much less a GPT-4 scale model) on consumer hardware is financially infeasible. Running it at-scale, in the cloud is pennies on the dollar.

I'm not fond of OpenAI in the slightest, but if you've followed the state of local models recently it's clear why they keep coming out ahead.

replies(1): >>37422761 #
82. foobiekr ◴[] No.37422647{5}[source]
Name one software-based tech company where this isn’t true.
replies(1): >>37423025 #
83. anurag6892 ◴[] No.37422761{3}[source]
This advantage is not specific to OpenAI, right? Any big cloud provider like Amazon/Google can host these open LLM models.
replies(2): >>37422795 #>>37424905 #
84. growt ◴[] No.37422780[source]
So how much RAM did the machine have?
85. smoldesu ◴[] No.37422795{4}[source]
It's not exclusive, no. At OpenAI's scale though, they can afford to purchase their own hardware like a big cloud provider can. It's likely that OpenAI was running Nvidia's DGX servers in production before AWS and GCP even got their unit quotes.
86. passion__desire ◴[] No.37422875{5}[source]
Meta doesn't interact with its users in the very obvious ways that MS and Google do. All its model magic happens behind the scenes. Meta can continue to release second-best models to undercut others and keep them from getting too far ahead, and the open source community will take it from there. Dall-E is dead.
replies(2): >>37423063 #>>37427098 #
87. cs702 ◴[] No.37422892{4}[source]
Actually, we're saying the same thing: Models are becoming more commoditized, so profits will accrue, not to those companies who say they have the "best" models, but to the companies that have other kinds of advantages. When it comes to LLMs, no one has a technological advantage.
88. adam_arthur ◴[] No.37422898{4}[source]
GPT-4 cost on the order of $100 million, per Sam Altman.

This is orders of magnitude lower than many companies' and governments' R&D budgets. It's easily financeable by 1000s of independently wealthy people and organizations. It's easily financeable by VC money. This is far cheaper than many other startups or product initiatives that have been tried. There are very likely to be many organizations that build models for the specific purpose of open sourcing the resulting model... the Falcon and Llama models are already proof enough of this.

Costs to train equivalent models may increase in the short term due to race towards GPU consumption raising costs... but compute will get cheaper in aggregate over time due to improving compute tech.

And once the model is built it is largely a sunk cost, yes. All that needs to happen is for a single SoTA model to be made open to completely negate any advantage a competitor has. Monetization from LLMs will be driven by focused application of the models, not from providing an interface to a general model. High quality data holds more value than the resulting model

Not every query requires timeliness of data. Incorporating new data into an existing model is likely to be cheaper than retraining the model from scratch, but just speculation on my end.

replies(1): >>37429443 #
89. passion__desire ◴[] No.37422981{5}[source]
What if specialized smaller models are the best way ahead for the community? I don't care if I am interacting with one big model which can do everything or I have to go to different websites to access specific models. All model sizes will be useful. Smaller models will be frequently used, bigger ones less so.
90. beardedwizard ◴[] No.37423012[source]
Whatever he is doing, we must protect this man at all costs.
91. slt2021 ◴[] No.37423025{6}[source]
Microsoft? Google? FB?
92. swyx ◴[] No.37423055{3}[source]
he already has? lol https://news.ycombinator.com/item?id=36215651
93. bugglebeetle ◴[] No.37423063{6}[source]
And if all open source extends their models, they can accrue those benefits back to themselves. This is already how they’ve become such a huge player in machine learning (open sourcing amazing stuff).
94. beardedwizard ◴[] No.37423070[source]
Absolutely! Local experimentation. I built a transcription and summarization pipeline for $0. If I want it to be faster, I can move it to beefier hardware. If I fail 1000s of times it still costs me nothing.

Privacy is the second case, I don't want to leak all my great ideas or data to openai or anyone else.

95. Frannyies ◴[] No.37423076{5}[source]
I only meant the trained model.

You would need to steal it all over again as soon as the next model is trained.

replies(1): >>37424110 #
96. beardedwizard ◴[] No.37423085{3}[source]
You what? You can run smaller and plenty powerful models on a m1 MacBook. Idk what the porn and violence angle is but maybe keep that one to yourself.
replies(1): >>37423271 #
97. sangnoir ◴[] No.37423084{5}[source]
> And, if Meta were to wrestle away OpenAI's moat, they'd eagerly create their own

Meta is already capable of monetizing content generated by the models: these models complement their business and they could not care less which model you're using to earn them advertising dollars, as long as you keep the (preferably high quality) content coming.

98. passion__desire ◴[] No.37423105{3}[source]
All the great points Salman Khan made about Khan Academy in his famous TED talk apply here. The only difference is LLMs can go from ELI5 to ELI-PhD in just a few back-and-forths. Then, to put a cherry on top, you can ask it to summarize the conversation in a poem written in the style of Walt Whitman.
99. ViktorBash ◴[] No.37423114[source]
It's refreshing to see how fast open LLMs are advancing in terms of the models available. A year ago I thought that, aside from the novelty of it, running LLMs locally would be nowhere close to stuff like OpenAI's closed models in terms of utility.

As more and more models become open and are able to be run locally, the precedent gets stronger (which is good for the end consumer in my opinion).

100. ZoomerCretin ◴[] No.37423170[source]
GPT4 is still leagues ahead of the competition. Open source LLMs will be used more widely, but for the most demanding tasks, there is no alternative to GPT4.
replies(1): >>37506701 #
101. logicchains ◴[] No.37423271{4}[source]
One of the largest use-cases for local LLMs is NSFW chatbots, like DIY Replika, AI girl/boyfriends, as the hosted services are too censored to be used for this. Yes there are smaller models, but they're not as intelligent. Similarly, people using LLMs as a writing aid need to use local ones if they're writing a story (or, e.g., a DnD campaign) involving violence, as the hosted ones are generally unwilling to narrate graphic violence, and the smarter the model, the better the story quality.

Given that censorship is one of the biggest complaints about the hosted LLMs, it should be no surprise that some of the main use-cases driving local LLMs are those involving creating content that censored LLMs are unwilling to create.

102. acdha ◴[] No.37423503{7}[source]
Dude, just admit you were wrong. This is just painful - especially as other people are pointing out that this is a hard number to beat.
103. xpe ◴[] No.37424079[source]
I buy this general argument, at least to the extent that 'good enough' LLMs get commodified.

What are some of the key aspects of scenarios where this commodification happens? Where it doesn't?

Speaking descriptively (not normatively), I see a lot of possibilities about how things will unfold hinging on (a) licensing, (b) desire for recent data, (c) desire for private data, (d) regulation.

104. slt2021 ◴[] No.37424110{6}[source]
no need to steal the model if the training process can be reliably replicated/adopted in a clean-room implementation with additional optimisations.

a startup as a legal entity has close to 0 value; most value is in intellectual property, which is stored and transmitted by meatbags.

105. matwood ◴[] No.37424271{3}[source]
> And 4 3090s are definitely a lot more powerful and versatile than an M2 Mac.

I'm trying to figure out how a bunch of graphics cards are more versatile than an entire computer. Maybe there's a very narrow use case.

106. eurekin ◴[] No.37424346{6}[source]
> 3) I'm leaning towards a Supermicro H12SSL-i motherboard which I've seen people run quad A5000's on without any hassle. Proper PCIe x16 slots spaced 2x apart, and I'd likely be watercooling all 4 cards together in a 2-slot spacing or installing MSI turbo heatsinks with blower fans / server rack style for 2 slot cooling.

I had a 3x3090 rig. 2 of them are fried. 2 of them had two-sided Bykski blocks (they would never fit next to each other 2 slots apart; a water header block with 90-degree corner plumbing is a full 4-slot size minimum). 3090s are notorious for their VRAM chips, on the backside, failing due to insufficient cooling.

Also, 1kW of heat needs to be evacuated. A few hours of full power can easily heat up neighbouring rooms as well.

107. zamadatix ◴[] No.37424659{3}[source]
I think that use case still matches the remote cluster use case better, as a policy like "We can't use cloud" doesn't mean "we have to use our individual local workstations". This approach really makes sense for the "we have 1-3 people that want to really push this on a budget" case; beyond that, big iron makes more sense. And this still helps with that IMO, it's just one step in getting there from "only the largest can play".
replies(1): >>37424814 #
108. lambda_garden ◴[] No.37424763{3}[source]
> LLaMA, with up to 65B params, opened against Meta's wishes

They sure didn't try very hard to secure it. I wonder if it was their strategy all along.

replies(1): >>37426416 #
109. acchow ◴[] No.37424782{7}[source]
I don't really agree. This is a desktop machine so it will be staying put. It has Thunderbolt 4, which can exceed 3GB/s (24Gbps) on external SSDs. I don't think that expansion is useless.
110. acdha ◴[] No.37424814{4}[source]
Maybe, but that’s potentially slower and definitely much more expensive. A lot of people in those environments can get a $6k workstation a lot faster than a compute cluster which has to be supported, secured, etc.
111. nerbert ◴[] No.37424905{4}[source]
OpenAI's got the first mover advantage. It's everything if you don't fuck up.
112. noiv ◴[] No.37424918[source]
RAM may be growing, but free and acceptable content to train models isn't.

The question is which is the last model one might install to satisfy all needs.

113. baq ◴[] No.37425019{4}[source]
Wonder if someone is thinking of LLM specific RAM, slower but much denser. Bonus points for not having to reload the model after power cycling.

Maybe call this fantastic technology something idiotic like 3d XPoint?

replies(2): >>37425474 #>>37426620 #
114. ronsor ◴[] No.37425474{5}[source]
The problem with that is LLM speed is mostly bottlenecked by memory bandwidth. Slower RAM means worse performance.
115. AnthonyMouse ◴[] No.37426416{4}[source]
I suspect this was the goal of some of the people inside the company but imposing some nominal terms on it was the price of getting it through the bureaucracy, or maybe required by some agreement related to some mostly irrelevant but actually present subset of the original model.

Then the inevitable occurred and made it obvious that the restrictions were both impractical to enforce and counterproductive, so they released a new one with less of them.

116. AnthonyMouse ◴[] No.37426473{5}[source]
> And, if Meta were to wrestle away OpenAI's moat, they'd eagerly create their own, given the opportunity.

At which point the new underdogs would have an interest in doing to them what they're doing to OpenAI.

That's assuming progress for LLMs continues at a rapid pace for an extended period of time. It's not implausible that they'll get to a certain level past which non-trivial progress is hard, and if there is an open source model at that level there isn't going to be a moat.

117. AnthonyMouse ◴[] No.37426572{5}[source]
> Here’s another way to think about it. Why does ISA matter in CPUs?

Honestly the answer is that it mostly doesn't.

An ISA isn't viable without tooling, but that's why it's the first thing they all get. The only ISA with any significant moat is x86, and that's because there is so much legacy closed source software for it that people still need but would have to be emulated on any other architecture. And even that only works as long as x86 processors are competitive; if they fell behind then customers would just eat the emulation overhead on something else.

Other ISAs don't even have that. Would anybody actually be surprised if RISC-V took a huge chunk out of ARM's market share in the not too distant future?

replies(1): >>37429430 #
118. AnthonyMouse ◴[] No.37426620{5}[source]
> slower but much denser. Bonus points for not having to reload the model after power cycling.

This is called a solid state drive.

replies(1): >>37430833 #
119. Havoc ◴[] No.37426643[source]
Great progress, but I also can't help but feel a sense of apprehension on the access front.

An M2 Ultra, while consumer tech, is affordable to only a fairly small % of the world population.

120. AnthonyMouse ◴[] No.37426671{3}[source]
You can buy 768GB of DDR3 and an Ivy Bridge Xeon E5 to put it in for a total of around $500, most of which is the memory. (The CPUs wouldn't be fast for a model that size though.)
replies(1): >>37427110 #
121. AnthonyMouse ◴[] No.37426757{3}[source]
Apple is just using a wide memory bus, the same as GPUs and server-class x86 CPUs do. It's not even hard, it's just not something desktop CPUs previously had any use for so the current sockets don't support it.

And you could do the same thing without even changing the socket by including RAM on the CPU package as an L4 cache. Some of the Intel server CPUs are already doing this.

122. nmfisher ◴[] No.37426850{5}[source]
This isn't really true (or at least, doesn't apply across the board). Qwen (Alibaba's open source model) outperforms GPT4 on Chinese language tasks, and I can further finetune it for my own tasks (which I've done, and I can confirm it produces more natural output than GPT4).

Other benchmarks/anecdotes suggest fine-tuned code models are outperforming GPT4 too. The trend seems to be that smaller, fine-tuned task specific models outperform larger generalised models. It requires a lot of resources to pretrain the base model, but as we’ve seen, there’s no shortage of companies who are willing and able to do that.

Not to mention, all those other companies are already profitable, whereas OpenAI is already burning investor cash.

123. yumraj ◴[] No.37426882{7}[source]
So even with power limiting, with 4 3090s, you're looking at 800W from the GPUs alone. So about 1000W, give or take. Yes?

M2 Ultra [0] seems to max out at 295W

[0] https://support.apple.com/en-us/HT213100

replies(1): >>37429539 #
124. astrange ◴[] No.37427098{6}[source]
I think Dall-E isn't actually dead, but was merely renamed Bing Image Creator.
125. astrange ◴[] No.37427110{4}[source]
I'd be impressed if you fit that into a phone.
replies(1): >>37430712 #
126. catchnear4321 ◴[] No.37427799{3}[source]
i really can’t afford comments like this.
127. foobiekr ◴[] No.37429430{6}[source]
That's literally my point. The problem is that there's a massive amount of hidden infrastructure behind those that you don't see and that "oh look everyone has a big model" isn't as impressive as it sounds.
replies(1): >>37430771 #
128. foobiekr ◴[] No.37429443{5}[source]
I think you are overestimating R&D budgets for companies. Very few tech companies - even large ones - have R&D budgets in the $10B+ range, let alone $100B. Most of the Fortune 100 isn't even at $10B.
replies(1): >>37430039 #
129. two_in_one ◴[] No.37429522[source]
Just wondering, what are local LLMs used for today? So far they look more like a promise.
130. easygenes ◴[] No.37429539{8}[source]
Yeah, but watt for watt the 3090s will output more tokens, as a single 3090 has more memory bandwidth than an M2 Ultra. That's the main performance constraint for LLMs.

Dramatically oversimplifying of course. There will be niches where one will be the right choice over the other. In a continuous serving context you'd mostly only want to run models which can fully fit in the VRAM of a single 3090, otherwise the crosstalk penalty will apply. 24GB VRAM is enough to run CodeLlama 34B q3_k_m GGUF with 10000 tokens of context though.
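
Rough numbers to make that concrete (spec-sheet bandwidth figures, the 200W power limit mentioned upthread, and Apple's 295W figure which is whole-machine; treat all of this as a crude approximation):

  # Bandwidth per watt as a crude proxy for tokens per watt in the
  # bandwidth-bound single-stream case. Spec-sheet numbers, not measurements.
  rtx3090_bw, rtx3090_w = 936e9, 200   # ~936 GB/s GDDR6X, power-limited to 200W
  m2ultra_bw, m2ultra_w = 800e9, 295   # ~800 GB/s unified memory, 295W max (whole machine)

  print(f"3090:     {rtx3090_bw / rtx3090_w / 1e9:.1f} GB/s per watt")
  print(f"M2 Ultra: {m2ultra_bw / m2ultra_w / 1e9:.1f} GB/s per watt")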

131. mistymountains ◴[] No.37429907{3}[source]
Cool it with the italics.
132. danielbln ◴[] No.37430039{6}[source]
Where do you get $100B from?
replies(1): >>37435993 #
133. 6510 ◴[] No.37430155{5}[source]
Okay, I'll tell you. You need to start a startup that sets up a good number of cameras at manual labor jobs. Most of the footage will be completely useless, but every day you hit a once-in-a-day event, every week you get a once-in-a-week event, every month, every year, every decade, etc! Then the guy working there for 40 years whacks pipe 224 with a hammer 50 cm from the outlet and production resumes.

The footage can be aggressively pruned to fit on the disk.

When the robot is delivered in 2033 it can easily figure out, from the footage, all these weird and rare edge cases.

The difference will be like that between a competent but new employee and someone with 10 years of experience.

I can see the Tesla bots disassembling the production line already. Or do you think it won't happen?

replies(2): >>37454779 #>>37495611 #
134. AnthonyMouse ◴[] No.37430712{5}[source]
It'll make phone calls. Just put a VoIP app on it.

Obviously what you can do in practice is put the interface on your phone. It doesn't have to run on battery to run locally.

135. AnthonyMouse ◴[] No.37430771{7}[source]
But the open source infrastructure is getting built too. And the infrastructure is mostly independent of the model. This is Falcon 180B running using the code from llama.cpp.
136. baq ◴[] No.37430833{6}[source]
Goes to show how badly Intel executed that one.
replies(1): >>37430876 #
137. AnthonyMouse ◴[] No.37430876{7}[source]
What? You can do this right now. Put your >100GB model on your SSD in your computer with <100GB of RAM and use mmap. It's not fast, but it runs.
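
A minimal sketch of the idea (llama.cpp already does this via mmap by default; the filename here is just a placeholder):

  import mmap

  # Map a model file larger than RAM: the OS pages chunks in on demand and
  # evicts them under memory pressure, so it runs, just slowly.
  with open("falcon-180b.Q4_K_M.gguf", "rb") as f:   # placeholder path
      mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
      print(len(mm), mm[:4])   # touching a region faults in only those pages
      mm.close()
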
replies(1): >>37432010 #
138. baq ◴[] No.37432010{8}[source]
My point is Intel had the perfect tech for this and killed it.

https://en.wikipedia.org/wiki/3D_XPoint

replies(1): >>37435180 #
139. zagfai ◴[] No.37432698[source]
However, GPT3.5 did not surprise me but GPT4 did. 3.5 is just a kid.
140. AnthonyMouse ◴[] No.37435180{9}[source]
They didn't really. What this wants is gobs of memory bandwidth. The fastest NVMe SSDs can essentially saturate the PCIe bus. Using a dozen or more of them in parallel might even have reasonable performance for this. (Most desktops don't have this many PCIe lanes but HEDT and servers do). And they're a lot cheaper than Optane was.

To do better than that would have required the version of Optane that used DIMM slots, which was something like a quarter of the performance of actual DRAM for half the price.

So you had something that costs more than ordinary SSDs if your priority is cost and is slower than DRAM if your priority is performance. A lot of times a middle ground like that is still valuable, but since cache hierarchies are a thing, having a bit of fast DRAM and a lot of cheap SSD serves that part of the market well too.

And in the meantime ordinary SSDs got faster and cheaper and DRAM got faster and cheaper. Now you can get older systems with previous generation DRAM that are faster than Optane for less money. They stopped making it because people stopped buying it.

141. foobiekr ◴[] No.37435993{7}[source]
"orders of magnitude"
142. catchnear4321 ◴[] No.37440704{3}[source]
it seems infinitely cheaper to jailbreak poorly implemented publicly-facing gimmick LLM “use cases” and “demonstrations” that rely on / thinly veneer commercial apis.

(this is not financial advice and i am not a financial advisor.)

143. rogerdox14 ◴[] No.37452138{4}[source]
"recommended max working set size" is a property of the Mac the model is being run on, not the model itself. The model is smaller than that, otherwise it wouldn't be running on GPU.
144. Aerbil313 ◴[] No.37454779{6}[source]
Transformers can and do forget.
145. checkyoursudo ◴[] No.37495611{6}[source]
Assuming that this would work, which I am fine with granting for purposes of discussion, how does this method ever let you build anything new? Or make use of advances in production methods? Or completely reconfigure a production line because of some regulatory requirement?
146. eurekin ◴[] No.37506701{3}[source]
Anecdata confirmation: I've been toying around with LLMs for simple fun stuff, but when it comes to real work, GPT-4 delivers in spades.

I have cut many hours of debugging thanks to it. I could find issues easily, on-call, in a short conversation, when previously that was reserved as a post-mortem task.
Even reading documentation is nothing like before: once, I was looking for a single command to upload and presign an object in S3. The SDK has tens of methods, which require careful scanning to see if they do what I want. Going through the documentation thoroughly would've taken me hours. GPT-4 immediately found that no, there's no single operation for that.
Even reading documentation is nothing like before: once, I was looking for a single command to upload and presign a object in S3. SDK has tens of methods, which require careful scanning, if they do what I want. Going through documentation thoroughly would've taken me hours. GPT-4 simply found, no, there's no operation for that immediately.