
426 points benchmarkist | 6 comments
1. LASR ◴[] No.42179539[source]
With what you can already do with current-gen models, along with RAG, multi-agent setups & code interpreters, the wall is very much model latency now, not accuracy.

There are so many interactive experiences that could be made possible at this level of token throughput from 405B-class models.

replies(2): >>42179814 #>>42191188 #
2. TechDebtDevin ◴[] No.42179814[source]
Like what..
replies(2): >>42180079 #>>42180145 #
3. davidfiala ◴[] No.42180079[source]
Imagine increasing the quality and FPS of those AI-generated Minecraft clones and experiencing even higher-quality, realtime AI-generated gameplay.

(yeah, I know they're doing textual tokens, but just sayin'...)

edit: context is https://oasisaiminecraft.com/

4. vineyardmike ◴[] No.42180145[source]
You can create massive variants of OpenAI's o1 model. The "Chain of Thought" tools become way more useful when you can iterate 100x faster. Right now, flagship LLMs stream responses back and barely beat the speed a human can read, so adding CoT makes it really slow for human-in-the-loop experiences. You can get a lot more interesting "thoughts" (or workflow steps, or whatever) when the model can do more without slowing down the human experience of using the tool.
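
To make that concrete, here's a minimal sketch of an iterate-and-critique CoT loop. Everything in it is a hypothetical stand-in (complete() is a placeholder for whatever fast inference endpoint you're using, not any specific vendor API):

    # Hypothetical sketch: iterative chain-of-thought refinement.
    def complete(prompt: str) -> str:
        raise NotImplementedError  # wire up to your fast inference endpoint

    def solve_with_cot(question: str, rounds: int = 10) -> str:
        answer = complete(f"Question: {question}\nThink step by step, then answer.")
        for _ in range(rounds):
            critique = complete(f"Find flaws in this reasoning:\n{answer}")
            if "no flaws" in critique.lower():
                break  # the model is satisfied with its own reasoning
            answer = complete(
                f"Question: {question}\nPrevious attempt:\n{answer}\n"
                f"Critique:\n{critique}\nWrite an improved answer."
            )
        return answer

That's up to ~20 model calls per answer: unusable at read-aloud streaming speeds, interactive at thousands of tokens per second.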

You can also get a lot fancier with tool usage when an LLM can use and reply to tools at a speed closer to that of a normal network service.
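
For example, a bare-bones tool loop (a sketch only; call_llm and the stub tools are assumptions, not a real vendor API):

    import json

    # Stub tools; real ones would hit actual services.
    TOOLS = {
        "get_weather": lambda args: {"temp_c": 21},
        "search_docs": lambda args: {"hits": []},
    }

    def call_llm(messages: list) -> dict:
        # Placeholder: returns {"tool": name, "args": {...}} or {"final": text}.
        raise NotImplementedError

    def run_agent(user_msg: str) -> str:
        messages = [{"role": "user", "content": user_msg}]
        while True:
            step = call_llm(messages)
            if "final" in step:
                return step["final"]
            result = TOOLS[step["tool"]](step["args"])  # run the requested tool
            messages.append({"role": "tool", "content": json.dumps(result)})

Every pass through the loop is a full model round-trip, so model latency sets the floor; at network-service speeds a multi-tool exchange feels like a single RPC.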

I've never timed it, but I'm guessing current LLMs don't handle "live video" type applications well. Imagine an LLM you could actually video chat with - it'd be useful for walking someone through a procedure, or advanced automation of GUI applications, etc.
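
The capture side of that might look like this (a sketch assuming OpenCV for the webcam; describe_frames() is a placeholder for the multimodal model call):

    import base64
    import time

    import cv2  # pip install opencv-python

    def describe_frames(jpegs_b64: list, question: str) -> str:
        raise NotImplementedError  # send frames + question to a multimodal model

    def live_assist(question: str, fps: float = 2.0, window_s: float = 3.0) -> str:
        cap = cv2.VideoCapture(0)  # default webcam
        frames = []
        deadline = time.time() + window_s
        while time.time() < deadline:
            ok, frame = cap.read()
            if ok:
                _, buf = cv2.imencode(".jpg", frame)  # JPEG-compress the frame
                frames.append(base64.b64encode(buf.tobytes()).decode())
            time.sleep(1.0 / fps)
        cap.release()
        return describe_frames(frames, question)

Even at a modest 2 fps, the model has to ingest a handful of images and answer within a second or two before the conversation feels like a video call.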

AND the holy grail of AI applications that would combine all of this: robotics. Today, Cerebras chips are probably too power-hungry for battery-powered robotic assistants, but one could imagine a Star Wars-style robot assistant many years from now. You could have a robot that navigates some space (a home or work setting), sees its environment, and processes the video in real time. It could then reason about the world and its given task by explicitly thinking through steps and critically self-challenging those steps.
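
As a control loop, that combination is roughly the following (every function here is a hypothetical placeholder):

    # Hypothetical sense-think-act loop for a robot assistant.
    def perceive() -> str:
        raise NotImplementedError  # real-time video -> scene description

    def plan(task: str, scene: str, objection: str = "") -> str:
        raise NotImplementedError  # explicit step-by-step reasoning

    def critique(plan_text: str) -> str:
        raise NotImplementedError  # return an objection, or "" if the plan holds

    def act(plan_text: str) -> None:
        raise NotImplementedError  # execute one physical step

    def run(task: str) -> None:
        while True:
            scene = perceive()
            plan_text = plan(task, scene)
            objection = critique(plan_text)  # critically self-challenge the steps
            if objection:
                plan_text = plan(task, scene, objection)
            act(plan_text)  # then loop: re-perceive and re-plan

Several model calls per physical action is exactly where this level of token throughput would matter.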

replies(1): >>42182913 #
5. manmal ◴[] No.42182913{3}[source]
> barely beat the speed a human can read

4o is way faster than a human can read.

6. TeeWEE ◴[] No.42191188[source]
How can a rulebook help fix incidents? I'd hope every incident is novel, since you solve the root issue each time. So each time you need to dig into the code, or the recently deployed code, and correlate it with your production metrics.

Or is the rulebook a simple rollback?