There are so many interactive experiences that could be made possible at this level of token throughput from 405B class models.
You can also get a lot fancier with tool usage once an LLM can call tools and respond to their results at something closer to the speed of a normal network service. A rough sketch of what that loop looks like is below.
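Here's a minimal sketch of the standard tool-calling loop, assuming an OpenAI-compatible chat-completions endpoint; the base URL, model name, and `get_weather` tool are placeholders, not any specific vendor's API. The point is that the faster each completion returns, the more of these round trips you can afford inside a single user interaction.

```python
# Minimal tool-use loop sketch. Assumes an OpenAI-compatible endpoint;
# base_url, model name, and the get_weather tool are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="https://fast-inference.example.com/v1", api_key="...")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    # Stub tool: a real implementation would hit a weather API.
    return json.dumps({"city": city, "temp_c": 21})

messages = [{"role": "user", "content": "Do I need a jacket in Toronto today?"}]

# At ~1000 tokens/s, several of these round trips still cost well under a
# second of model time, so chaining tools feels like calling a normal service.
while True:
    resp = client.chat.completions.create(
        model="llama-3.1-405b",  # placeholder model name
        messages=messages,
        tools=tools,
    )
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:
        print(msg.content)
        break
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)  # dispatch to the stub tool
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```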
I've never timed it, but I'm guessing current LLMs don't handle "live video" type applications well. Imagine an LLM you could actually video chat with - it'd be useful for walking someone through a procedure, for advanced automation of GUI applications, and so on.
AND the holy grail of AI applications that would combine all of this - robotics. Today, Cerebras chips are probably too power-hungry for battery-powered robotic assistants, but one could imagine a Star Wars-style robot assistant many years from now. You could have a robot that navigates some space (a home or work setting), sees its environment, and processes the video in real time. It could then reason about the world and its given task by explicitly thinking through steps and critically self-challenging those steps.