3 points by anuarsh | 9 comments
2. anuarsh ◴[] No.45058298[source]
Hi everyone. Any comments or questions are appreciated.
3. attogram ◴[] No.45058682[source]
"~20 min for the first token" might turn off some people. But it is totally worth it to get such a large context size on puny systems!
replies(1): >>45058870 #
4. anuarsh ◴[] No.45058870[source]
Absolutely. There are tons of cases where an interactive experience is not required, but the ability to process a large context and extract insights is.
replies(1): >>45061478 #
5. Haeuserschlucht ◴[] No.45060882[source]
20 minutes is a huge turnoff, unless you have it run overnight... only to get the hint in the morning that you should practice self-care, when all you wanted was to present a legal paper and have the AI check it for flaws.
replies(1): >>45067676 #
6. Haeuserschlucht ◴[] No.45060903[source]
It's better to have software erase all private details from the text, have it checked by a cloud AI, and then have all the placeholders replaced back on your hard drive.
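Something like this round trip, sketched in Python (regex email matching stands in for a real PII detector, and the cloud call is elided):

    import re

    def redact(text):
        # Swap each private detail for a placeholder before the text
        # leaves the machine; remember the mapping locally.
        mapping = {}
        def _swap(match):
            key = f"<PII_{len(mapping)}>"
            mapping[key] = match.group(0)
            return key
        # Emails as a stand-in for real PII detection (names, IDs, ...).
        redacted = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", _swap, text)
        return redacted, mapping

    def restore(text, mapping):
        # Put the original details back into the cloud AI's reply.
        for key, original in mapping.items():
            text = text.replace(key, original)
        return text

    safe, mapping = redact("Contact jane.doe@example.com about the case.")
    reply = safe  # ... send `safe` to the cloud AI, get `reply` back ...
    print(restore(reply, mapping))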
7. attogram ◴[] No.45061478{3}[source]
It would be interesting to see some benchmarks of this vs., for example, Ollama running locally with no timeout.
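A rough way to measure time-to-first-token against a local Ollama server (assuming the default port 11434 and a model you've already pulled):

    import json, time, requests

    start = time.time()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": "Summarize this document.",
              "stream": True},
        stream=True,
    )
    # Ollama streams one JSON object per line; the first non-empty
    # "response" field is the first generated token.
    for line in resp.iter_lines():
        if line and json.loads(line).get("response"):
            print(f"first token after {time.time() - start:.1f}s")
            break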
8. anuarsh ◴[] No.45067676[source]
We are talking about a 100k context here. 20k would be much faster, but you wouldn't need KVCache offloading for it.
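For a sense of scale, a back-of-the-envelope estimate assuming a Llama-3-8B-style model (32 layers, 8 KV heads via GQA, head dim 128, fp16):

    # KV cache bytes per token: K and V, per layer, per KV head.
    layers, kv_heads, head_dim, bytes_per = 32, 8, 128, 2

    def kv_cache_gb(tokens):
        return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1e9

    print(f"100k tokens: {kv_cache_gb(100_000):.1f} GB")  # ~13.1 GB
    print(f" 20k tokens: {kv_cache_gb(20_000):.1f} GB")   # ~2.6 GB

Roughly 13 GB of KV cache on top of the weights is why 100k needs offloading, while the ~2.6 GB at 20k usually still fits in VRAM.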