total duration:       35.016288581s
load duration:        21.790458ms
prompt eval count:    1244 token(s)
prompt eval duration: 1.042544115s
prompt eval rate:     1193.23 tokens/s
eval count:           213 token(s)
eval duration:        33.94778571s
eval rate:            6.27 tokens/s

total duration:       4m44.951335984s
load duration:        20.528603ms
prompt eval count:    1502 token(s)
prompt eval duration: 773.712908ms
prompt eval rate:     1941.29 tokens/s
eval count:           1644 token(s)
eval duration:        4m44.137923862s
eval rate:            5.79 tokens/s
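For anyone who wants to sanity-check those rates, here is a rough Python sketch that recomputes tokens per second from the counts and durations in the two runs above. The helper names (`parse_duration`, `tokens_per_second`) are my own and hypothetical, not part of any tool, and the parser only assumes the Go-style duration strings seen in the output (e.g. `4m44.137923862s`, `773.712908ms`).

```python
import re

# Seconds per unit for Go-style duration strings such as "4m44.137923862s".
_UNIT_SECONDS = {
    "h": 3600.0, "m": 60.0, "s": 1.0,
    "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9,
}
# Two-letter units come first so "ms" is not read as "m" followed by "s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|us|µs|ns|h|m|s)")


def parse_duration(text):
    """Convert a duration string like '4m44.1s' or '773.7ms' to seconds."""
    parts = _PART.findall(text)
    # Reject input that is not made up entirely of number+unit pairs.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def tokens_per_second(count, duration):
    """Token count divided by the elapsed time in seconds."""
    return count / parse_duration(duration)


# (token count, duration) pairs copied from the two runs above.
runs = [
    {"prompt eval": (1244, "1.042544115s"), "eval": (213, "33.94778571s")},
    {"prompt eval": (1502, "773.712908ms"), "eval": (1644, "4m44.137923862s")},
]

for number, run in enumerate(runs, start=1):
    for phase, (count, duration) in run.items():
        rate = tokens_per_second(count, duration)
        print(f"run {number} {phase} rate: {rate:.2f} tokens/s")
```

The recomputed values should agree with the reported rates to within rounding. The takeaway is that prompt processing runs at over a thousand tokens/s while generation sits around 6 tokens/s, so nearly all of the wall-clock time goes to generating the output.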
All I'm saying is that, compared to an API call that finishes in about 20% of the time, it feels a bit slow without the recommended graphics card and so on.
In terms of benchmarks, it seems unusually well tuned for the model size, but I suspect that's just a case of gaming the measurement by testing against the benchmarks during development. That's not bad in and of itself, since I suspect every LLM vendor marketing to IT folks does the same thing tbh. Given that, the benchmarks are objective enough as a rough gauge of "Is this usable?" without spending a lot of time testing it yourself.