Ask HN: How are Markov chains so different from tiny LLMs?

jplr@mypass:~/Documenti/2025/SimpleModels/v3_very_good$ ./SLM10b_train UriAlon.txt 3 Training model with order 3... Skip-gram detection: DISABLED (order < 5) Pruning is disabled Calculating model size for JSON export... Will export 29832 model entries Exporting vocabulary (1727 entries)... Vocabulary export complete. Exporting model entries... Processed 12000 contexts, written 28765 entries (96.4%)... JSON export complete: 29832 entries written to model.json Model trained and saved to model.json Vocabulary size: 1727 jplr@mypass:~/Documenti/2025/SimpleModels/v3_very_good$ ./SLM9_gen model.json

Thanks for your kind words. My code is not really novel, but it is not like the simplistic Markov chain text generators that are found by the ton on the web.

I will further improve my code and publish it when I am satisfied on my Github account.

It started as a Simple Language Model [0] as they differ from ordinary Markov generators by incorporating a crude prompt mechanism and a kind of very basic attention mechanism named history. My SLM uses Partial Matching (PPM). The one in the link is character-based and is very simple, but mine uses tokens and is 1300 C lines long.

The tokenizer tracks the end of sentences and paragraphs.

I didn't use part-of-a-word algorithms as LLMs do, but it's trivial to incorporate. Tokens are represented by a number (again as in LLMs), not a character chain.

I use Hash Tables for the Model.

There are several mechanisms used for fallbacks when the next state function fails. One of them uses the prompt. It is not demonstrated here.

Several other implemented mechanisms are not demonstrated here, like model pruning, skip-grams. I am trying to improve this Markov text generator, and some tips in the comments will be of great help.

But my point is not to make an LLM, it's just that LLMs produce good results not because of their supposedly advanced algorithms, but because of two things:

- There is an enormous amount of engineering in LLMs, whereas usually there is nearly none in Markov text generators, so people get the impression that Markov text generators are toys.

- LLMs are possible because they use impressive hardware improvements over the last decades. My text generator only uses 5MB of RAM when running this example! But as commentators told, the size of the model explodes quickly, and this is a point I should improve in my code.

And indeed, LLMs, even small LLMs like NanoGPT are unable to produce results as good as my text generator with only 42KB of training text.

https://github.com/JPLeRouzic/Small-language-model-with-comp...