
176 points by marv1nnnnn | 1 comment
iandanforth ◴[] No.43995844[source]
I applaud this effort; however, the "Does it work?" section answers the wrong question. Anyone can write a trivial doc compressor and show a graph saying "The compressed version is smaller!"

For this to "work," you need a metric showing that AIs perform as well, or nearly as well, with the compressed documentation as with the uncompressed documentation across a wide range of tasks.

marv1nnnnn ◴[] No.43996061[source]
I totally agree with your critique. To be honest, it's hard even for me to evaluate. What I did was select several packages that current LLMs fail to handle (they're in the sample folder: `crawl4ai`, `google-genai`, and `svelte`) and try some tricky prompts to see if it works. But even that evaluation is hard, since the LLM can hallucinate. I would say it works most of the time, but there are always a few runs that fail to deliver. I actually prepared a comparison: Cursor vs. Cursor + internet vs. Cursor + context7 vs. Cursor + llm-min.txt. But the results were stochastic, so I didn't put it here. I'll consider adding it to the repo as well.
1. ricardobeat ◴[] No.43996846[source]
> But even that evaluation is hard, since the LLM can hallucinate. I would say it works most of the time, but there are always a few runs that fail to deliver

You can use success rate % over N runs for a set of problems, which is something you can compare to other systems. A separate model does the evaluation. There are existing frameworks like DeepEval that facilitate this.
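A minimal sketch of that setup, assuming a hypothetical `run_agent` hook for the system under test (e.g. Cursor with or without llm-min.txt) and a separate `judge` call backed by an evaluator model; both names are placeholders rather than an existing API, and frameworks like DeepEval package this judge-based loop with ready-made metrics:

```python
import statistics
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str      # the tricky prompt given to the agent
    reference: str   # ground-truth notes the judge scores against

# Placeholder hooks: wire these to the agent under test and to a
# separate evaluator model (LLM-as-judge). Names are illustrative only.
def run_agent(prompt: str, context: str) -> str:
    raise NotImplementedError

def judge(prompt: str, answer: str, reference: str) -> bool:
    raise NotImplementedError

def success_rate(tasks: list[Task], context: str, n_runs: int = 5) -> float:
    """Mean pass rate per task over n_runs, so stochastic failures average out."""
    per_task = []
    for task in tasks:
        passes = sum(
            judge(task.prompt, run_agent(task.prompt, context), task.reference)
            for _ in range(n_runs)
        )
        per_task.append(passes / n_runs)
    return statistics.mean(per_task)

# Compare configurations on the same task set, e.g.:
# for name, ctx in {"full docs": full_docs, "llm-min.txt": minified}.items():
#     print(f"{name}: {success_rate(tasks, ctx):.0%}")
```

Running every configuration (full docs, llm-min.txt, context7, etc.) over the same task set gives percentages that are directly comparable, which also addresses the stochasticity concern upthread.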