iandanforth ◴[] No.43995844[source]
I applaud this effort; however, the "Does it work?" section answers the wrong question. Anyone can write a trivial doc compressor and show a graph saying "The compressed version is smaller!"

For this to "work", you need a metric showing that AIs perform as well, or nearly as well, as they do with the uncompressed documentation across a wide range of tasks.
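
Concretely, the measurement I mean looks something like the sketch below. Everything in it is hypothetical (`ask_llm`, the task list, and the file names are placeholders, not anything from the project): run the same tasks with the full docs and with the compressed file in context, then report both pass rates.

```python
# Hypothetical evaluation sketch: same tasks, two contexts, compare pass rates.
# `ask_llm`, `Task`, and the file names are placeholders, not real project code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # returns True if the model's answer is acceptable

def ask_llm(prompt: str, context: str) -> str:
    raise NotImplementedError("call whatever model/agent you are evaluating here")

def pass_rate(tasks: list[Task], context: str, trials: int = 5) -> float:
    # Repeat each task a few times because the outputs are stochastic.
    passed = sum(
        task.check(ask_llm(task.prompt, context))
        for task in tasks
        for _ in range(trials)
    )
    return passed / (len(tasks) * trials)

# "It works" would mean these two numbers end up close:
# full_rate = pass_rate(tasks, open("full_docs.md").read())
# min_rate  = pass_rate(tasks, open("llm-min.txt").read())
```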

replies(5): >>43996061 #>>43996217 #>>43996319 #>>43996840 #>>44003395 #
marv1nnnnn ◴[] No.43996061[source]
I totally agree with your criticism. To be honest, it's hard even for me to evaluate. What I did was select several packages that current LLMs fail to handle (they're in the sample folder: `crawl4ai`, `google-genai`, and `svelte`) and try some tricky prompts to see if it works. But even that evaluation is hard, because the LLM could hallucinate. I would say it works most of the time, but there are always a few runs that fail to deliver. I actually prepared a comparison of cursor vs cursor + internet vs cursor + context7 vs cursor + llm-min.txt, but the results were stochastic, so I didn't include it here. I'll consider adding it to the repo as well.
replies(5): >>43996846 #>>43997120 #>>43997327 #>>44002248 #>>44002383 #
timhigins ◴[] No.43997327[source]
> the LLM could hallucinate

The job of any context retrieval system is to retrieve the relevant info for the task so the LLM doesn't hallucinate. Maybe build a benchmark based on lesser-known external libraries, with test cases that check the output is correct (or with a mocking layer to verify that the LLM-generated code calls roughly the right functions).
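
For the mocking idea, a rough sketch of what such a check could look like; `somelib` and the expected call names are made up for illustration, not a real library:

```python
# Run the LLM-generated snippet against a MagicMock standing in for the
# library, then check that roughly the right functions were called.
import sys
from unittest.mock import MagicMock

def calls_expected_functions(generated_code: str, expected_calls: set[str]) -> bool:
    fake_lib = MagicMock()
    sys.modules["somelib"] = fake_lib  # `somelib` is a placeholder module name
    try:
        exec(generated_code, {"__name__": "__main__"})
    except Exception:
        return False  # the generated code didn't even run
    finally:
        del sys.modules["somelib"]
    called = {name for name, *_ in fake_lib.mock_calls}
    return expected_calls <= called

# Example: did the snippet call somelib.connect and somelib.fetch?
# calls_expected_functions(llm_output, {"connect", "fetch"})
```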

replies(1): >>44000925 #
marv1nnnnn ◴[] No.44000925[source]
Thanks for the feedback. This will be my next step. Personally, I feel it's hard to design those test cases by myself.