For this to "work" you need a metric showing that AIs given the compressed documentation perform as well, or nearly as well, as they do with the uncompressed documentation across a wide range of tasks.
You can use success rate (%) over N runs on a set of problems, which is something you can compare against other systems. A separate model does the evaluation; existing frameworks like DeepEval facilitate this.
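A minimal sketch of that kind of harness (framework-agnostic; `run_task` and `judge` here are whatever callables wrap your model under test and your evaluator model, and DeepEval offers ready-made metric classes for the same pattern):

    def success_rate(problems, context, run_task, judge, n_runs=5):
        """Percentage of (problem, run) pairs the evaluator model marks as correct."""
        successes, total = 0, 0
        for problem in problems:
            for _ in range(n_runs):
                answer = run_task(context, problem)   # model under test, given the docs as context
                if judge(problem, answer):            # separate model scores the answer True/False
                    successes += 1
                total += 1
        return 100.0 * successes / total

    # Compare the two contexts on the same problem set:
    # rate_full = success_rate(problems, full_docs, run_task, judge)
    # rate_min  = success_rate(problems, llm_min_docs, run_task, judge)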
Having data is how we learn and build intuition. If your experiments showed that modern LLMs were able to succeed more often when given the llm-min file, then that’s an interesting result even if all that was measured was “did the LLM do the task”.
Such a result would raise a lot of interesting questions and ideas, such as the possibility of SKF increasing the model's ability to apply new information.
The job of any context retrieval system is to retrieve the information relevant to the task so the LLM doesn't hallucinate. Maybe build a benchmark based on lesser-known external libraries, with test cases that check the output is correct (or with a mocking layer that verifies the LLM-generated code calls roughly the right functions).
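A rough sketch of such a mocking layer, assuming the target library is called `fancylib` (made up here) and `generated_code` is the string the LLM produced; swapping the module in `sys.modules` means the snippet's `import fancylib` picks up the mock instead of the real thing:

    import sys
    from unittest.mock import MagicMock

    def called_names(generated_code, module_name="fancylib"):
        """Run the generated snippet against a mock of the library and return
        the attribute paths it called, e.g. {'Client', 'Client().connect'}."""
        fake = MagicMock(name=module_name)
        sys.modules[module_name] = fake          # `import fancylib` now returns the mock
        try:
            exec(generated_code, {"__name__": "__main__"})
        finally:
            sys.modules.pop(module_name, None)
        return {name for name, _args, _kwargs in fake.mock_calls if name}

    # Loose check that the expected entry points were touched:
    # assert any(n.startswith("Client") for n in called_names(code_from_llm))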
Or better yet, directly check the difference in hidden states between a model fed the original prompt and one fed the shrunken prompt. This should remove the randomness from the results.
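Something like this, with Hugging Face transformers (the model name is a placeholder, `full_docs`/`llm_min_docs`/`question` are whatever you are testing, and a real comparison would look at more positions than just the final token):

    import torch
    from transformers import AutoModel, AutoTokenizer

    model_name = "gpt2"  # placeholder; use the open-weights model you actually care about
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()

    def last_token_state(prompt):
        """Last-layer hidden state of the final token."""
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
        with torch.no_grad():
            out = model(**inputs)
        return out.hidden_states[-1][0, -1]      # shape: (hidden_dim,)

    full = last_token_state(full_docs + "\n" + question)
    mini = last_token_state(llm_min_docs + "\n" + question)
    print("cosine similarity:", torch.nn.functional.cosine_similarity(full, mini, dim=0).item())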
Run the same questions against a model with the unminified docs and with the minified docs, show the results side by side, and see how, in your subjective opinion, they hold up.
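E.g., with an OpenAI-style client (the model name and the `questions`/`full_docs`/`llm_min_docs` variables are placeholders):

    from openai import OpenAI

    client = OpenAI()

    def ask(context, question):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # example model
            messages=[
                {"role": "system", "content": "Answer using this documentation:\n" + context},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content

    for q in questions:
        print("Q:", q)
        print("--- unminified ---\n" + ask(full_docs, q))
        print("--- llm-min ---\n" + ask(llm_min_docs, q))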