My guess is that there was an error during quantization that resulted in a large portion of the weights not being used properly. A potential test would be to compare the number of page faults between the quantized and unquantized models and confirm the counts are roughly proportional to their file sizes. This could also explain why, e.g., gpt4all users seem to notice better performance with unquantized weights when there really shouldn't be a difference.
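
A rough sketch of that test, assuming a llama.cpp-style `main` binary and hypothetical model paths (swap in whatever runner and files you actually have). It runs each model once as a child process and reads the child's minor/major page fault counts from `getrusage`, which avoids depending on the output format of an external `time` tool:

```python
import resource
import subprocess

# Assumptions: runner binary and model paths are placeholders for illustration.
RUNNER = "./main"
MODELS = {
    "unquantized": "models/7B/ggml-model-f16.bin",
    "quantized": "models/7B/ggml-model-q4_0.bin",
}

def page_faults(model_path):
    """Run one short generation and return (minor, major) page faults of the child."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(
        [RUNNER, "-m", model_path, "-p", "Hello", "-n", "16"],
        check=True, capture_output=True,
    )
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # Take deltas so previously reaped children don't skew the numbers.
    return (after.ru_minflt - before.ru_minflt,
            after.ru_majflt - before.ru_majflt)

if __name__ == "__main__":
    for name, path in MODELS.items():
        minor, major = page_faults(path)
        print(f"{name}: minor faults={minor}, major faults={major}")
```

If the quantized file is ~4x smaller but shows a similar or larger fault count than the f16 file, that would support the theory that parts of the quantized weights aren't being touched or mapped as expected.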