The problem is, you have to know enough about the subject you're asking about to land in the right place in the embedding. If you don't, you'll just get bunk. (I know it's popular to call AI bunk "hallucinations" these days, but really, if it were being spouted by a half-wit human we'd just call it "bunk".)
So you really have to be an expert in order to maximize your use of an LLM. And even then, you'll only be able to maximize your use of that LLM in the field in which your expertise lies.
A programmer, for instance, will likely never be able to ask a coherent enough question about economics or oncology for an LLM to give a reliable answer. Similarly, an oncologist will never be able to give a coherent enough software specification for an LLM to write an application for him or her.
That's the Achilles' heel of AI today as implemented by LLMs.
That’s not true.
I don't see how this will ever work. Even in hard science there's debate over what content is trustworthy and what is not. Imagine trying to declare your source of training material on religion, philosophy, or politics "trustworthy".
But really, you leave the curation to real humans, institutions with ethical procedures already in place. I don't want Google or Elon dictating what truth is, but I wouldn't mind if NASA or other aerospace institutions dictated what truth is in that space.
Of course, the dataset should have a list of every document/source used, so others can audit it. I know, unthinkable in this corporate world, but one can dream.