AI agents: Less capability, more reliability, please

We have been looking at Hamming distance vs. time to signature for ambient note generation in medicine. Are there other metrics worth tracking? The ML papers offer plenty, but a lot of them seem sus: they either take a lot of work to reproduce or are designed around some strategy like maxing out the easy true negatives so that accuracy and F1 score look good. As someone trying to build validation protocols that I can get vendors to enable (we need them to write certain data from memory to a DB table we can access), I’d welcome that discussion. Right now the MBAs running the hospital systems are doing whatever their ML buddies say, without regard to patient or provider.
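To make the true-negative point concrete, here is a rough sketch with made-up labels (not data from any vendor or paper). The function names and the counts are my own assumptions for illustration. One caveat: F1 computed on the positive class alone ignores true negatives, so the inflation shows up in accuracy and in class-averaged (macro) F1, which is the variant assumed here.

```python
# Illustrative only: synthetic labels, hypothetical helper names.

def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1(tp, fp, fn):
    """F1 for one class; true negatives do not appear in the formula."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def report(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    positive_f1 = f1(tp, fp, fn)
    # For the negative class the roles swap: tn counts as its "tp".
    negative_f1 = f1(tn, fn, fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "positive_f1": positive_f1,
        "macro_f1": (positive_f1 + negative_f1) / 2,
    }

# Hard cases only: 10 real positives (5 caught), 10 tricky negatives (5 flagged).
hard_true = [1] * 10 + [0] * 10
hard_pred = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 5

# Same model, same hard cases, plus 980 easy negatives it trivially gets right.
padded_true = hard_true + [0] * 980
padded_pred = hard_pred + [0] * 980

for name, (yt, yp) in {"hard only": (hard_true, hard_pred),
                       "padded": (padded_true, padded_pred)}.items():
    scores = {k: round(v, 3) for k, v in report(yt, yp).items()}
    print(name, scores)
```

The model is exactly as bad on the hard cases in both runs, but padding the test set moves accuracy from 0.50 to 0.99 and pulls macro F1 up as well, while positive-class F1 stays at 0.50. So one thing worth asking a vendor is which F1 they report and what the class balance of the evaluation set was.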