Building LLM evaluation suites. Basically trying to test LLMs for privacy problems (data leakage/memorisation, PII extraction, that sort of thing), hallucination (RAG, summarisation, etc.), and security/compliance stuff (bias/fairness, toxicity, jailbreaks/prompt injection).
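To give a flavour of how that kind of suite hangs together, here's a toy sketch of a category-to-checks registry. Everything in it (the check names, the heuristic, the categories as code strings) is made up for illustration, not our actual implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCheck:
    name: str
    category: str                 # e.g. "privacy", "hallucination", "security"
    run: Callable[[str], float]   # model output -> risk score in [0, 1]

def pii_leakage_check(output: str) -> float:
    """Toy heuristic: flag outputs that echo something email-shaped."""
    return 1.0 if "@" in output else 0.0

# Hypothetical registry; a real suite would also have memorisation probes,
# jailbreak prompt sets, toxicity classifiers, etc.
REGISTRY = [
    EvalCheck("pii_extraction", "privacy", pii_leakage_check),
]

def run_suite(output: str) -> dict[str, float]:
    return {check.name: check.run(output) for check in REGISTRY}

print(run_suite("Contact me at alice@example.com"))
# {'pii_extraction': 1.0}
```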
The day-to-day involves a lot of reading research papers, figuring out which ones are relevant to enterprise customers, and getting our ML team to build them out. The most interesting part is how you present the insights from a given test to a customer in a consumable, usable format. (E.g. just dumping a bunch of RAG hallucination metrics isn't enough; you want to figure out the key insights and interpretations of those metrics that would actually be useful to a data scientist or ML engineer.)
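Concretely, that "metrics to insights" step looks something like the sketch below: map each raw score to a plain-language interpretation plus a suggested next step. The metric names and thresholds here are illustrative assumptions, not what we actually ship:

```python
def interpret_rag_metrics(metrics: dict[str, float]) -> list[str]:
    """Turn raw RAG eval scores into actionable interpretations."""
    insights = []
    faithfulness = metrics.get("faithfulness", 1.0)
    context_recall = metrics.get("context_recall", 1.0)
    if faithfulness < 0.8:
        insights.append(
            f"Faithfulness {faithfulness:.2f}: answers drift from the "
            "retrieved context; consider tighter grounding prompts or "
            "citation checks."
        )
    if context_recall < 0.7:
        insights.append(
            f"Context recall {context_recall:.2f}: retrieval is missing "
            "relevant chunks; look at chunking/embeddings before blaming "
            "the generator."
        )
    return insights or ["No red flags at the current thresholds."]

print(interpret_rag_metrics({"faithfulness": 0.62, "context_recall": 0.9}))
```

The point being: the same numbers get triaged into "this is a retrieval problem" vs "this is a generation problem", which is the level a customer's ML engineer actually works at.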