Show HN: Cancer diagnosis makes for an interesting RL environment for LLMs

How would a human classify the cancers? I assume the LLM training data does not include a whole bunch of cancer samples, so assumably there are some rules that it follows?

> While there exists several pathology-focused AI models

Would also be curious how the LLM compares to this and other approaches. What's the performance of the models trained specifically on this task, and random guessing, compared to the expert pathologist? Correct me if I'm wrong but this seems like the sort of task where being right 90% of the time is not good enough, so even if the LLM beats other approaches, it still needs to close the gap to human performance.

> What's the performance of the models trained specifically on this task, and random guessing, compared to the expert pathologist?

I should probably first clarify here, the disease classification tasks are about subtyping the type of cancer (i.e classifying a case as invasive ductal carcinoma of the breast) rather than just binary malignant/benign classification so random guessing is much more difficult and makes this model performance more impressive.

> Would also be curious how the LLM compares to this and other approaches.

There aren't a lot of public general purpose pathology benchmarks. There are some like (https://github.com/sinai-computational-pathology/SSL_tile_be...) but focus on just binary benign/malignant classification tasks and binary biomarker detection tasks.

I am currently working on self-hosting the available open-source models.

> this seems like the sort of task where being right 90% of the time is not good enough, so even if the LLM beats other approaches, it still needs to close the gap to human performance

Yep, your intuition is right here, and actually the expectation is probably closer to mid-high 90%, especially for FDA approval (and most AI tools position as co-pilots at the moment). There is obviously a long way to go, but what I find about interesting about this approach is that it allows LLMs to generalize across (1) a variety of tissue types and (2) pathology tasks such as IHC H-score scoring.

You should read out to Eric Topol...

very cool, have you tried some of the newer segmenting models to see if they make a difference? I've seen some in the past two weeks that look really effective...I wonder if it could help out the RL environment

Do you think finetuning these LLMS would bring about comparable results to specific models trained for this?

Did you fine tune GPT 5, Sonnet 4.5, or any of the other models? Or, were the models able to do this "out of the box?"

none of the models you mentioned are open source...

Nope, I just did some prompt engineering on ootb models. I thought about doing some fine-tuning on like Qwen but think that there is still more performance to be squeezed out with just prompts here.

I think so. It feels like there is more to be squeezed from just better prompts but was going to play around with fine-tuning Qwen3

Nope I haven't, I can take a look and see if I can fit it in

Wow this is pretty interesting. Excited to see the benchmark!