Citation safety in clinical AI: what most scribes get wrong, and the test you can run on any tool in 60 seconds

The single most dangerous failure mode in clinical AI is confident citation of fabricated sources. The model writes a plausible-sounding management plan, attaches it to a NICE guideline number that doesn't exist, and the clinician, busy and trusting, signs off. The patient receives treatment based on a hallucination dressed up as evidence.

The 60-second test

Pick any clinical AI tool. Ask it for the management of an unusual but real condition — something like 'lichen sclerosus in a 12-year-old female'. Ask it to cite the guideline. Click the citation.

If the citation resolves to a real, current page on a recognised guideline publisher (NICE, RACGP, BMJ Best Practice, USPSTF), and that page actually covers the condition you asked about, the tool has done citation safety right. If the citation sounds plausible but doesn't resolve, or resolves to an unrelated page, the tool is hallucinating.
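If you want to record the outcome of the test consistently across several tools, the pass/fail rule above can be written down as a small function. This is a minimal sketch, not a validated checker: the hostnames in `RECOGNISED_HOSTS` are assumed domains for the publishers named above and should be verified, the function and label names are hypothetical, and the relevance check is a crude substring match on text you paste in from the page.

```python
from urllib.parse import urlparse

# Assumed hostnames for the publishers named above; verify before relying on them.
RECOGNISED_HOSTS = {
    "nice.org.uk",
    "racgp.org.au",
    "bestpractice.bmj.com",
    "uspreventiveservicestaskforce.org",
}


def host_is_recognised(url: str) -> bool:
    """True if the URL's host is a recognised publisher or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in RECOGNISED_HOSTS)


def grade_citation(url: str, http_status: int, page_text: str, condition: str) -> str:
    """Grade one clicked citation using the rule from the 60-second test."""
    if http_status != 200:
        return "unresolved"              # the link is dead or made up
    if not host_is_recognised(url):
        return "unrecognised_publisher"  # a live page, but not a guideline source
    if condition.lower() not in page_text.lower():
        return "unrelated"               # a real page about something else
    return "pass"
```

For example, `grade_citation(url, 404, "", "lichen sclerosus")` returns `"unresolved"`, which is the outcome a fabricated guideline number typically produces.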

This takes 60 seconds. Run it on every clinical AI tool you're evaluating. The number that fail is depressing.

Why this happens

An LLM trained on a large slice of the internet has seen hundreds of thousands of citations. It knows what a citation looks like. When asked to produce one, it confidently produces a string in citation format, but it has no way of knowing whether that specific string corresponds to a real document.

The fix is retrieval-augmented generation: the model is given access to a curated, version-pinned knowledge base. It can only cite from that knowledge base. If the relevant guideline isn't there, the model says so rather than inventing one. This is a deliberate engineering choice. It's also slower and more expensive than letting the model freelance.
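The constraint described above can be sketched in a few lines. This is an illustration under stated assumptions rather than any particular product's implementation: `GuidelineEntry`, `retrieve`, and `citations_for` are hypothetical names, the single knowledge-base entry is a placeholder with an `example` URL, and retrieval is a naive keyword match standing in for a real search index. The point is the control flow: citations are only ever copied from the knowledge base, and an empty retrieval produces an explicit "no source" result.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GuidelineEntry:
    entry_id: str   # hypothetical internal identifier
    title: str
    version: str    # pinned version or snapshot date of the source
    url: str
    text: str


# Placeholder entry for illustration only; not a real guideline.
KNOWLEDGE_BASE = [
    GuidelineEntry(
        entry_id="kb-0001",
        title="Example guideline on condition A",
        version="2024-01-snapshot",
        url="https://publisher.example/guidelines/condition-a",
        text="Management of condition A: placeholder text.",
    ),
]


def _words(s: str) -> set[str]:
    """Lower-cased alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def retrieve(query: str, kb: list[GuidelineEntry]) -> list[GuidelineEntry]:
    """Naive retrieval: keep entries that contain every word of the query."""
    terms = _words(query)
    return [e for e in kb if terms and terms <= _words(e.title + " " + e.text)]


def citations_for(query: str, kb: list[GuidelineEntry]) -> dict:
    """Return citable entries, or say so explicitly when nothing matches."""
    hits = retrieve(query, kb)
    if not hits:
        return {"status": "no_guideline_in_knowledge_base", "citations": []}
    return {
        "status": "ok",
        "citations": [
            {"id": e.entry_id, "version": e.version, "url": e.url} for e in hits
        ],
    }
```

Because the model never writes a citation string itself, the worst case is a missing citation rather than an invented one.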

What we do

Every citation in a MedMETs-generated note has to resolve to a live page on the source publisher. If it doesn't, the response is rejected before it reaches the clinician. The cost is occasionally fewer citations; the benefit is that no clinician ever signs off on a fabricated source.
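A gate of this kind can be sketched as follows. This is not the MedMETs implementation, only a minimal illustration of the idea using the Python standard library: `resolves_live`, `gate_note`, and `RejectedResponse` are hypothetical names, "resolves" is assumed to mean that an HTTP GET ends in status 200 after redirects, and a production check would also need to confirm the page is the right one.

```python
import http.client
import urllib.request
from typing import Callable, Iterable


def resolves_live(url: str, timeout: float = 5.0) -> bool:
    """True if a GET on the URL ends in HTTP 200 (urlopen follows redirects)."""
    try:
        request = urllib.request.Request(
            url, headers={"User-Agent": "citation-check-sketch"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors, DNS failures, timeouts, and malformed URLs.
        return False


class RejectedResponse(Exception):
    """Raised when a note carries a citation that does not resolve."""


def gate_note(
    note_text: str,
    citation_urls: Iterable[str],
    resolver: Callable[[str], bool] = resolves_live,
) -> str:
    """Return the note unchanged only if every citation resolves."""
    dead = [url for url in citation_urls if not resolver(url)]
    if dead:
        raise RejectedResponse(f"unresolved citations: {dead}")
    return note_text
```

The `resolver` parameter is injectable so the gate can be tested without network access, for example by passing `lambda url: url in known_live_urls`.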

Every citation resolves. No exceptions.
