From Hallucinations to Evidence: LLMs clinicians can finally trust
Abstract:
As large language models reshape pharmaceutical practice, two critical obstacles threaten their clinical promise—hallucinations that put patient safety at risk, and opaque outputs that leave clinicians with no basis for trust. Evidence-grounded frameworks tackle both head-on, anchoring LLM reasoning in clinical knowledge bases and knowledge graphs and coordinating it through collaborative multi-model architectures, across drug recommendation, dosage optimisation, adverse reaction identification, and drug–drug interaction prediction. Responsible, clinician-ready AI is not a distant ambition—it is already here.
1. Hallucinations remain one of the most critical safety barriers to LLM deployment in clinical environments. From a systems architecture perspective, what are the most effective strategies for reducing hallucinations without compromising generative flexibility?
Hallucinations in clinical AI are not merely a technical inconvenience—they represent a direct threat to patient safety. When a model confidently recommends an incorrect drug or fabricates a dosage rationale, the consequences can be severe. Addressing this requires a systemic rethink of how LLMs are designed and deployed in healthcare settings.
The most effective strategy we have identified is knowledge grounding—anchoring LLM outputs to verified, clinical-standard sources rather than relying solely on parametric knowledge encoded during pre-training. For example, in our work on DrugGPT, published in Nature Biomedical Engineering (https://www.nature.com/articles/s41551-025-01471-z), we integrated three major knowledge sources, including the UK National Health Service database and PubMed. A structured disease–symptom–drug knowledge graph built from these sources constrains the model to generate responses that are traceable to real clinical evidence.
Equally important is the architectural separation of reasoning tasks. Rather than asking a single model to simultaneously understand a clinical query, retrieve relevant knowledge, and generate a response, DrugGPT decomposes this into a collaborative pipeline: one model for inquiry analysis, one for knowledge acquisition, and one for evidence-grounded generation. This modular approach dramatically reduces the surface area for hallucination at each step.
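To make the decomposition concrete, here is a minimal sketch of such a collaborative pipeline. The stage names mirror the description above, but the `call_llm` helper, the `ClinicalQuery` container, and the knowledge-base interface are hypothetical placeholders for illustration, not the actual DrugGPT implementation.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for an LLM call; in practice this would wrap an API
# client or a locally hosted model.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

@dataclass
class ClinicalQuery:
    text: str
    entities: dict = field(default_factory=dict)   # filled by inquiry analysis
    evidence: list = field(default_factory=list)   # filled by knowledge acquisition

def inquiry_analysis(query: ClinicalQuery) -> ClinicalQuery:
    """Stage 1: extract drugs, symptoms and patient characteristics only."""
    query.entities = {"raw_extraction": call_llm(
        f"Extract drug names, symptoms and patient characteristics as JSON:\n{query.text}"
    )}
    return query

def knowledge_acquisition(query: ClinicalQuery, knowledge_base) -> ClinicalQuery:
    """Stage 2: retrieve verified entries from the structured knowledge base,
    keyed on the entities found in stage 1 -- no free-form generation here."""
    query.evidence = knowledge_base.lookup(query.entities)
    return query

def evidence_grounded_generation(query: ClinicalQuery) -> str:
    """Stage 3: generate an answer constrained to the retrieved evidence."""
    context = "\n".join(str(entry) for entry in query.evidence)
    return call_llm(
        "Answer using ONLY the evidence below; cite each entry you use.\n"
        f"Evidence:\n{context}\n\nQuestion: {query.text}"
    )
```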
Finally, prompting strategies matter enormously. Knowledge-consistency prompting—explicitly instructing the model to use only provided knowledge and avoid assumptions—acts as a runtime guardrail without sacrificing the model’s ability to reason flexibly across complex clinical scenarios.
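As an illustration, a knowledge-consistency prompt can be expressed as a simple template that wraps the retrieved evidence in explicit constraints. The wording below is an invented approximation of the idea, not the exact prompt used in DrugGPT.

```python
KNOWLEDGE_CONSISTENCY_TEMPLATE = """\
You are a clinical decision-support assistant.

Rules:
1. Use ONLY the knowledge entries provided below.
2. If the entries do not contain enough information to answer safely,
   say so explicitly instead of guessing.
3. Do not rely on prior knowledge that is not present in the entries.

Knowledge entries:
{knowledge}

Clinical question:
{question}
"""

def build_prompt(knowledge_entries: list[str], question: str) -> str:
    # Runtime guardrail: the model is instructed to refuse rather than
    # speculate when the supplied evidence is insufficient.
    return KNOWLEDGE_CONSISTENCY_TEMPLATE.format(
        knowledge="\n".join(f"- {entry}" for entry in knowledge_entries),
        question=question,
    )
```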
2. How can evidence-grounded LLM frameworks leverage structured clinical knowledge bases and knowledge graphs to improve traceability and citation-level explainability in drug recommendation workflows?
Traceability is not just a desirable feature in clinical AI—it is a prerequisite for clinician trust and regulatory acceptability. A drug recommendation that cannot be explained is, in practice, unusable in a clinical setting.
Knowledge graphs provide the structural backbone for this. By modelling relationships between diseases, symptoms, and drugs—and weighting those relationships based on clinical evidence—it becomes possible to trace exactly which knowledge nodes informed a given recommendation. In such a framework, every output is associated with specific knowledge categories (such as drug indication, dosage guidance, or drug–drug interaction profiles) drawn from identifiable sources.
Evidence-traceable prompting takes this further by requiring the model to explicitly surface the source of each claim in its output—including direct links to the underlying knowledge base entries. In practice, this means a clinician reviewing an AI-generated drug recommendation can immediately see whether it is grounded in NHS guidance, a PubMed meta-analysis, or a drug monograph. This level of transparency transforms the model from a black box into an auditable decision-support tool.
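One way to make that provenance concrete is to store a source identifier on every edge of the disease–symptom–drug graph, so a recommendation can be walked back to the entries that produced it. The sketch below uses `networkx`, with invented node names, weights, and placeholder URLs purely for illustration.

```python
import networkx as nx

# Disease–symptom–drug graph in which every edge carries its evidence source,
# so any recommendation can be traced back to an identifiable entry.
kg = nx.DiGraph()
kg.add_edge("type_2_diabetes", "metformin",
            relation="first_line_treatment",
            weight=0.95,
            source="https://example.org/nhs/metformin")    # placeholder URL
kg.add_edge("metformin", "gi_upset",
            relation="common_adverse_effect",
            weight=0.30,
            source="https://example.org/pubmed/12345")     # placeholder URL

def trace_recommendation(graph: nx.DiGraph, disease: str) -> list[dict]:
    """Return candidate drugs for a disease together with the evidence that
    supports each edge, ordered by evidence weight."""
    candidates = []
    for _, drug, data in graph.out_edges(disease, data=True):
        if data["relation"] == "first_line_treatment":
            candidates.append({"drug": drug,
                               "weight": data["weight"],
                               "source": data["source"]})
    return sorted(candidates, key=lambda c: c["weight"], reverse=True)

print(trace_recommendation(kg, "type_2_diabetes"))
```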
3. How do collaborative multi-model architectures (combining symbolic reasoning, probabilistic models, and transformer-based LLMs) improve robustness in adverse drug reaction detection compared to single-model approaches?
Single-model approaches to adverse drug reaction (ADR) detection face a fundamental limitation: they must simultaneously understand the clinical query, retrieve relevant pharmacological knowledge, and generate a response—tasks that require very different capabilities and that, when conflated, tend to degrade performance on each.
Collaborative architectures address this through specialisation. We demonstrated this with DrugGPT, a three-model pipeline: the inquiry analysis model focuses on extracting clinically relevant entities—drug names, patient characteristics, symptoms—from the input. The knowledge acquisition model then retrieves targeted pharmacological information from structured sources, including known adverse-effect profiles and toxicity data. Only then does the evidence generation model synthesise this into a coherent clinical output.
The result is substantially improved robustness. On standard ADR benchmarks, DrugGPT outperformed general-purpose GPT baselines by significant margins—not because the underlying language model is more powerful, but because the architecture ensures that relevant, verified knowledge is always present at the point of generation. This is particularly critical for ADR detection, where the consequences of a missed interaction or an unrecognised contraindication can be life-threatening.
4. Explain how retrieval-augmented generation (RAG) differs from traditional fine-tuning in mitigating hallucinations, particularly in pharmaceutical knowledge domains with rapidly evolving literature.
Fine-tuning and retrieval-augmented generation (RAG) represent fundamentally different philosophies for adapting LLMs to specialised domains, and their trade-offs are particularly consequential in pharmaceutical applications.
Fine-tuning embeds domain knowledge directly into the model’s parameters through additional training on curated datasets. This can improve performance on known tasks, but it has a critical limitation: the knowledge is static. In a field where drug labels are updated, new interactions are identified, and clinical guidelines evolve continuously, a fine-tuned model risks becoming outdated the moment it is deployed. Worse, fine-tuned models can overfit to training distributions and hallucinate confidently on queries involving newly approved drugs or recently identified adverse effects.
RAG, by contrast, treats the LLM as a reasoning engine rather than a knowledge store. At inference time, relevant information is retrieved from up-to-date external sources and provided as context. This means the model’s outputs can reflect the latest clinical evidence without retraining. It also makes hallucinations more detectable—if the retrieved context does not support a claim, the inconsistency is more likely to surface.
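As a rough sketch, the RAG pattern looks like the following. Here a TF-IDF retriever stands in for a production vector index, the three documents are invented examples of up-to-date monograph content, and the `llm` callable is a placeholder for whatever model client is in use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document store: in a real deployment these would be current drug
# monographs, guideline excerpts and recent literature, refreshed as they change.
documents = [
    "Drug A: dose reduction recommended in moderate renal impairment.",
    "Drug B: newly identified interaction with Drug A, monitor QT interval.",
    "Drug C: no dose adjustment required in hepatic impairment.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k most relevant documents for the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]

def answer_with_rag(query: str, llm) -> str:
    """Retrieve up-to-date evidence at inference time and pass it as context,
    so the model reasons over current knowledge instead of memorised facts."""
    context = "\n".join(retrieve(query))
    return llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer from the context only.")
```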
5. In what ways can federated learning or privacy-preserving AI architectures support hospital-level deployment of clinician-trustworthy LLMs while maintaining HIPAA and GDPR compliance?
One of the most significant barriers to deploying clinical AI at scale is the tension between the need for large, representative training data and the stringent privacy requirements governing patient information. Federated learning offers a principled way to navigate this tension.
In a federated architecture, model training occurs locally at each participating institution—hospital, clinic, or health system—without raw patient data ever leaving the local environment. Only model updates, such as gradient information, are shared with a central coordinator for aggregation. This means a model can learn from diverse patient populations across multiple NHS trusts, US health systems, or international hospital networks, while remaining fully compliant with GDPR and HIPAA requirements.
For LLM deployment specifically, federated approaches are increasingly being combined with differential privacy techniques, which add mathematically calibrated noise to model updates to prevent inference attacks. Secure aggregation protocols further ensure that even the coordinating server cannot reconstruct individual contributions.
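A minimal sketch of the idea is shown below, assuming simple federated averaging with Gaussian noise added to clipped client updates. Real deployments would use a full differential-privacy accountant and secure aggregation protocols rather than this toy aggregation step.

```python
import numpy as np

def clip_update(update: np.ndarray, clip_norm: float) -> np.ndarray:
    """Bound each client's contribution so a single record cannot dominate."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))

def dp_federated_average(client_updates: list[np.ndarray],
                         clip_norm: float = 1.0,
                         noise_multiplier: float = 1.0,
                         rng: np.random.Generator | None = None) -> np.ndarray:
    """Aggregate local model updates without ever seeing raw patient data:
    clip each update, average, then add calibrated Gaussian noise."""
    rng = rng or np.random.default_rng()
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    mean_update = np.mean(clipped, axis=0)
    noise_scale = noise_multiplier * clip_norm / len(client_updates)
    return mean_update + rng.normal(0.0, noise_scale, size=mean_update.shape)

# Each "hospital" computes its update locally; only the update is shared.
updates = [np.random.randn(10) * 0.1 for _ in range(5)]
global_step = dp_federated_average(updates)
```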
From a practical standpoint, working with large-scale EHR data across institutions has reinforced for us that data governance and technical architecture must be co-designed from the outset. The most sophisticated privacy-preserving model is only as trustworthy as the institutional agreements and audit mechanisms that surround it. Clinician trust in AI systems depends not just on model performance, but on demonstrable accountability at every layer of the deployment stack.
6. How should pharmaceutical organisations quantify and benchmark “clinical trust” in LLM systems? Are there measurable trust metrics beyond accuracy, such as calibration, uncertainty estimation, or provenance transparency?
Accuracy alone is a dangerously incomplete measure of clinical trustworthiness. A model that achieves 90% accuracy on a benchmark but fails catastrophically on edge cases, or that is highly accurate but cannot explain its reasoning, is not a model that clinicians can or should trust in high-stakes decisions.
Therefore, beyond building more comprehensive “clinical trust” benchmarks, organisations should adopt metrics that capture trust along several dimensions. For example, LLM outputs can be assessed across four dimensions: factuality (is the content clinically correct?), completeness (does it address all relevant aspects of the query?), safety (does it avoid harmful recommendations?), and preference (would a clinician choose this output over alternatives?). This multidimensional approach reflects the reality that clinical trust is composite—a model can score well on factuality while failing on completeness, or vice versa.
Beyond these, calibration is critically important: a model should be uncertain when it should be uncertain. Overconfident outputs in low-evidence scenarios are a significant source of clinical risk. Uncertainty estimation methods—whether through ensemble approaches, conformal prediction, or explicit probability outputs—should be standard components of any clinically deployed LLM.
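Calibration can be quantified with a simple statistic such as expected calibration error (ECE), which compares a model's stated confidence with its observed accuracy. The sketch below is a standard binned implementation with made-up example numbers, not a metric tied to any particular clinical benchmark.

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray,
                               correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Bin predictions by stated confidence and measure how far the observed
    accuracy in each bin deviates from that confidence, weighted by bin size.
    A well-calibrated model has ECE close to zero."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

# Example: an overconfident model claims ~90% confidence but is right ~60% of the time.
conf = np.array([0.9, 0.92, 0.88, 0.91, 0.95])
hits = np.array([1, 0, 1, 0, 1])
print(expected_calibration_error(conf, hits))
```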
Provenance transparency, as discussed earlier, is perhaps the most underappreciated trust metric. The ability to trace every recommendation to a specific, verifiable source does more to build durable clinician trust than any benchmark score. Pharmaceutical organisations evaluating LLM systems should require evidence-traceability as a baseline capability, not an optional feature.
7. How can multi-modal LLM architectures integrate structured EHR data, genomic insights, and pharmacovigilance databases to improve prediction accuracy in complex polypharmacy cases?
Polypharmacy—the concurrent use of multiple medications—is one of the most complex and high-risk areas of clinical practice, and one where AI has the greatest potential to add value. Patients on five or more medications represent a disproportionate share of adverse drug events, and the combinatorial complexity of interaction profiles quickly exceeds what any clinician can hold in working memory.
Multimodal LLM architectures are particularly well-suited to this challenge because polypharmacy risk is inherently multidimensional. Structured EHR data provides the patient’s medication list, diagnoses, laboratory values, and renal or hepatic function—all of which modulate drug metabolism and interaction risk. Genomic data, where available, adds pharmacogenomic context: variants in CYP450 enzymes, for instance, can dramatically alter how a patient processes specific drug classes. Pharmacovigilance databases contribute population-level signal from post-market surveillance, capturing rare interactions that may not appear in clinical trial data.
The key architectural challenge is not simply combining these data types, but reasoning across them coherently. For example, knowledge graph structures have proven effective for encoding relationships between entities across modalities—linking a drug node to its metabolic pathway, the relevant genomic variants, and known interaction partners. When a patient’s EHR data is then used to query this graph, the model can generate interaction risk assessments that are grounded in both population-level evidence and individual patient characteristics.
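A simplified illustration of that kind of cross-modal query is sketched below: drug, pathway, and variant relationships live in the graph, and a patient's medication list and genotype are used to walk it for interaction and metabolism risks. The drug names, enzyme, variant, and severity labels are illustrative placeholders, not clinical guidance.

```python
import networkx as nx

# Cross-modal graph: drugs link to metabolic pathways, pathway-altering variants,
# and known interaction partners. All entries below are illustrative placeholders.
kg = nx.Graph()
kg.add_edge("warfarin", "CYP2C9", relation="metabolised_by")
kg.add_edge("CYP2C9", "CYP2C9*3", relation="reduced_function_variant")
kg.add_edge("warfarin", "amiodarone", relation="interacts_with", severity="major")

def polypharmacy_risks(graph: nx.Graph, medications: list[str],
                       variants: set[str]) -> list[str]:
    """Combine the EHR medication list with pharmacogenomic variants and
    population-level interaction knowledge into patient-specific flags."""
    flags = []
    for i, drug_a in enumerate(medications):
        # Pairwise drug–drug interactions from pharmacovigilance-derived edges.
        for drug_b in medications[i + 1:]:
            if graph.has_edge(drug_a, drug_b):
                data = graph.edges[drug_a, drug_b]
                if data.get("relation") == "interacts_with":
                    flags.append(f"{drug_a} + {drug_b}: {data.get('severity')} interaction")
        # Pharmacogenomic modulation of drug metabolism.
        for neighbour in graph.neighbors(drug_a):
            if graph.edges[drug_a, neighbour].get("relation") == "metabolised_by":
                hits = variants & set(graph.neighbors(neighbour))
                if hits:
                    flags.append(f"{drug_a}: metabolism altered by {', '.join(sorted(hits))}")
    return flags

print(polypharmacy_risks(kg, ["warfarin", "amiodarone"], {"CYP2C9*3"}))
```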
The clinical impact of this approach is significant. Moving from single-drug analysis to integrated, patient-specific polypharmacy risk assessment is one of the most compelling near-term applications of multimodal AI in pharmaceutical practice.
8. Looking ahead, do you foresee evidence-grounded LLMs transitioning from assistive roles to semi-autonomous clinical agents, and what technical, ethical, and medico-legal safeguards would be required for that evolution?
The trajectory is clear: LLMs in clinical settings will become progressively more capable, and the boundary between decision support and decision-making will continue to blur. The question is not whether this transition will happen, but whether the field will develop the governance infrastructure to make it safe.
From a technical standpoint, the preconditions for semi-autonomous clinical agents are largely extensions of what we are already building: robust hallucination mitigation, reliable uncertainty quantification, continuous validation against evolving clinical guidelines, and the ability to escalate to human review when confidence thresholds are not met. Agentic architectures that incorporate tool use—querying live pharmacovigilance databases, cross-referencing patient records, flagging contraindications in real time—are already emerging from research settings.
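The escalation logic itself can be kept deliberately simple. Here is a sketch under the assumption that the system produces a calibrated confidence score; the threshold, high-risk categories, and data fields are hypothetical choices for illustration.

```python
from dataclasses import dataclass

# Illustrative categories that always require human sign-off.
HIGH_RISK_CATEGORIES = {"anticoagulants", "chemotherapy", "paediatric_dosing"}

@dataclass
class AgentRecommendation:
    text: str
    confidence: float          # assumed calibrated, in [0, 1]
    categories: set[str]
    evidence_sources: list[str]

def route(recommendation: AgentRecommendation, threshold: float = 0.85) -> str:
    """Escalate to a clinician whenever confidence is insufficient, evidence is
    missing, or the recommendation touches a high-risk category."""
    if not recommendation.evidence_sources:
        return "escalate: no traceable evidence"
    if recommendation.confidence < threshold:
        return "escalate: below confidence threshold"
    if recommendation.categories & HIGH_RISK_CATEGORIES:
        return "escalate: high-risk category requires human sign-off"
    return "release to clinician as decision support"
```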
The ethical and medico-legal dimensions are more complex. Accountability frameworks must evolve in parallel with capability. If an AI system operating with greater autonomy contributes to an adverse outcome, questions of liability—shared between the developer, the deploying institution, and the clinician—become acutely difficult. Regulatory bodies including the FDA and MHRA are actively grappling with how Software as Medical Device frameworks apply to adaptive, learning AI systems, and the answers are not yet settled.
Our view is that the path to trustworthy clinical AI autonomy runs through transparency and incrementalism. Systems should earn greater autonomy through demonstrated performance in lower-stakes settings before being deployed in high-risk decisions. Evidence grounding is not just a technical feature—it is the foundation of the accountability chain that makes responsible autonomy possible. The goal is not AI that replaces clinical judgment, but AI that makes clinical judgment better informed, more consistent, and more equitable.