No Black Boxes: Knowledge-Enhanced Causal AI Agents for Healthcare

The Problem

Deep Learning in Healthcare is a Black Box

Clinicians need to understand why a model makes a prediction, not just what it predicts. Current deep learning models provide the prediction but not the reasoning behind it.

✗ Traditional Deep Learning

✗ Opaque reasoning — no explanations
✗ No clinician interactivity or customization
✗ Cannot incorporate domain expertise
✗ Undermines clinician trust and adoption

✓ II-KEA (Our Approach)

✓ Explicit causal reasoning & explanations
✓ Clinicians can inject their own knowledge
✓ Customizable goals and knowledge bases
✓ Superior performance on real EHR datasets

Methodology

How II-KEA Works

Three LLM agents collaborate to go from a patient's diagnosis history to an interpretable, causally-grounded prediction.

INPUT

🧑‍⚕️

Patient History

EHR · Diagnosis Records

ICD-9 · Multi-visit

📊

Transition Matrix

Disease Probabilities A^T

0.3  0.4  0.0
0.3  0.2  0.6
0.5  0.4  0.1

🎯

Candidate Diseases

Shortlisted by Matrix

✓ Hypertension

✓ Diabetes

+ more…
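
The shortlisting step above can be sketched as follows. This is a minimal illustration, assuming the transition matrix is estimated by counting which diagnoses follow which across consecutive visits. The function names, the toy patients, and the tie-breaking rule are hypothetical and not taken from II-KEA.

```python
from collections import defaultdict


def build_transition_matrix(patients):
    """Estimate P(disease j at next visit | disease i at current visit) by counting.

    `patients` is a list of visit sequences; each visit is a set of ICD-9 codes.
    (Assumed estimator: simple co-occurrence counts over consecutive visits.)
    """
    counts = defaultdict(lambda: defaultdict(float))
    for visits in patients:
        for current, following in zip(visits, visits[1:]):
            for i in current:
                for j in following:
                    counts[i][j] += 1.0
    matrix = {}
    for i, row in counts.items():
        total = sum(row.values())
        matrix[i] = {j: c / total for j, c in row.items()}  # each row sums to 1
    return matrix


def shortlist_candidates(matrix, last_visit, k=2):
    """Sum transition probabilities from every code in the last visit, keep the top k."""
    scores = defaultdict(float)
    for i in last_visit:
        for j, p in matrix.get(i, {}).items():
            scores[j] += p
    # Sort by descending score; ties are broken by code for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [code for code, _ in ranked[:k]]


# Toy data: 401.9 = hypertension, 250.00 = diabetes, 428.0 = heart failure, 585.9 = CKD.
patients = [
    [{"401.9"}, {"401.9", "250.00"}],
    [{"401.9"}, {"428.0"}],
    [{"250.00"}, {"250.00", "585.9"}],
]
A = build_transition_matrix(patients)
candidates = shortlist_candidates(A, {"401.9"}, k=2)
```

With this toy data, all three successors of 401.9 are equally likely, so the shortlist is decided by the tie-breaking rule alone.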

Agent 1

🤖

Knowledge Synthesis

RAG · Vector DB

"Medical knowledge related to conditions associated with…"

Agent 2

🤖

Causal Discovery

DAG · Fitting Scores

"Produce a DAG to represent causality… Output in JSON form."

🔄 Repeat w/ Fitting Scores
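
To make the causal discovery step concrete, the sketch below parses a JSON graph, checks that it is acyclic, and computes a fitting score. The JSON schema (an "edges" list of pairs) and the score (mean transition probability along the proposed edges) are assumptions chosen for illustration; the page does not specify either.

```python
import json
from collections import defaultdict, deque


def parse_dag(raw):
    """Parse the agent's JSON output into a list of (cause, effect) edges."""
    return [tuple(edge) for edge in json.loads(raw)["edges"]]


def is_acyclic(edges):
    """Kahn's algorithm: the graph is a DAG iff every node can be topologically removed."""
    nodes = {node for edge in edges for node in edge}
    indegree = {node: 0 for node in nodes}
    children = defaultdict(list)
    for cause, effect in edges:
        children[cause].append(effect)
        indegree[effect] += 1
    queue = deque(node for node in nodes if indegree[node] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return removed == len(nodes)


def fitting_score(edges, transition):
    """Hypothetical score: average transition probability over the proposed edges."""
    if not edges:
        return 0.0
    total = sum(transition.get(u, {}).get(v, 0.0) for u, v in edges)
    return total / len(edges)


# Toy agent output and toy transition probabilities (illustrative values only).
raw = '{"edges": [["401.9", "428.0"], ["250.00", "585.9"]]}'
transition = {"401.9": {"428.0": 0.4}, "250.00": {"585.9": 0.2}}

edges = parse_dag(raw)
valid = is_acyclic(edges)
score = fitting_score(edges, transition)
```

Rejecting cyclic output before scoring keeps a malformed graph from being fed back into the next round.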

Agent 3

🤖

Decision Making

Prediction · Clinician

"…the set of diseases the patient may be diagnosed in the future."

👨‍⚕️ Clinician-in-Loop

OUTPUT

📋

Diagnosis

Ranked ICD-9 Codes

💬

Explanations

Reasoning Chain

🕸️

Causal Graph

DAG.json
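
The flow from input to output can be summarized in a few lines. The sketch below only shows how the three agents might be chained; each agent is passed in as a plain callable, and the stubs stand in for LLM calls. The `Prediction` type, `run_pipeline`, and the stub outputs are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Edge = Tuple[str, str]


@dataclass
class Prediction:
    ranked_codes: List[str]   # ranked ICD-9 codes
    explanation: str          # reasoning chain shown to the clinician
    causal_graph: List[Edge]  # edges of the DAG


def run_pipeline(history, candidates, synthesize, discover, decide):
    """Chain the three agents: knowledge synthesis -> causal discovery -> decision."""
    knowledge = synthesize(history, candidates)
    graph = discover(history, candidates, knowledge)
    ranked_codes, explanation = decide(history, candidates, knowledge, graph)
    return Prediction(ranked_codes, explanation, graph)


# Stub agents standing in for LLM calls (outputs are made up for illustration).
result = run_pipeline(
    history=["401.9"],
    candidates=["428.0", "250.00"],
    synthesize=lambda h, c: ["hypertension is a risk factor for heart failure"],
    discover=lambda h, c, k: [("401.9", "428.0")],
    decide=lambda h, c, k, g: (["428.0", "250.00"], "401.9 -> 428.0 is supported."),
)
```

Passing the agents as callables keeps the orchestration testable without a live model.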

Evaluation

State-of-the-Art Performance

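II-KEA is evaluated on two real-world EHR benchmarks, MIMIC-III and MIMIC-IV. It achieves the best score on every reported metric except weighted F1 on MIMIC-IV, where GT-BEHRT leads (30.17 vs. 29.87), while providing interpretability that pure deep learning models cannot.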

MIMIC-III

Model	w-F1	R@10	R@20
RETAIN	18.37	32.12	32.54
Dipole	14.66	28.73	29.44
SeqCare	24.36	37.47	40.53
GT-BEHRT	25.21	36.15	40.97
GraphCare	25.16	36.74	41.89
DualMAR	25.37	38.24	41.86
II-KEA (Ours)	28.61	38.52	43.86

MIMIC-IV

Model	w-F1	R@10	R@20
RETAIN	23.11	37.32	40.15
Dipole	22.16	36.21	38.74
SeqCare	26.12	42.91	46.25
GT-BEHRT	30.17	44.93	50.67
GraphCare	27.59	42.07	48.19
DualMAR	27.97	44.07	48.19
II-KEA (Ours)	29.87	45.66	51.73

Results reported as average (%) over 5 runs. w-F1 = weighted F1; R@k = Recall@k.
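
For readers unfamiliar with the two metrics, here is a minimal sketch of how they are commonly computed for multi-label diagnosis prediction. It assumes Recall@k divides the hits in the top k by the number of true codes, and that weighted F1 averages per-code F1 weighted by each code's support. The exact evaluation protocol behind the tables may differ.

```python
def recall_at_k(ranked_predictions, true_codes, k):
    """Fraction of the true codes that appear among the top-k ranked predictions."""
    if not true_codes:
        return 0.0
    hits = len(set(ranked_predictions[:k]) & set(true_codes))
    return hits / len(true_codes)


def weighted_f1(true_sets, predicted_sets):
    """Per-code F1, averaged with weights equal to each code's number of true occurrences."""
    labels = set().union(*true_sets)
    total_support = sum(len(true) for true in true_sets)
    if total_support == 0:
        return 0.0
    score = 0.0
    for label in labels:
        pairs = list(zip(true_sets, predicted_sets))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        score += (tp + fn) * f1  # weight by support
    return score / total_support


# Two toy patients: true next-visit codes and predicted codes.
true_sets = [{"401.9", "250.00"}, {"428.0"}]
predicted_sets = [{"401.9"}, {"428.0", "250.00"}]

r_at_2 = recall_at_k(["401.9", "428.0", "250.00"], {"401.9", "250.00"}, k=2)
w_f1 = weighted_f1(true_sets, predicted_sets)
```

In the toy example, one of the two true codes is in the top 2, and two of the three codes are predicted perfectly while the third is missed for the patient who has it.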

🧠

Causal Transparency

Every prediction comes with a causal graph showing which prior conditions likely caused the new diagnosis.

💬

Clinician-in-the-Loop

Clinicians can add their own knowledge sources and inject comments to personalize predictions.
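
One simple way to inject clinician comments is to append them to the prompt given to the decision-making agent. The sketch below assumes exactly that; the template wording and `build_decision_prompt` are hypothetical and are not the prompts used by II-KEA.

```python
def build_decision_prompt(history, candidates, causal_edges, knowledge,
                          clinician_comments=()):
    """Assemble a plain-text prompt; clinician comments are added only when present."""
    lines = [
        "Patient diagnosis history (ICD-9): " + ", ".join(history),
        "Candidate diseases: " + ", ".join(candidates),
        "Causal edges: " + "; ".join(f"{u} -> {v}" for u, v in causal_edges),
        "Retrieved knowledge:",
    ]
    lines += [f"- {passage}" for passage in knowledge]
    if clinician_comments:
        lines.append("Clinician comments:")
        lines += [f"- {comment}" for comment in clinician_comments]
    lines.append("Rank the candidate diseases and explain the reasoning.")
    return "\n".join(lines)


# Toy inputs (illustrative only).
prompt = build_decision_prompt(
    history=["401.9"],
    candidates=["428.0", "250.00"],
    causal_edges=[("401.9", "428.0")],
    knowledge=["hypertension is a risk factor for heart failure"],
    clinician_comments=["Patient has a family history of diabetes."],
)
```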

📖

Knowledge-Grounded RAG

Retrieval-Augmented Generation (RAG) grounds each prediction in medical knowledge retrieved from the knowledge base, rather than relying on the LLM's memory alone.
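
The retrieval step can be illustrated with a toy example. The sketch below swaps the vector database and learned embeddings for bag-of-words vectors and cosine similarity, purely so that it runs without external services. The function names and the two passages are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: word counts. A real system would use a learned encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Return the k passages most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vector, embed(doc)),
                    reverse=True)
    return ranked[:k]


# Toy knowledge base with two passages.
documents = [
    "hypertension is a risk factor for heart failure and kidney disease",
    "asthma is a chronic inflammatory disease of the airways",
]
top = retrieve("conditions associated with hypertension", documents, k=1)
```

The retrieved passages would then be placed in the agent's prompt as context.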

🔄

Iterative Refinement

The causal discovery agent iteratively refines the graph, using data fitting scores as feedback, until convergence.
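
A refinement loop of this kind might look like the sketch below. It assumes convergence means the fitting score stops improving by more than a tolerance; the stopping rule, `refine`, and the stub proposer and scorer are hypothetical.

```python
def refine(propose, score, max_rounds=10, tol=1e-3):
    """Ask for a graph, score it, feed the score back, and stop when gains fall below tol."""
    best_graph, best_score = None, float("-inf")
    feedback = None
    rounds = 0
    for _ in range(max_rounds):
        rounds += 1
        graph = propose(feedback)           # stands in for a call to the LLM agent
        current = score(graph)
        improvement = current - best_score
        if current > best_score:
            best_graph, best_score = graph, current
        if improvement < tol:
            break                           # converged: no meaningful gain this round
        feedback = {"previous_graph": graph, "fitting_score": current}
    return best_graph, best_score, rounds


# Stub proposer: replays three fixed proposals. Stub scorer: a lookup by edge count.
proposals = iter([
    [("401.9", "428.0")],
    [("401.9", "428.0"), ("250.00", "585.9")],
    [("401.9", "428.0"), ("250.00", "585.9")],
])
toy_scores = {1: 0.2, 2: 0.5}

graph, best, rounds_used = refine(
    propose=lambda feedback: next(proposals),
    score=lambda g: toy_scores[len(g)],
)
```

The `max_rounds` cap bounds the number of model calls even if the score never settles.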

Future Work

What's Next

II-KEA opens a promising paradigm for interpretable and interactive clinical AI. Here's where we're headed.

📚

Richer Domain Knowledge

The current system uses Wikipedia as a proof-of-concept knowledge base. Future work will integrate more specialized medical knowledge sources to improve fine-grained diagnosis prediction.

🩻

Broader Clinical Tasks

Beyond diagnosis prediction, we plan to extend II-KEA to support treatment planning, medication recommendation, and other clinician-facing tasks.

👥

Multi-Stakeholder Collaboration

Current interactions are limited to individual clinicians. Future iterations will enable collaborative decision-making involving multiple stakeholders for holistic, patient-centered care.