AIDx: A Locally Deployable AI Assistant Shows Strong Benchmark Performance on Medical Question-Answering Tasks

June 11, 2026 By Blab.com AI Team

U.S. emergency departments see almost 400,000 patients each day, a volume that lengthens wait times and contributes to clinician burnout. AI tools that fit into existing electronic health‑record (EHR) workflows promise relief, yet most large language models (LLMs) are too large for on‑premises deployment and rarely integrate with clinical systems.

A new study introduces AIDx, an integrated AI system that runs entirely on hospital infrastructure. Its core component, AIDx‑Copilot, is a Mixtral‑8x7B‑Instruct LLM fine‑tuned on de‑identified clinical notes from the MIMIC‑IV database. The training pipeline transforms each patient chart into a series of time‑stamped snapshots that preserve the chronological order of events and guard against future‑information leakage. From these snapshots, researchers generated roughly eight million question–answer pairs that mirror the kinds of clinical inquiries a physician might pose.
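The snapshot idea can be illustrated with a minimal Python sketch. It assumes a chart is a list of time-stamped events and that a snapshot keeps only what was charted at or before its cutoff; the `ChartEvent` class, the `build_snapshots` function, and the sample events are invented for illustration and are not the study's pipeline or MIMIC-IV's schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChartEvent:
    """One charted item; field names are illustrative, not MIMIC-IV columns."""
    time: datetime
    text: str


def build_snapshots(events, cutoffs):
    """Return one snapshot per cutoff, holding only events charted at or before it."""
    ordered = sorted(events, key=lambda e: e.time)  # keep chronological order
    snapshots = []
    for cutoff in sorted(cutoffs):
        # Excluding later events is what prevents future-information leakage.
        visible = [e for e in ordered if e.time <= cutoff]
        snapshots.append({"as_of": cutoff, "events": visible})
    return snapshots


# Synthetic example chart (not real patient data).
events = [
    ChartEvent(datetime(2026, 1, 1, 12, 0), "Cardiology consult note"),
    ChartEvent(datetime(2026, 1, 1, 8, 0), "Triage: chest pain"),
    ChartEvent(datetime(2026, 1, 1, 9, 30), "Troponin result charted"),
]
cutoffs = [datetime(2026, 1, 1, 8, 30), datetime(2026, 1, 1, 10, 0)]
snapshots = build_snapshots(events, cutoffs)

for snap in snapshots:
    print(snap["as_of"].isoformat(), [e.text for e in snap["events"]])
```

A question generated from a given snapshot could then be answered only from events visible at that point in time, which is the property the article describes.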

To gauge performance, the authors employed the MultiMedQA benchmark suite, which aggregates nine public medical QA datasets, including USMLE‑style questions and specialty‑specific exams. The evaluation followed a deterministic, single‑pass protocol: temperature set to zero, no chain‑of‑thought or voting, and a strict prompt that required the model to return a single letter answer. Accuracy was measured by exact match with the reference label, and a Wilson 95 % confidence interval was reported for each dataset.
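For readers unfamiliar with the scoring, the sketch below shows exact-match accuracy and a Wilson score interval in Python. It assumes predictions and references are single-letter strings and uses z = 1.96 for an approximately 95 % interval; the function names and the toy answers are illustrative and do not come from the study's evaluation code.

```python
import math


def exact_match_counts(predictions, references):
    """Count predictions that equal the reference letter exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct, len(references)


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95 %)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Toy answers for illustration only.
preds = ["A", "C", "B", "D"]
refs = ["A", "C", "D", "D"]
correct, total = exact_match_counts(preds, refs)
low, high = wilson_interval(correct, total)
print(f"accuracy={correct / total:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays within [0, 1] and behaves sensibly for small datasets or accuracies near 0 or 1.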

AIDx‑Copilot achieved a mean accuracy of 83.6 % across the nine datasets. The highest scores appeared on MMLU Professional Medicine (93.4 %) and MMLU Clinical Knowledge (90.0 %). The lowest performance was on MedMCQA (70.7 %) and MMLU Anatomy (78.1 %). The study compared four configurations: the base Mixtral model, the base model with retrieval‑augmented generation (RAG) from an open‑access medical textbook index, the fine‑tuned model without RAG, and the fine‑tuned model with RAG. Fine‑tuning alone raised mean accuracy by 17.8 percentage points over the base model, while adding RAG delivered a modest average gain of 0.4 percentage points, with larger improvements on datasets that rely on factual recall.

An error analysis of 200 randomly selected incorrect MedQA and MedMCQA items revealed four failure modes. Knowledge gaps (questions about rare diseases or recent guidelines) accounted for 41 % of errors. Reasoning errors, where the model applied incorrect logic, comprised 38 %. Question‑interpretation failures, such as misreading negation, made up 15 %, and formatting failures the remaining 6 %. The low formatting error rate suggests that the deterministic prompt reliably produces the expected output format.
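The reported shares can be turned back into item counts with a few lines of Python. This sketch assumes each of the 200 reviewed items was assigned exactly one failure mode, which is consistent with the percentages summing to 100 but is not stated explicitly; the variable and function names are illustrative.

```python
SAMPLE_SIZE = 200  # incorrect items reviewed, as reported

# Reported share of errors per failure mode, in percent.
SHARES_PERCENT = {
    "knowledge gap": 41,
    "reasoning error": 38,
    "question interpretation": 15,
    "formatting": 6,
}


def implied_counts(shares_percent, sample_size):
    """Convert percentage shares into item counts for a given sample size."""
    return {mode: round(sample_size * pct / 100) for mode, pct in shares_percent.items()}


counts = implied_counts(SHARES_PERCENT, SAMPLE_SIZE)
for mode, n in counts.items():
    print(f"{mode}: {n} of {SAMPLE_SIZE}")
```

Under that assumption, the formatting category corresponds to only 12 items, so any conclusion drawn from it rests on a small sample.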

Deployment metrics were measured on a dual‑RTX 4090 GPU system. The quantized model occupies 28.1 GB of VRAM. Median inference latency is 0.84 seconds per query without RAG; adding RAG increases latency to 1.47 seconds due to embedding and vector search. The RAG index size is 312 MB, and the overall system fits within the hardware commonly available to hospital IT departments. The authors also describe governance features that support on‑premises use: role‑based access control, audit logging of request metadata and retrieved passage identifiers, model and index versioning, and incident‑response procedures.
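To make the governance features more concrete, here is a hedged Python sketch of a role check and an audit record that stores request metadata and retrieved passage identifiers rather than the query text. Every name in it (the roles, the permission map, the `AuditRecord` fields, and the version strings) is hypothetical; the article does not describe the study's actual schema.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

# Hypothetical role-to-permission map; the study does not list its roles.
ROLE_PERMISSIONS = {
    "clinician": {"query"},
    "admin": {"query", "update_index"},
    "auditor": {"read_audit_log"},
}


def is_allowed(role, action):
    """Role-based access check: unknown roles get no permissions."""
    return action in ROLE_PERMISSIONS.get(role, set())


@dataclass
class AuditRecord:
    """Metadata only: no prompt or answer text is stored in this sketch."""
    request_id: str
    user_role: str
    model_version: str
    index_version: str
    passage_ids: list
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def audit_line(record):
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = AuditRecord(
    request_id="req-0001",
    user_role="clinician",
    model_version="model-example-1",   # placeholder version label
    index_version="index-example-1",   # placeholder version label
    passage_ids=["passage-17", "passage-42"],
)
if is_allowed(record.user_role, "query"):
    print(audit_line(record))
```

Recording the model and index versions alongside each request is one way the versioning and audit-logging features could work together, since it lets a reviewer reconstruct which artifacts produced a given answer.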

The study does not include clinical trials or user studies, and the authors note that the results should not be interpreted as evidence of clinical effectiveness. The system is single‑modal, handling only text, and does not ingest imaging or waveform data. Future work outlined by the authors includes prospective studies of clinician workflow impact, expanded error analysis with multiple reviewers, and evaluation of the system on real clinical tasks.

In summary, AIDx‑Copilot demonstrates that a moderately sized LLM, fine‑tuned on realistic EHR data, can achieve benchmark performance comparable to larger open‑source models while remaining deployable on commodity hardware. The optional RAG component offers a small but measurable benefit for fact‑heavy questions. The authors provide detailed documentation of training procedures, evaluation protocols, and deployment architecture, enabling other institutions to replicate the approach.
