Emergency departments in the United States see almost 400,000 patients each day, a volume that lengthens wait times and fuels clinician burnout. Artificial‑intelligence tools that slot into existing electronic health‑record (EHR) workflows promise relief, yet most large language models (LLMs) are too large for on‑premises use and rarely integrate with clinical systems.

A new study introduces AIDx, an integrated AI system that runs entirely on hospital infrastructure. Its core component, AIDx‑Copilot, is a Mixtral‑8x7B‑Instruct LLM fine‑tuned on de‑identified clinical notes from the MIMIC‑IV database. The training pipeline transforms each patient chart into a series of time‑stamped snapshots that preserve the chronological order of events and guard against future‑information leakage. From these snapshots, researchers generated roughly eight million question–answer pairs that mirror the kinds of clinical inquiries a physician might pose.
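The snapshot idea can be illustrated with a minimal sketch. This is not the authors' pipeline: the `ChartEvent` schema and the `build_snapshots` helper are hypothetical names chosen for illustration, and the only property shown is the one described above, namely that each snapshot contains nothing recorded after its cutoff time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChartEvent:
    """One timestamped chart entry (hypothetical schema, not from the study)."""
    time: datetime
    text: str


def build_snapshots(events: list[ChartEvent]) -> list[tuple[datetime, list[str]]]:
    """Return one (cutoff, visible_texts) snapshot per distinct timestamp.

    Each snapshot holds only events recorded at or before its cutoff, so a
    question generated from it cannot draw on later information.
    """
    ordered = sorted(events, key=lambda e: e.time)  # stable chronological sort
    snapshots: list[tuple[datetime, list[str]]] = []
    visible: list[str] = []
    i = 0
    while i < len(ordered):
        cutoff = ordered[i].time
        # Add every event that shares this timestamp before emitting the snapshot.
        while i < len(ordered) and ordered[i].time == cutoff:
            visible.append(ordered[i].text)
            i += 1
        snapshots.append((cutoff, list(visible)))  # copy so later events do not leak in
    return snapshots
```

A question–answer pair generated from a given snapshot would then be restricted to `visible_texts`, which is what keeps later chart entries out of the training example.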

To gauge performance, the authors employed the MultiMedQA benchmark suite, which aggregates nine public medical QA datasets, including USMLE‑style questions and specialty‑specific exams. The evaluation followed a deterministic, single‑pass protocol: temperature set to zero, no chain‑of‑thought or voting, and a strict prompt that required the model to return a single letter answer. Accuracy was measured by exact match with the reference label, and a Wilson 95 % confidence interval was reported for each dataset.
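For readers who want to reproduce the scoring step, the following is a minimal sketch of exact-match counting and the Wilson score interval. The function names are illustrative, and trimming whitespace around the predicted letter is an assumption; the study's exact normalization is not described here.

```python
import math


def exact_match_counts(predictions: list[str], labels: list[str]) -> tuple[int, int]:
    """Return (correct, total) under exact match on the answer letter.

    Assumption: surrounding whitespace is trimmed before comparison.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p.strip() == l.strip() for p, l in zip(predictions, labels))
    return correct, len(labels)


def wilson_interval(correct: int, total: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives 95 %)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

Unlike the simpler normal-approximation interval, the Wilson interval stays within [0, 1] and behaves sensibly for small datasets or accuracies near the extremes, which is a common reason to prefer it for per-dataset benchmark reporting.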

AIDx‑Copilot achieved a mean accuracy of 83.6 % across the nine datasets. The highest scores appeared on MMLU Professional Medicine (93.4 %) and MMLU Clinical Knowledge (90.0 %). The lowest performance was on MedMCQA (70.7 %) and MMLU Anatomy (78.1 %). The study compared four configurations: the base Mixtral model, the base model with retrieval‑augmented generation (RAG) from an open‑access medical textbook index, the fine‑tuned model without RAG, and the fine‑tuned model with RAG. Fine‑tuning alone raised mean accuracy by 17.8 percentage points over the base model, while adding RAG delivered a modest average gain of 0.4 percentage points, with larger improvements on datasets that rely on factual recall.

An error analysis of 200 randomly selected incorrect MedQA and MedMCQA items revealed four failure modes. Knowledge gaps (questions about rare diseases or recent guidelines) accounted for 41 % of errors. Reasoning errors, where the model applied incorrect logic, comprised 38 %. Question‑interpretation failures, such as misreading negation, made up 15 %, and formatting failures accounted for the remaining 6 %. The low formatting error rate suggests that the deterministic prompt reliably produces the expected output format.

Deployment metrics were measured on a dual‑RTX 4090 GPU system. The quantized model occupies 28.1 GB of VRAM. Median inference latency is 0.84 seconds per query without RAG; adding RAG increases latency to 1.47 seconds due to embedding and vector search. The RAG index size is 312 MB, and the overall system fits within the hardware commonly available to hospital IT departments. The authors also describe governance features that support on‑premises use: role‑based access control, audit logging of request metadata and retrieved passage identifiers, model and index versioning, and incident‑response procedures.
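The audit-logging feature can be pictured with a short sketch. The record layout below is hypothetical: the field names, the `AuditRecord` class, and the `to_log_line` helper are assumptions made for illustration and are not taken from the AIDx documentation. It shows only the general idea of logging request metadata, version identifiers, and retrieved passage identifiers rather than the clinical text itself.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class AuditRecord:
    """Hypothetical audit entry: metadata only, no prompt or response text."""
    user_id: str
    role: str                         # role used for the access-control decision
    model_version: str
    index_version: str
    retrieved_passage_ids: list[str]  # identifiers only, not passage content
    latency_s: float
    timestamp: str = field(default_factory=_utc_now)


def to_log_line(record: AuditRecord) -> str:
    """Serialize a record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Recording the model and index versions alongside each request is one way to let a later incident review tie a given answer to the exact model and retrieval index that produced it.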

The study does not include clinical trials or user studies, and the authors note that the results should not be interpreted as evidence of clinical effectiveness. The system is single‑modal, handling only text, and does not ingest imaging or waveform data. Future work outlined by the authors includes prospective studies of clinician workflow impact, expanded error analysis with multiple reviewers, and evaluation of the system on real clinical tasks.

In summary, AIDx‑Copilot demonstrates that a moderately sized LLM, fine‑tuned on realistic EHR data, can achieve benchmark performance comparable to larger open‑source models while remaining deployable on commodity hardware. The optional RAG component offers a small but measurable benefit for fact‑heavy questions. The authors provide detailed documentation of training procedures, evaluation protocols, and deployment architecture, enabling other institutions to replicate the approach.