Using a large language model as a third reviewer to augment dual human full-text screening in orthopaedic systematic reviews

Journal

Knee Surg Sports Traumatol Arthrosc

PMID

42159215

DOI

10.1002/ksa.70462

Abstract

Large language models (LLMs) are a form of artificial intelligence (AI) that have emerged as potential tools to augment systematic review workflows. This study aimed to evaluate GPT-5 as a third reviewer for full-text screening across orthopaedic subspecialties. Three review topics were selected. Python scripts were developed to call the GPT-5 model via the OpenAI application programming interface (API) and perform full-text screening using predefined inclusion and exclusion criteria. Two human reviewers simultaneously performed screening based on the same criteria. Performance metrics for GPT-5, namely sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), and F1 score, were calculated against a gold-standard inclusion and exclusion list developed by a third human adjudicator. Efficiency metrics included total cost and completion time. The numbers of full texts screened were 35, 70 and 146 for the three review topics. For topic one, sensitivity, specificity, PPV, NPV, accuracy and F1 score were 100% each. For topic two, sensitivity, specificity, PPV, NPV, accuracy and F1 score were 93.3%, 98.2%, 93.3%, 98.2%, 97.1% and 93.3%, respectively. For topic three, sensitivity, specificity, PPV, NPV, accuracy and F1 score were 93.3%, 100%, 100%, 99.2%, 99.3% and 96.7%, respectively. Time to completion ranged from 18.1 to 58 min. Cost ranged from $0.84 to $3.29 USD. GPT-5 demonstrated high diagnostic accuracy as a third reviewer for full-text screening across three different subspecialties, with high agreement with final consensus adjudication decisions. These findings suggest that modern LLMs can potentially augment dual-review screening workflows by providing efficient decision support while preserving methodological rigour. However, the small number of included studies within each topic resulted in wide confidence intervals, and additional validation across larger datasets is necessary. Level of evidence: Not applicable.
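The six performance metrics named in the abstract all follow from a 2x2 table of model decisions against the gold-standard list. The Python sketch below illustrates that arithmetic; it is not the authors' script. The class and function names are hypothetical, and the counts used for topic two (14 true positives, 1 false positive, 1 false negative, 54 true negatives) are an inference that happens to be consistent with the reported percentages for 70 full texts. The abstract does not state the counts themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Model decisions versus the gold standard (hypothetical helper type)."""
    tp: int  # included by the model and by the gold standard
    fp: int  # included by the model, excluded by the gold standard
    fn: int  # excluded by the model, included by the gold standard
    tn: int  # excluded by the model and by the gold standard


def screening_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return the six metrics as fractions in [0, 1].

    No guard against empty rows or columns: a topic with no gold-standard
    inclusions, for example, would raise ZeroDivisionError.
    """
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "accuracy": (c.tp + c.tn) / total,
        "f1": 2 * c.tp / (2 * c.tp + c.fp + c.fn),
    }


# Assumed counts for topic two (inferred, not reported in the abstract).
topic_two = ConfusionCounts(tp=14, fp=1, fn=1, tn=54)
metrics = screening_metrics(topic_two)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With only about 15 gold-standard inclusions in a topic, a single missed study moves sensitivity by almost 7 percentage points, which fits the abstract's caution about wide confidence intervals.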


Vancouver citation

Vivekanantha P, Son H, Bernardini L, Bouchard MD, Ayeni OR, Kay J. Using a large language model as a third reviewer to augment dual human full-text screening in orthopaedic systematic reviews. Knee Surg Sports Traumatol Arthrosc. 2026 May. doi:10.1002/ksa.70462. PMID: 42159215.

Metadata sourced from the U.S. National Library of Medicine (PubMed). OrthoGlobe curates but does not host the full-text article.
