AI-powered surveillance of bronchiolitis in the Nirsevimab era: comparative performance of machine learning, deep learning, and large language models on free-text ED records

Costa, Marianna; Khan, Mohd Rashid; Vedovelli, Luca; Gregori, Dario; Azzolina, Danila; Bressan, Silvia

doi:10.1186/s12873-025-01468-6

Background: Bronchiolitis remains a frequent reason for hospitalization in infants during the winter season. Epidemiologic surveillance remains crucial in the era of widespread immunoprophylaxis against the leading viral agent causing bronchiolitis. We investigated the performance of classical machine learning (ML) models, deep learning (DL), and a pre-trained large language model (LLM) in classifying bronchiolitis diagnoses from the free-text diagnosis field of emergency department electronic health records (EHRs). As a secondary aim, we evaluated the diagnostic accuracy of the official administrative ICD-9 coding for bronchiolitis diagnosis.
Methods: 28,557 records of infants younger than 1 year with complete discharge diagnosis fields were retrieved for the years 2007–2018 and manually classified by an expert pediatrician to create the gold standard diagnosis set for training the algorithms. After data pre-processing, classical ML models (Random Forest, Decision Tree, Gradient Boosting Machine, Linear Discriminant Analysis, Support Vector Machine), a DL tool, and a pre-trained LLM (GPT-5) were evaluated using balanced accuracy, sensitivity, and F1 scores. The classification accuracy of the official administrative ICD-9 coding was compared to the gold standard.
Results: Overall, 1,903 of 28,557 records (6.7%) were classified as bronchiolitis by the gold standard approach. The DL model and GPT-5 outperformed traditional ML models, achieving higher sensitivities (0.97, 95% CI 0.96–1.00, and 0.98, 95% CI 0.98–0.99, respectively), F1 scores (0.96, 95% CI 0.95–0.99, and 0.99, 95% CI 0.98–0.99, respectively), and balanced accuracies (0.98, 95% CI 0.98–1.00, and 0.99, 95% CI 0.99–0.99, respectively). Traditional ML models showed sensitivities between 0.77 and 0.98, F1 scores between 0.86 and 0.96, and balanced accuracies between 0.88 and 0.96. ICD-9 codes showed a sensitivity of 85.9% (95% CI 84.27–87.45) and a specificity of 98.5% (95% CI 98.36–98.65).
Conclusion: To our knowledge, this is the first study directly comparing an LLM, deep learning, and multiple classical ML models for bronchiolitis surveillance in the post-Nirsevimab era. DL and GPT-5 outperformed both traditional ML-based tools and official ICD-9 diagnosis coding in identifying bronchiolitis diagnoses. AI-based tools hold significant potential for improving epidemiologic surveillance of bronchiolitis from emergency department EHRs. Clinical trial number: Not applicable.
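
For readers less familiar with the evaluation metrics named in the abstract (sensitivity, specificity, balanced accuracy, and F1 score), the following is a minimal Python sketch of how they are conventionally computed from a binary confusion matrix. It is not the authors' code: the function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: positive (e.g. bronchiolitis) records correctly/incorrectly classified.
    tn/fp: negative records correctly/incorrectly classified.
    """
    sensitivity = tp / (tp + fn)   # recall: share of true cases detected
    specificity = tn / (tn + fp)   # share of non-cases correctly excluded
    precision = tp / (tp + fp)     # share of flagged records that are true cases
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
    }


# Hypothetical counts for illustration only (NOT data from the study).
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy averages sensitivity and specificity, which makes it less sensitive than plain accuracy to class imbalance such as the 6.7% prevalence of bronchiolitis reported above.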

AI-powered surveillance of bronchiolitis in the Nirsevimab era: comparative performance of machine learning, deep learning, and large language models on free-text ED records / Costa, M., Khan, M.R., Vedovelli, L., Gregori, D., Azzolina, D., Bressan, S. - In: BMC EMERGENCY MEDICINE. - ISSN 1471-227X. - 26:1(2026). [10.1186/s12873-025-01468-6]