Ensemble Deep Learning Derived from Transfer Learning for Classification of COVID-19 Patients on Hybrid Deep-Learning-Based Lung Segmentation: A Data Augmentation and Balancing Framework

Dubey, Arun Kumar; Chabert, Gian Luca; Carriero, Alessandro; Pasche, Alessio; Danna, Pietro S C; Agarwal, Sushant; Mohanty, Lopamudra; Nillmani; Sharma, Neeraj; Yadav, Sarita; Jain, Achin; Kumar, Ashish; Kalra, Mannudeep K; Sobel, David W; Laird, John R; Singh, Inder M; Singh, Narpinder; Tsoulfas, George; Fouda, Mostafa M; Alizad, Azra; Kitas, George D; Khanna, Narendra N; Viskovic, Klaudija; Kukuljan, Melita; Al-Maini, Mustafa; El-Baz, Ayman; Saba, Luca; Suri, Jasjit S

doi:10.3390/diagnostics13111954

Background and motivation: Lung computed tomography (CT) offers high resolution and is widely adopted in the intensive care unit (ICU) for COVID-19 versus control classification. Most artificial intelligence (AI) systems are not evaluated for generalization and are typically overfitted. Such systems are impractical in clinical settings because they do not give accurate results on unseen data sets. We hypothesize that ensemble deep learning (EDL) is superior to deep transfer learning (TL) in both non-augmented and augmented frameworks.

Methodology: The system is a cascade of quality control, ResNet-UNet-based hybrid deep learning for lung segmentation, and seven TL-based classification models followed by five types of EDLs. To test our hypothesis, five data combinations (DC) were designed from two multicenter cohorts, Croatia (80 COVID) and Italy (72 COVID and 30 controls), yielding 12,000 CT slices. To assess generalization, the system was tested on unseen data and statistically tested for reliability and stability.

Results: Using the K5 (80:20) cross-validation protocol on the balanced and augmented dataset, the five DC datasets improved TL mean accuracy by 3.32%, 6.56%, 12.96%, 47.1%, and 2.78%, respectively. The five EDL systems showed accuracy improvements of 2.12%, 5.78%, 6.72%, 32.05%, and 2.40%, thus validating our hypothesis. All statistical tests were positive for reliability and stability.

Conclusion: EDL showed superior performance to TL systems for both (a) unbalanced and unaugmented and (b) balanced and augmented datasets, in both (i) seen and (ii) unseen paradigms, validating both our hypotheses.
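The abstract does not say how the TL classifiers are combined into each EDL. As an illustration only, the sketch below assumes soft voting, in which the per-class probabilities of several models are averaged before taking the arg max. The function name `soft_vote` and the toy probability arrays are hypothetical and are not taken from the paper.

```python
import numpy as np


def soft_vote(prob_list):
    """Average per-model class probabilities (soft voting).

    prob_list: list of arrays, each of shape (n_samples, n_classes).
    Returns (predicted_labels, mean_probabilities).
    Assumption: soft voting is a stand-in; the paper's actual EDL
    combination rule is not given in the abstract.
    """
    stacked = np.stack(prob_list, axis=0)   # (n_models, n_samples, n_classes)
    mean_probs = stacked.mean(axis=0)       # (n_samples, n_classes)
    return mean_probs.argmax(axis=1), mean_probs


# Toy outputs of three hypothetical TL models for three CT slices,
# with columns ordered as [control, COVID].
model_a = np.array([[0.90, 0.10], [0.40, 0.60], [0.55, 0.45]])
model_b = np.array([[0.80, 0.20], [0.30, 0.70], [0.40, 0.60]])
model_c = np.array([[0.60, 0.40], [0.60, 0.40], [0.35, 0.65]])

labels, probs = soft_vote([model_a, model_b, model_c])
print(labels)
```

On the third slice, `model_a` alone would predict control, while the averaged probabilities favor COVID. This is the kind of disagreement an ensemble is meant to smooth out.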