Introduction:
Artificial intelligence (AI) holds immense promise as a diagnostic tool for clinicians. Here, we examine the utility of convolutional neural network models for the dermoscopic diagnosis of melanoma.
Methods:
We compared two convolutional neural network models, CNN-1 and SMARTI, both trained on a dataset that included images from European and American sources; SMARTI had additionally been pre-trained on an Australian dataset. Dermoscopic images were collected from a prospectively recruited cohort of 210 lesions (from 191 patients) biopsied due to suspicion of melanoma. Each biopsy was diagnosed by five pathologists, and these diagnoses were compared with those of the two AI models.
Results:
CNN-1 yielded an area under the receiver operating characteristic curve of 0.682, while SMARTI yielded 0.725. CNN-1 had a specificity of 0.35 (95% confidence interval (CI) 0.27-0.45) and a sensitivity of 0.91 (CI 0.84-0.96), whereas SMARTI demonstrated a specificity of 0.26 (CI 0.19-0.35) at a sensitivity of 0.95 (CI 0.88-0.98). Inter-rater agreement among pathologists was higher for lesions correctly classified by SMARTI (Fleiss’ kappa 0.788) than for lesions misclassified by SMARTI (Fleiss’ kappa 0.406). Thus, lesions misclassified by the AI model were also divisive for pathologists.
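For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of the standard Fleiss' kappa formula. It is not the authors' analysis code; the function name `fleiss_kappa` and the example rating counts are hypothetical and chosen only for illustration. Each row holds, for one lesion, the number of raters who chose each diagnostic category, and every lesion is assumed to have the same number of raters.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j (illustrative helper)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of raters")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance

    if p_e == 1.0:
        # All ratings fall in one category; kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical data: five raters, two categories (melanoma, not melanoma).
unanimous = [[5, 0], [0, 5]]   # raters always agree
split = [[3, 2], [2, 3]]       # raters split 3 to 2 on every lesion
kappa_unanimous = fleiss_kappa(unanimous)
kappa_split = fleiss_kappa(split)
```

With these made-up tables, unanimous ratings give a kappa of 1, while consistently split ratings give a value below zero, meaning agreement no better than chance.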
Conclusion:
These results demonstrate the impact of population-relevant training data on the performance of dermoscopy AI. Lesions that were incorrectly diagnosed by AI also provoked disagreement among independent pathologists, which highlights the importance of incorporating multiple diagnosticians when establishing ground truth for training datasets.