---
license: mit
tags:
- chest-xray
- medical
- multimodal
- retrieval
- explanation
- clinicalbert
- swin-transformer
- deep-learning
- image-text
datasets:
- openi
language:
- en
---

# Multimodal Chest X-ray Retrieval & Diagnosis (ClinicalBERT + Swin)

This model jointly encodes chest X-rays (DICOM) and radiology reports (XML) to:

- Predict medical conditions from multimodal input (image + text)
- Retrieve similar cases using shared disease-aware embeddings
- Provide visual explanations using attention and Integrated Gradients (IG)

> Developed as a final project at HCMUS.

---

## Model Architecture

- **Image Encoder:** Swin Transformer (pretrained, fine-tuned)
- **Text Encoder:** ClinicalBERT
- **Fusion Module:** Cross-modal attention with optional hybrid FFN layers
- **Losses:** BCE + Focal Loss for multi-label classification
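
The focal term in the loss bullet above can be sketched as follows. This is a minimal pure-Python illustration of the standard focal-loss formulation; the `alpha` and `gamma` defaults are common placeholder values, not the project's tuned hyperparameters:

```python
import math

def focal_bce(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss for a single label: down-weights easy, well-classified examples.

    p is the predicted probability (after sigmoid), y the 0/1 ground truth.
    gamma=0 removes the focusing term, leaving alpha-weighted cross-entropy.
    """
    p = min(max(p, eps), 1.0 - eps)          # clamp for numerical stability
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    a_t = alpha if y == 1 else 1.0 - alpha   # class-balance weight
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

def multilabel_focal(probs, targets, **kw):
    """Mean focal loss over the label vector of one example (multi-label)."""
    return sum(focal_bce(p, y, **kw) for p, y in zip(probs, targets)) / len(probs)
```

In training this term would be combined with plain BCE, as the **Losses** bullet describes.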

Embeddings from both modalities are projected into a **shared joint space**, enabling retrieval and explanation.
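
Once image and text embeddings live in one space, retrieval reduces to cosine nearest-neighbour search. A minimal sketch, with toy 2-D vectors standing in for the fused embeddings:

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def retrieve(query, gallery, k=5):
    """Return indices of the k gallery embeddings most similar to the query."""
    q = normalize(query)
    sims = [sum(a * b for a, b in zip(q, normalize(g))) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)[:k]

# Toy gallery: index 0 points the same way as the query, index 1 is orthogonal.
gallery = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top = retrieve([2.0, 0.1], gallery, k=2)  # → [0, 2]
```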

---

## Training Data

- **Dataset:** [NIH Open-i Chest X-ray Dataset](https://openi.nlm.nih.gov/)
- **Input Modalities:**
  - Chest X-ray DICOMs
  - Associated XML radiology reports
- **Labels:** MeSH-derived disease categories (multi-label)
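
Deriving multi-label targets from the reports can be sketched as below. The `<MeSH><major>` layout is an assumption about the Open-i report XML, and `mesh_to_multihot` is an illustrative helper, not the project's actual parser:

```python
import xml.etree.ElementTree as ET

# Toy report in the spirit of an Open-i XML file; the <MeSH><major> layout
# is an assumption here, not a guarantee about the real format.
REPORT_XML = """
<eCitation>
  <MeSH>
    <major>Cardiomegaly</major>
    <major>Pulmonary Edema</major>
  </MeSH>
</eCitation>
"""

LABELS = ["Cardiomegaly", "Pulmonary Edema", "Pneumothorax"]

def mesh_to_multihot(xml_text, labels):
    """Extract MeSH major headings and binarize them into a multi-label vector."""
    root = ET.fromstring(xml_text)
    terms = {m.text.strip() for m in root.iter("major") if m.text}
    return [1 if lab in terms else 0 for lab in labels]

vec = mesh_to_multihot(REPORT_XML, LABELS)  # → [1, 1, 0]
```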

---

## Intended Uses

* Clinical Education: Case similarity search for radiology students
* Research: Baseline for multimodal medical retrieval
* Explainability: Visualize disease evidence in both image and text
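
The explainability use case relies on Integrated Gradients (the project uses Captum for this). The underlying computation can be sketched framework-free on a toy quadratic model; in the real pipeline the gradient would come from the multimodal network with respect to a predicted disease label:

```python
def integrated_gradients(grad_fn, x, baseline=None, steps=100):
    """Riemann-sum (midpoint) approximation of Integrated Gradients.

    attribution_i = (x_i - b_i) * ∫₀¹ ∂f/∂x_i(b + a·(x − b)) da
    grad_fn(point) must return the gradient of the scalar model output.
    """
    b = baseline or [0.0] * len(x)
    attr = [0.0] * len(x)
    for s in range(steps):
        a = (s + 0.5) / steps                                  # midpoint rule
        point = [bi + a * (xi - bi) for xi, bi in zip(x, b)]
        g = grad_fn(point)
        for i in range(len(x)):
            attr[i] += (x[i] - b[i]) * g[i] / steps
    return attr

# Toy "model": f(x) = x0^2 + x1^2, with gradient (2*x0, 2*x1).
grad = lambda p: [2.0 * p[0], 2.0 * p[1]]
attr = integrated_gradients(grad, [3.0, 4.0])
# Completeness axiom: attributions sum to f(x) - f(baseline) = 25.
```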

## Model Performance

### Classification

The model was evaluated on a held-out **evaluation set** and a **separate test set** across 22 disease labels. Reported metrics are **Precision**, **Recall**, **F1-score**, and **AUROC**.

| Metric    | Eval Set (Macro Avg) | Test Set (Macro Avg) |
|-----------|----------------------|----------------------|
| Precision | 0.826                | 0.825                |
| Recall    | 0.829                | 0.812                |
| F1-score  | 0.825                | 0.800                |
| AUROC     | 0.924                | 0.943                |

*The model achieves strong label-level performance, particularly on common findings such as COPD, cardiomegaly, and musculoskeletal degenerative diseases. Rare conditions such as air leak syndromes show lower F1 scores, reflecting data imbalance.*

---

### Retrieval Performance

Retrieval was evaluated under two protocols:

| Protocol | P@5 | mAP | MRR | Avg Time (ms) |
|----------|-----|-----|-----|---------------|
| Generalization (test → test) | 0.776 | 0.0058 | 0.848 | 0.99 |
| Historical (test → train) | 0.794 | 0.0008 | 0.881 | 2.19 |
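
For reference, the per-query quantities behind the table (P@5; MRR is the mean of reciprocal ranks; mAP is the mean of average precisions) can be computed as sketched below on toy data:

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for r in retrieved[:k] if r in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result (0 if none is retrieved)."""
    for rank, r in enumerate(retrieved, start=1):
        if r in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(retrieved, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, r in enumerate(retrieved, start=1):
        if r in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Toy ranking for one query: relevant cases appear at ranks 1, 3, and 5.
retrieved = ["b", "x", "a", "y", "c"]
relevant = {"a", "b", "c"}
```

Averaging these values over all queries gives P@5, MRR, and mAP respectively.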

#### Retrieval Diversity

| Metric | Mean | Std. Dev. | Median |
|--------|------|-----------|--------|
| Retrieval Diversity Score | 0.217 | 0.041 | 0.222 |
| Retrieval Overlap IoU@5   | 0.783 | 0.041 | 0.778 |
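
IoU@5 measures how much two top-5 result lists overlap; the reported means (0.217 ≈ 1 − 0.783) suggest the diversity score is its complement, though that reading is an assumption. A minimal sketch:

```python
def overlap_iou(list_a, list_b, k=5):
    """Jaccard overlap of two top-k retrieval lists (IoU@k).

    High IoU means the two queries surface nearly the same cases; a diversity
    score can then be taken as the complement, 1 - IoU (an assumption here).
    """
    a, b = set(list_a[:k]), set(list_b[:k])
    return len(a & b) / len(a | b)

# Toy top-5 lists for two similar queries sharing four of five cases.
iou = overlap_iou([1, 2, 3, 4, 5], [1, 2, 3, 4, 9])  # → 4/6
```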

*The model retrieves diverse and relevant cases, enabling multimodal explanation and case-based reasoning for clinical education.*

---

### Notes

- Retrieval and diversity metrics highlight the model’s ability to surface multiple relevant cases per query.
- Lower performance on some rare labels may reflect dataset imbalance in Open-i.

---

## Limitations & Risks

* Trained on a single public dataset (Open-i) and may not generalize to other hospitals
* Explanations are not clinically validated
* Not for diagnostic use in real-world settings

---

## Acknowledgments

* [NIH Open-i Dataset](https://openi.nlm.nih.gov/faq#collection)
* Swin Transformer (timm)
* ClinicalBERT (Emily Alsentzer)
* Captum (for IG explanations)
* Grad-CAM

## Code

Source code: [GitHub](https://github.com/ppddddpp/multi-modal-retrieval-predict-project)