---
license: mit
tags:
  - chest-xray
  - medical
  - multimodal
  - retrieval
  - explanation
  - clinicalbert
  - swin-transformer
  - deep-learning
  - image-text
datasets:
  - openi
language:
  - en
---

# Multimodal Chest X-ray Retrieval & Diagnosis (ClinicalBERT + Swin)

This model jointly encodes chest X-rays (DICOM) and radiology reports (XML) to:

- Predict medical conditions from multimodal input (image + text)
- Retrieve similar cases using shared disease-aware embeddings
- Provide visual explanations using attention and Integrated Gradients (IG)

> Developed as a final project at HCMUS.

---

## Model Architecture

- **Image Encoder:** Swin Transformer (pretrained, fine-tuned)
- **Text Encoder:** ClinicalBERT
- **Fusion Module:** Cross-modal attention with optional hybrid FFN layers
- **Losses:** BCE + Focal Loss for multi-label classification
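
The focal term down-weights easy, well-classified labels so that rare positives contribute more to the gradient, which matters for the long-tailed label distribution noted below. A minimal pure-Python sketch of sigmoid focal loss (the `alpha`/`gamma` defaults are illustrative, not the trained configuration, and the actual training code presumably uses a framework implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_bce(logits, targets, alpha=0.25, gamma=2.0):
    """Mean sigmoid focal loss over one multi-label example.

    logits, targets: equal-length sequences; targets are 0/1 per label.
    alpha balances positive vs. negative labels; gamma down-weights
    well-classified labels. Setting gamma=0 recovers alpha-weighted
    binary cross-entropy.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        p_t = p if y == 1 else 1.0 - p          # probability of the true class
        a_t = alpha if y == 1 else 1.0 - alpha  # class-balance weight
        total += -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
    return total / len(logits)
```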

Embeddings from both modalities are projected into a **shared joint space**, enabling retrieval and explanation.
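
Retrieval over that joint space then reduces to nearest-neighbour search by similarity. A minimal sketch, assuming cosine similarity over a precomputed embedding index (the index layout and dimensionality here are illustrative, not the project's actual code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_emb, index, k=5):
    """Return ids of the k cases whose joint embeddings are closest
    to the query, most similar first.

    index: {case_id: embedding} built offline from the shared space.
    """
    ranked = sorted(index, key=lambda cid: cosine(query_emb, index[cid]),
                    reverse=True)
    return ranked[:k]
```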

---

## Training Data

- **Dataset:** [NIH Open-i Chest X-ray Dataset](https://openi.nlm.nih.gov/)
- **Input Modalities:**
  - Chest X-ray DICOMs
  - Associated XML radiology reports
- **Labels:** MeSH-derived disease categories (multi-label)
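
Each Open-i report is an XML record whose MeSH major headings carry the disease terms; a sketch of pulling multi-label targets out of one record (the tag names follow the public Open-i `eCitation` format but should be verified against the actual files, and splitting off the `/` qualifier is an assumption about how terms are normalized):

```python
import xml.etree.ElementTree as ET

SAMPLE = """<eCitation>
  <MeSH>
    <major>Cardiomegaly/mild</major>
    <major>Pulmonary Edema</major>
  </MeSH>
</eCitation>"""

def mesh_labels(xml_text):
    """Extract MeSH major headings; text before '/' is the disease term,
    anything after it is a qualifier (e.g. severity or location)."""
    root = ET.fromstring(xml_text)
    return [m.text.split("/")[0] for m in root.iter("major")]
```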

---

## Intended Uses

- **Clinical education:** case-similarity search for radiology students
- **Research:** baseline for multimodal medical retrieval
- **Explainability:** visualizing disease evidence in both image and text

## Model Performance

### Classification

The model was evaluated on a held-out **evaluation set** and a **separate test set** across 22 disease labels, reporting macro-averaged **precision**, **recall**, **F1-score**, and **AUROC**.

| Metric | Eval Set (Macro Avg) | Test Set (Macro Avg) |
|--------|--------------------|--------------------|
| Precision | 0.826 | 0.825 |
| Recall    | 0.829 | 0.812 |
| F1-score  | 0.825 | 0.800 |
| AUROC     | 0.924 | 0.943 |

*The model achieves strong label-level performance, particularly on common findings such as COPD, Cardiomegaly, and Musculoskeletal degenerative diseases. Rare conditions such as Air Leak Syndromes show lower F1 scores, reflecting data imbalance.*

---

### Retrieval Performance

Retrieval was evaluated under two protocols:

| Protocol | P@5 | mAP | MRR | Avg Time (ms) |
|----------|-----|-----|-----|---------------|
| Generalization (test → test) | 0.776 | 0.0058 | 0.848 | 0.99 |
| Historical (test → train)    | 0.794 | 0.0008 | 0.881 | 2.19 |
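
For reference, the three ranking metrics in the table admit standard definitions; the project may use different conventions (e.g. how relevance is derived from shared labels), so treat this as a sketch:

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved ids that are relevant (P@k)."""
    return sum(1 for cid in retrieved[:k] if cid in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant hit, 0 if none; averaging over
    queries gives MRR."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(retrieved, relevant):
    """Precision accumulated at each relevant hit; averaging over
    queries gives mAP."""
    hits, total = 0, 0.0
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```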

#### Retrieval Diversity

| Metric | Mean | Std. Dev | Median |
|--------|------|----------|--------|
| Retrieval Diversity Score | 0.217 | 0.041 | 0.222 |
| Retrieval Overlap IoU@5    | 0.783 | 0.041 | 0.778 |
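
IoU@5 measures how much two top-5 result sets overlap. Since the reported mean diversity and IoU sum to one, diversity appears to be computed as 1 − IoU@5, though that definition is an inference, not documented here. A sketch:

```python
def iou_at_k(results_a, results_b, k=5):
    """Jaccard overlap between the top-k ids of two retrieval runs."""
    a, b = set(results_a[:k]), set(results_b[:k])
    return len(a & b) / len(a | b)
```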

*The model retrieves diverse and relevant cases, enabling multimodal explanation and case-based reasoning for clinical education.*

---

### Notes

- Retrieval and diversity metrics highlight the model’s ability to surface multiple relevant cases per query.
- Lower performance on some rare labels may reflect dataset imbalance in Open-i.

---

## Limitations & Risks

- Trained on a single public dataset (Open-i); may not generalize to other hospitals
- Explanations are not clinically validated
- Not for diagnostic use in real-world settings

---

## Acknowledgments

- [NIH Open-i Dataset](https://openi.nlm.nih.gov/faq#collection)
- Swin Transformer (via `timm`)
- ClinicalBERT (Emily Alsentzer et al.)
- Captum (for IG explanations)
- Grad-CAM

## Code

Source code: [GitHub](https://github.com/ppddddpp/multi-modal-retrieval-predict-project)