---
license: mit
tags:
  - chest-xray
  - medical
  - multimodal
  - retrieval
  - explanation
  - clinicalbert
  - swin-transformer
  - deep-learning
  - image-text
datasets:
  - openi
language:
  - en
---

# Multimodal Chest X-ray Retrieval & Diagnosis (ClinicalBERT + Swin)

This model jointly encodes chest X-rays (DICOM) and radiology reports (XML) to:

- Predict medical conditions from multimodal input (image + text)
- Retrieve similar cases using shared disease-aware embeddings
- Provide visual explanations using attention and Integrated Gradients (IG)

> Developed as a final project at HCMUS.

---

## Model Architecture

- **Image Encoder:** Swin Transformer (pretrained, fine-tuned)
- **Text Encoder:** ClinicalBERT
- **Fusion Module:** Cross-modal attention with optional hybrid FFN layers
- **Losses:** BCE + Focal Loss for multi-label classification
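
The focal term down-weights easy, well-classified labels so that rare positives contribute more to the gradient, which matters for the long-tailed label distribution noted below. A minimal pure-Python sketch of sigmoid focal loss (the `alpha`/`gamma` defaults are illustrative, not the trained configuration, and the actual training code presumably uses a framework implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_bce(logits, targets, alpha=0.25, gamma=2.0):
    """Mean sigmoid focal loss over one multi-label example.

    logits, targets: equal-length sequences; targets are 0/1 per label.
    alpha balances positive vs. negative labels; gamma down-weights
    well-classified labels. Setting gamma=0 recovers alpha-weighted
    binary cross-entropy.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        p_t = p if y == 1 else 1.0 - p          # probability of the true class
        a_t = alpha if y == 1 else 1.0 - alpha  # class-balance weight
        total += -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
    return total / len(logits)
```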

Embeddings from both modalities are projected into a **shared joint space**, enabling retrieval and explanation.
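
Retrieval over that joint space then reduces to nearest-neighbour search by similarity. A minimal sketch, assuming cosine similarity over a precomputed embedding index (the index layout and dimensionality here are illustrative, not the project's actual code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_emb, index, k=5):
    """Return ids of the k cases whose joint embeddings are closest
    to the query, most similar first.

    index: {case_id: embedding} built offline from the shared space.
    """
    ranked = sorted(index, key=lambda cid: cosine(query_emb, index[cid]),
                    reverse=True)
    return ranked[:k]
```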

---

## Training Data

- **Dataset:** [NIH Open-i Chest X-ray Dataset](https://openi.nlm.nih.gov/)
- **Input Modalities:**
  - Chest X-ray DICOMs
  - Associated XML radiology reports
- **Labels:** MeSH-derived disease categories (multi-label)
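
Each Open-i report is an XML record whose MeSH major headings carry the disease terms; a sketch of pulling multi-label targets out of one record (the tag names follow the public Open-i `eCitation` format but should be verified against the actual files, and splitting off the `/` qualifier is an assumption about how terms are normalized):

```python
import xml.etree.ElementTree as ET

SAMPLE = """<eCitation>
  <MeSH>
    <major>Cardiomegaly/mild</major>
    <major>Pulmonary Edema</major>
  </MeSH>
</eCitation>"""

def mesh_labels(xml_text):
    """Extract MeSH major headings; text before '/' is the disease term,
    anything after it is a qualifier (e.g. severity or location)."""
    root = ET.fromstring(xml_text)
    return [m.text.split("/")[0] for m in root.iter("major")]
```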

---

## Intended Uses

- **Clinical education:** case-similarity search for radiology students
- **Research:** baseline for multimodal medical retrieval
- **Explainability:** visualizing disease evidence in both image and text

## Model Performance

### Classification

The model was evaluated on a held-out **evaluation set** and a **separate test set** across 22 disease labels, reporting macro-averaged **precision**, **recall**, **F1-score**, and **AUROC**.

| Metric | Eval Set (Macro Avg) | Test Set (Macro Avg) |
|--------|--------------------|--------------------|
| Precision | 0.826 | 0.825 |
| Recall    | 0.829 | 0.812 |
| F1-score  | 0.825 | 0.800 |
| AUROC     | 0.924 | 0.943 |

*The model achieves strong label-level performance, particularly on common findings such as COPD, Cardiomegaly, and Musculoskeletal degenerative diseases. Rare conditions such as Air Leak Syndromes show lower F1 scores, reflecting data imbalance.*

---

### Retrieval Performance

Retrieval was evaluated under two protocols:

| Protocol | P@5 | mAP | MRR | Avg Time (ms) |
|----------|-----|-----|-----|---------------|
| Generalization (test → test) | 0.776 | 0.0058 | 0.848 | 0.99 |
| Historical (test → train)    | 0.794 | 0.0008 | 0.881 | 2.19 |
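
For reference, the three ranking metrics in the table admit standard definitions; the project may use different conventions (e.g. how relevance is derived from shared labels), so treat this as a sketch:

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved ids that are relevant (P@k)."""
    return sum(1 for cid in retrieved[:k] if cid in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant hit, 0 if none; averaging over
    queries gives MRR."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(retrieved, relevant):
    """Precision accumulated at each relevant hit; averaging over
    queries gives mAP."""
    hits, total = 0, 0.0
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```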

#### Retrieval Diversity

| Metric | Mean | Std. Dev | Median |
|--------|------|----------|--------|
| Retrieval Diversity Score | 0.217 | 0.041 | 0.222 |
| Retrieval Overlap IoU@5    | 0.783 | 0.041 | 0.778 |
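
IoU@5 measures how much two top-5 result sets overlap. Since the reported mean diversity and IoU sum to one, diversity appears to be computed as 1 − IoU@5, though that definition is an inference, not documented here. A sketch:

```python
def iou_at_k(results_a, results_b, k=5):
    """Jaccard overlap between the top-k ids of two retrieval runs."""
    a, b = set(results_a[:k]), set(results_b[:k])
    return len(a & b) / len(a | b)
```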

*The model retrieves diverse and relevant cases, enabling multimodal explanation and case-based reasoning for clinical education.*

---

### Notes

- Retrieval and diversity metrics highlight the model’s ability to surface multiple relevant cases per query.
- Lower performance on some rare labels may reflect dataset imbalance in Open-i.

---

## Limitations & Risks

- Trained on a single public dataset (Open-i); may not generalize to other hospitals
- Explanations are not clinically validated
- Not for diagnostic use in real-world settings

---

## Acknowledgments

- [NIH Open-i Dataset](https://openi.nlm.nih.gov/faq#collection)
- Swin Transformer (via `timm`)
- ClinicalBERT (Emily Alsentzer et al.)
- Captum (for IG explanations)
- Grad-CAM

## Code

Source code: [GitHub](https://github.com/ppddddpp/multi-modal-retrieval-predict-project)