---
license: mit
tags:
- chest-xray
- medical
- multimodal
- retrieval
- explanation
- clinicalbert
- swin-transformer
- deep-learning
- image-text
datasets:
- openi
language:
- en
---

# Multimodal Chest X-ray Retrieval & Diagnosis (ClinicalBERT + Swin)

This model jointly encodes chest X-rays (DICOM) and radiology reports (XML) to:

- Predict medical conditions from multimodal input (image + text)
- Retrieve similar cases using shared disease-aware embeddings
- Provide visual explanations using attention and Integrated Gradients (IG)

> Developed as a final project at HCMUS.

---

## Model Architecture

- **Image Encoder:** Swin Transformer (pretrained, fine-tuned)
- **Text Encoder:** ClinicalBERT
- **Fusion Module:** Cross-modal attention with optional hybrid FFN layers
- **Losses:** BCE + Focal Loss for multi-label classification
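
The focal term in the loss bullet above can be sketched as follows. This is a minimal pure-Python illustration of the standard focal-loss formulation; the `alpha` and `gamma` defaults are common placeholder values, not the project's tuned hyperparameters:

```python
import math

def focal_bce(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss for a single label: down-weights easy, well-classified examples.

    p is the predicted probability (after sigmoid), y the 0/1 ground truth.
    gamma=0 removes the focusing term, leaving alpha-weighted cross-entropy.
    """
    p = min(max(p, eps), 1.0 - eps)          # clamp for numerical stability
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    a_t = alpha if y == 1 else 1.0 - alpha   # class-balance weight
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

def multilabel_focal(probs, targets, **kw):
    """Mean focal loss over the label vector of one example (multi-label)."""
    return sum(focal_bce(p, y, **kw) for p, y in zip(probs, targets)) / len(probs)
```

In training this term would be combined with plain BCE, as the **Losses** bullet describes.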

Embeddings from both modalities are projected into a **shared joint space**, enabling retrieval and explanation.
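
Once image and text embeddings live in one space, retrieval reduces to cosine nearest-neighbour search. A minimal sketch, with toy 2-D vectors standing in for the fused embeddings:

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def retrieve(query, gallery, k=5):
    """Return indices of the k gallery embeddings most similar to the query."""
    q = normalize(query)
    sims = [sum(a * b for a, b in zip(q, normalize(g))) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)[:k]

# Toy gallery: index 0 points the same way as the query, index 1 is orthogonal.
gallery = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top = retrieve([2.0, 0.1], gallery, k=2)  # → [0, 2]
```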

---

## Training Data

- **Dataset:** [NIH Open-i Chest X-ray Dataset](https://openi.nlm.nih.gov/)
- **Input Modalities:**
  - Chest X-ray DICOMs
  - Associated XML radiology reports
- **Labels:** MeSH-derived disease categories (multi-label)
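
Deriving multi-label targets from the reports can be sketched as below. The `<MeSH><major>` layout is an assumption about the Open-i report XML, and `mesh_to_multihot` is an illustrative helper, not the project's actual parser:

```python
import xml.etree.ElementTree as ET

# Toy report in the spirit of an Open-i XML file; the <MeSH><major> layout
# is an assumption here, not a guarantee about the real format.
REPORT_XML = """
<eCitation>
  <MeSH>
    <major>Cardiomegaly</major>
    <major>Pulmonary Edema</major>
  </MeSH>
</eCitation>
"""

LABELS = ["Cardiomegaly", "Pulmonary Edema", "Pneumothorax"]

def mesh_to_multihot(xml_text, labels):
    """Extract MeSH major headings and binarize them into a multi-label vector."""
    root = ET.fromstring(xml_text)
    terms = {m.text.strip() for m in root.iter("major") if m.text}
    return [1 if lab in terms else 0 for lab in labels]

vec = mesh_to_multihot(REPORT_XML, LABELS)  # → [1, 1, 0]
```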

---

## Intended Uses

* Clinical Education: Case similarity search for radiology students
* Research: Baseline for multimodal medical retrieval
* Explainability: Visualize disease evidence in both image and text
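
The explainability use case relies on Integrated Gradients (the project uses Captum for this). The underlying computation can be sketched framework-free on a toy quadratic model; in the real pipeline the gradient would come from the multimodal network with respect to a predicted disease label:

```python
def integrated_gradients(grad_fn, x, baseline=None, steps=100):
    """Riemann-sum (midpoint) approximation of Integrated Gradients.

    attribution_i = (x_i - b_i) * ∫₀¹ ∂f/∂x_i(b + a·(x − b)) da
    grad_fn(point) must return the gradient of the scalar model output.
    """
    b = baseline or [0.0] * len(x)
    attr = [0.0] * len(x)
    for s in range(steps):
        a = (s + 0.5) / steps                                  # midpoint rule
        point = [bi + a * (xi - bi) for xi, bi in zip(x, b)]
        g = grad_fn(point)
        for i in range(len(x)):
            attr[i] += (x[i] - b[i]) * g[i] / steps
    return attr

# Toy "model": f(x) = x0^2 + x1^2, with gradient (2*x0, 2*x1).
grad = lambda p: [2.0 * p[0], 2.0 * p[1]]
attr = integrated_gradients(grad, [3.0, 4.0])
# Completeness axiom: attributions sum to f(x) - f(baseline) = 25.
```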

## Model Performance

### Classification

The model was evaluated on a held-out **evaluation set** and a **separate test set** across 22 disease labels. Reported metrics are **Precision**, **Recall**, **F1-score**, and **AUROC**.

| Metric    | Eval Set (Macro Avg) | Test Set (Macro Avg) |
|-----------|----------------------|----------------------|
| Precision | 0.826                | 0.825                |
| Recall    | 0.829                | 0.812                |
| F1-score  | 0.825                | 0.800                |
| AUROC     | 0.924                | 0.943                |

*The model achieves strong label-level performance, particularly on common findings such as COPD, cardiomegaly, and musculoskeletal degenerative diseases. Rare conditions such as air leak syndromes show lower F1 scores, reflecting data imbalance.*

---

### Retrieval Performance

Retrieval was evaluated under two protocols:

| Protocol | P@5 | mAP | MRR | Avg Time (ms) |
|----------|-----|-----|-----|---------------|
| Generalization (test → test) | 0.776 | 0.0058 | 0.848 | 0.99 |
| Historical (test → train) | 0.794 | 0.0008 | 0.881 | 2.19 |
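
For reference, the per-query quantities behind the table (P@5; MRR is the mean of reciprocal ranks; mAP is the mean of average precisions) can be computed as sketched below on toy data:

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for r in retrieved[:k] if r in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result (0 if none is retrieved)."""
    for rank, r in enumerate(retrieved, start=1):
        if r in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(retrieved, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, r in enumerate(retrieved, start=1):
        if r in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Toy ranking for one query: relevant cases appear at ranks 1, 3, and 5.
retrieved = ["b", "x", "a", "y", "c"]
relevant = {"a", "b", "c"}
```

Averaging these values over all queries gives P@5, MRR, and mAP respectively.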

#### Retrieval Diversity

| Metric | Mean | Std. Dev. | Median |
|--------|------|-----------|--------|
| Retrieval Diversity Score | 0.217 | 0.041 | 0.222 |
| Retrieval Overlap IoU@5   | 0.783 | 0.041 | 0.778 |
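
IoU@5 measures how much two top-5 result lists overlap; the reported means (0.217 ≈ 1 − 0.783) suggest the diversity score is its complement, though that reading is an assumption. A minimal sketch:

```python
def overlap_iou(list_a, list_b, k=5):
    """Jaccard overlap of two top-k retrieval lists (IoU@k).

    High IoU means the two queries surface nearly the same cases; a diversity
    score can then be taken as the complement, 1 - IoU (an assumption here).
    """
    a, b = set(list_a[:k]), set(list_b[:k])
    return len(a & b) / len(a | b)

# Toy top-5 lists for two similar queries sharing four of five cases.
iou = overlap_iou([1, 2, 3, 4, 5], [1, 2, 3, 4, 9])  # → 4/6
```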

*The model retrieves diverse and relevant cases, enabling multimodal explanation and case-based reasoning for clinical education.*

---

### Notes

- Retrieval and diversity metrics highlight the model’s ability to surface multiple relevant cases per query.
- Lower performance on some rare labels may reflect dataset imbalance in Open-i.

---

## Limitations & Risks

* Trained on a single public dataset (Open-i) and may not generalize to other hospitals
* Explanations are not clinically validated
* Not for diagnostic use in real-world settings

---

## Acknowledgments

* [NIH Open-i Dataset](https://openi.nlm.nih.gov/faq#collection)
* Swin Transformer (timm)
* ClinicalBERT (Emily Alsentzer)
* Captum (for IG explanations)
* Grad-CAM

## Code

Source code: [GitHub](https://github.com/ppddddpp/multi-modal-retrieval-predict-project)