Update README.md

README.md

* Explainability: Visualize disease evidence in both image and text

## Model Performance

### Classification

The model was evaluated on a held-out **evaluation set** and a **separate test set** across 22 disease labels, reporting macro-averaged **Precision**, **Recall**, **F1-score**, and **AUROC**.

| Metric    | Eval Set (Macro Avg) | Test Set (Macro Avg) |
|-----------|----------------------|----------------------|
| Precision | 0.826                | 0.825                |
| Recall    | 0.829                | 0.812                |
| F1-score  | 0.825                | 0.800                |
| AUROC     | 0.924                | 0.943                |

*The model achieves strong label-level performance, particularly on common findings such as COPD, Cardiomegaly, and Musculoskeletal degenerative diseases. Rare conditions such as Air Leak Syndromes show lower F1 scores, reflecting data imbalance.*
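
For reference, here is a minimal sketch of how macro-averaged multi-label metrics like those above can be computed with scikit-learn. The random arrays and the 0.5 decision threshold are illustrative assumptions, not the repository's actual evaluation pipeline.

```python
# Minimal sketch of the macro-averaged metrics above using scikit-learn.
# The random arrays and the 0.5 decision threshold are illustrative
# assumptions, not this repository's actual evaluation code.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(500, 22))  # multi-hot labels, 22 diseases
y_prob = rng.random(size=(500, 22))          # per-label sigmoid outputs
y_pred = (y_prob >= 0.5).astype(int)         # assumed decision threshold

prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
auroc = roc_auc_score(y_true, y_prob, average="macro")
print(f"Prec {prec:.3f} | Rec {rec:.3f} | F1 {f1:.3f} | AUROC {auroc:.3f}")
```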

---

### Retrieval Performance

Retrieval was evaluated under two protocols: **Generalization**, where test queries search the test set itself, and **Historical**, where test queries search the training corpus.

| Protocol | P@5 | mAP | MRR | Avg Time (ms) |
|----------|-----|-----|-----|---------------|
| Generalization (test → test) | 0.776 | 0.0058 | 0.848 | 0.99 |
| Historical (test → train) | 0.794 | 0.0008 | 0.881 | 2.19 |
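
A minimal sketch of the ranking metrics in the table (P@5 and MRR) follows. The ranked IDs and the notion of relevance (label overlap with the query) are assumptions; the project's actual joint-embedding search is not shown, and mAP over the full ranked gallery is omitted for brevity.

```python
# Minimal sketch of P@5 and MRR as reported above. Relevance here means
# sharing a label with the query, which is an assumption.
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved cases that are relevant to the query."""
    return sum(1 for i in ranked_ids[:k] if i in relevant_ids) / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant case, or 0.0 if none is retrieved."""
    for rank, i in enumerate(ranked_ids, start=1):
        if i in relevant_ids:
            return 1.0 / rank
    return 0.0

# Toy example: case IDs ranked by similarity in the shared joint space.
ranked = [14, 3, 99, 7, 42, 8]
relevant = {3, 7, 8}
print(precision_at_k(ranked, relevant))   # 0.4
print(reciprocal_rank(ranked, relevant))  # 0.5
```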

#### Retrieval Diversity

| Metric | Mean | Std. Dev. | Median |
|--------|------|-----------|--------|
| Retrieval Diversity Score | 0.217 | 0.041 | 0.222 |
| Retrieval Overlap IoU@5 | 0.783 | 0.041 | 0.778 |

*The model retrieves diverse and relevant cases, enabling multimodal explanation and case-based reasoning for clinical education.*
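
A minimal sketch of the diversity metrics above: the table's means sum to 1.0, which suggests the diversity score is computed as 1 - IoU@5 between two top-5 result sets. That relationship, and the pairing of runs being compared (e.g., image query vs. text query for the same case), are assumptions on this sketch's part.

```python
# Minimal sketch of the diversity metrics above. Diversity = 1 - IoU@5
# and the image/text run pairing are assumptions, not confirmed internals.
def iou_at_k(ids_a, ids_b, k=5):
    """Jaccard overlap between two top-k retrieval result sets."""
    a, b = set(ids_a[:k]), set(ids_b[:k])
    return len(a & b) / len(a | b)

image_top5 = [3, 7, 8, 14, 42]
text_top5 = [3, 7, 8, 21, 42]
overlap = iou_at_k(image_top5, text_top5)  # 4 shared / 6 unique = 0.667
print(f"IoU@5 = {overlap:.3f}, diversity = {1 - overlap:.3f}")
```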

---

### Notes

- Retrieval and diversity metrics highlight the model’s ability to surface multiple relevant cases per query.
- Lower performance on some rare labels may reflect dataset imbalance in Open-i.

---

## Limitations & Risks

* Trained on a public dataset (Open-i) — may not generalize to other hospitals
* Not for diagnostic use in real-world settings

---

## Acknowledgments

* [NIH Open-i Dataset](https://openi.nlm.nih.gov/faq#collection)
* Swin Transformer (Timm)