ppddddpp committed on
Commit 71d4535 · verified · 1 Parent(s): 788ccb1

Update README.md

Files changed (1)
  1. README.md +47 -1
README.md CHANGED
@@ -56,6 +56,50 @@ Embeddings from both modalities are projected into a **shared joint space**, ena
56
 
57
  * Explainability: Visualize disease evidence in both image and text
58
 
59
  ## Limitations & Risks
60
  * Trained on a public dataset (Open-i) — may not generalize to other hospitals
61
 
@@ -63,8 +107,10 @@ Embeddings from both modalities are projected into a **shared joint space**, ena
63
 
64
  * Not for diagnostic use in real-world settings
65
 
 
 
66
  ## Acknowledgments
67
- * NIH Open-i Dataset
68
 
69
  * Swin Transformer (Timm)
70
 
 
56
 
57
  * Explainability: Visualize disease evidence in both image and text
58
 
59
+ ## Model Performance
60
+
61
+ ### Classification
62
+
63
+ The model was evaluated on a held-out **evaluation set** and a **separate test set** across 22 disease labels. Reported metrics are **Precision**, **Recall**, **F1-score**, and **AUROC**, each macro-averaged over the labels.
64
+
65
+ | Metric | Eval Set (Macro Avg) | Test Set (Macro Avg) |
66
+ |--------|--------------------|--------------------|
67
+ | Precision | 0.826 | 0.825 |
68
+ | Recall | 0.829 | 0.812 |
69
+ | F1-score | 0.825 | 0.800 |
70
+ | AUROC | 0.924 | 0.943 |
71
+
72
+ *The model achieves strong label-level performance on common findings such as COPD, Cardiomegaly, and Musculoskeletal degenerative diseases. Rare conditions such as Air Leak Syndromes show lower F1 scores, which may reflect class imbalance in the data.*
73
+
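The macro averages in the table above weight every label equally, regardless of how often it occurs. The following is a minimal sketch of that averaging in plain Python, not the evaluation code used for this model (which the README does not show). The function names `per_label_prf` and `macro_average` are hypothetical, and predictions are assumed to be already thresholded to 0/1.

```python
from typing import Dict, List, Sequence


def per_label_prf(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall and F1 for one binary label (values are 0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def macro_average(y_true: List[List[int]], y_pred: List[List[int]]) -> Dict[str, float]:
    """Unweighted mean of per-label metrics; rows are samples, columns are labels."""
    n_labels = len(y_true[0])
    per_label = [
        per_label_prf([row[j] for row in y_true], [row[j] for row in y_pred])
        for j in range(n_labels)
    ]
    return {
        key: sum(m[key] for m in per_label) / n_labels
        for key in ("precision", "recall", "f1")
    }


if __name__ == "__main__":
    # Toy data: 4 samples, 2 labels (illustrative only, not from Open-i).
    y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
    y_pred = [[1, 0], [0, 1], [0, 1], [0, 1]]
    print(macro_average(y_true, y_pred))
```

On this toy input the macro F1 (about 0.733) is lower than both the macro precision (about 0.833) and the macro recall (0.75). Macro F1 averages per-label F1 scores rather than combining the averaged precision and recall, which is consistent with the test-set column, where F1 (0.800) sits below both precision and recall.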
74
+ ---
75
+
76
+ ### Retrieval Performance
77
+
78
+ Retrieval was evaluated under two protocols:
79
+
80
+ | Protocol | P@5 | mAP | MRR | Avg Time (ms) |
81
+ |----------|-----|-----|-----|---------------|
82
+ | Generalization (test → test) | 0.776 | 0.0058 | 0.848 | 0.99 |
83
+ | Historical (test → train) | 0.794 | 0.0008 | 0.881 | 2.19 |
84
+
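For readers unfamiliar with the columns in the retrieval table above, here is a minimal sketch of the standard per-query definitions of P@k, reciprocal rank, and average precision. It is not the project's evaluation code; the function names and the `n_relevant_total` parameter are hypothetical. Each query yields a ranked list of booleans marking whether the retrieved case is relevant, and the table-level P@5, MRR, and mAP would be the means of these values over all queries.

```python
from typing import Sequence


def precision_at_k(relevant: Sequence[bool], k: int = 5) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(relevant[:k]) / k


def reciprocal_rank(relevant: Sequence[bool]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, is_rel in enumerate(relevant, start=1):
        if is_rel:
            return 1.0 / rank
    return 0.0


def average_precision(relevant: Sequence[bool], n_relevant_total: int) -> float:
    """Sum of precision at each relevant rank, divided by n_relevant_total.

    n_relevant_total is the number of relevant items in the whole index
    (an assumption of this sketch; some implementations divide by the
    number of relevant items actually retrieved instead).
    """
    hits = 0
    total = 0.0
    for rank, is_rel in enumerate(relevant, start=1):
        if is_rel:
            hits += 1
            total += hits / rank
    return total / n_relevant_total if n_relevant_total else 0.0


if __name__ == "__main__":
    ranked = [True, False, True, True, False]  # toy top-5 result list
    print(precision_at_k(ranked, k=5))
    print(reciprocal_rank(ranked))
    print(average_precision(ranked, n_relevant_total=3))
    print(average_precision(ranked, n_relevant_total=300))
```

With the same top-5 list, average precision is about 0.806 when 3 relevant items exist in total but only about 0.008 when 300 exist. Normalizing by a large relevant pool while scoring a short result list is one way mAP can be very small alongside a high P@5; whether that explains the mAP values in the table is not stated in the README.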
85
+ #### Retrieval Diversity
86
+
87
+ | Metric | Mean | Std. Dev | Median |
88
+ |--------|------|----------|--------|
89
+ | Retrieval Diversity Score | 0.217 | 0.041 | 0.222 |
90
+ | Retrieval Overlap IoU@5 | 0.783 | 0.041 | 0.778 |
91
+
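In the table above, the mean and median of the two rows each sum to 1 and the standard deviations match, which suggests the diversity score is the complement of the IoU@5 overlap. The sketch below illustrates that relationship under this assumption. The README does not say which two sets are compared, so the inputs here are simply two hypothetical lists of retrieved case IDs, and the function names are illustrative.

```python
from typing import Hashable, Iterable


def iou(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Intersection over union (Jaccard index) of two collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    # Convention for this sketch: two empty sets count as fully overlapping.
    return len(set_a & set_b) / len(union) if union else 1.0


def diversity(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Assumed definition: complement of the overlap."""
    return 1.0 - iou(a, b)


if __name__ == "__main__":
    top5_a = [101, 102, 103, 104, 105]  # hypothetical case IDs
    top5_b = [101, 102, 103, 104, 106]
    print(iou(top5_a, top5_b), diversity(top5_a, top5_b))
```

Two top-5 lists that share four cases have six distinct cases in their union, giving an IoU of 4/6 and a diversity of 2/6.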
92
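+ *Most top-ranked results are relevant (P@5 of 0.776 to 0.794, MRR of 0.848 to 0.881), with a mean diversity score of 0.217 against a mean IoU@5 overlap of 0.783. The retrieved cases support multimodal explanation and case-based reasoning for clinical education.*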
93
+
94
+ ---
95
+
96
+ ### Notes
97
+
98
+ - Retrieval and diversity metrics highlight the model’s ability to surface multiple relevant cases per query.
99
+ - Lower performance on some rare labels may reflect dataset imbalance in Open-i.
100
+
101
+ ---
102
+
103
  ## Limitations & Risks
104
  * Trained on a public dataset (Open-i) — may not generalize to other hospitals
105
 
 
107
 
108
  * Not for diagnostic use in real-world settings
109
 
110
+ ---
111
+
112
  ## Acknowledgments
113
+ * [NIH Open-i Dataset](https://openi.nlm.nih.gov/faq#collection)
114
 
115
  * Swin Transformer (Timm)
116