Commit 6661c10 · committed by jo-mengr · verified · 1 Parent(s): 28bcb58

Add new SentenceTransformer model

0_MMContextEncoder/config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "text_encoder_name": "Qwen/Qwen3-Embedding-0.6B",
+   "adapter_hidden_dim": null,
+   "adapter_output_dim": null,
+   "freeze_text_encoder": true,
+   "unfreeze_last_n_layers": 1,
+   "registered_data_origin": "unregistered",
+   "registered_input_dim": null,
+   "output_token_embeddings": false,
+   "train_lookup": false,
+   "pooling_mode": "mean",
+   "joint_adapter_hidden_dim": null,
+   "_joint_adapter_was_trained": false,
+   "max_seq_length": 32768,
+   "text_model_kwargs": {}
+ }
0_MMContextEncoder/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2c1be0c7d4862f2cc3e43bfebf6bee6d4e45085b9b7f399fb0aa5870e559f7e0
+ size 2383143480
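
The three added lines above are a Git LFS pointer, not the weights themselves: each line is a key followed by a value. As a minimal sketch (the helper name `parse_lfs_pointer` is hypothetical and not part of any library), the pointer can be parsed like this to recover the hash algorithm and the payload size:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the space-separated key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from the diff above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:2c1be0c7d4862f2cc3e43bfebf6bee6d4e45085b9b7f399fb0aa5870e559f7e0
size 2383143480
"""

info = parse_lfs_pointer(POINTER)
size_gib = info["size_bytes"] / 2**30
print(f"{info['hash_algo']} digest, {size_gib:.2f} GiB")
```

At roughly 4 bytes per parameter, this size would be in line with a 0.6B-parameter encoder stored in 32-bit floats, although the diff itself does not state the dtype.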
README.md ADDED
@@ -0,0 +1,719 @@
1
+ ---
2
+ language:
3
+ - code
4
+ tags:
5
+ - sentence-transformers
6
+ - sentence-similarity
7
+ - feature-extraction
8
+ - dense
9
+ - generated_from_trainer
10
+ - dataset_size:197351
11
+ - loss:MultipleNegativesRankingLoss
12
+ base_model: Qwen/Qwen3-Embedding-0.6B
13
+ widget:
14
+ - source_sentence: ABCB7
15
+ sentences:
16
+ - This gene encodes a tetrameric mitochondrial flavoprotein, which is a member of
17
+ the acyl-CoA dehydrogenase family. This enzyme catalyzes the initial step of the
18
+ mitochondrial fatty acid beta-oxidation pathway. Mutations in this gene have been
19
+ associated with short-chain acyl-CoA dehydrogenase (SCAD) deficiency. Alternative
20
+ splicing results in two variants which encode different isoforms. [provided by
21
+ RefSeq, Oct 2014]
22
+ - The membrane-associated protein encoded by this gene is a member of the superfamily
23
+ of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules
24
+ across extra- and intra-cellular membranes. ABC genes are divided into seven distinct
25
+ subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This protein is a member
26
+ of the MDR/TAP subfamily. Members of the MDR/TAP subfamily are involved in multidrug
27
+ resistance as well as antigen presentation. This gene encodes a half-transporter
28
+ involved in the transport of heme from the mitochondria to the cytosol. With iron/sulfur
29
+ cluster precursors as its substrates, this protein may play a role in metal homeostasis.
30
+ Mutations in this gene have been associated with mitochondrial iron accumulation
31
+ and isodicentric (X)(q13) and sideroblastic anemia. Alternatively spliced transcript
32
+ variants encoding multiple isoforms have been observed for this gene. [provided
33
+ by RefSeq, Nov 2012]
34
+ - The membrane-associated protein encoded by this gene is a member of the superfamily
35
+ of ATP-binding cassette (ABC) transporters. ABC proteins transport various molecules
36
+ across extra- and intracellular membranes. ABC genes are divided into seven distinct
37
+ subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, and White). This encoded protein
38
+ is a member of the ABC1 subfamily. Members of the ABC1 subfamily comprise the
39
+ only major ABC subfamily found exclusively in multicellular eukaryotes. This gene
40
+ is clustered among 4 other ABC1 family members on 17q24, but neither the substrate
41
+ nor the function of this gene is known. Alternative splicing of this gene results
42
+ in several transcript variants; however, not all variants have been fully described.
43
+ [provided by RefSeq, Jul 2008]
44
+ - source_sentence: ABCC8
45
+ sentences:
46
+ - The protein encoded by this gene is a member of the superfamily of ATP-binding
47
+ cassette (ABC) transporters. ABC proteins transport various molecules across extra-
48
+ and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies
49
+ (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This protein is a member of the
50
+ MRP subfamily which is involved in multi-drug resistance. This protein functions
51
+ as a modulator of ATP-sensitive potassium channels and insulin release. Mutations
52
+ in the ABCC8 gene and deficiencies in the encoded protein have been observed in
53
+ patients with hyperinsulinemic hypoglycemia of infancy, an autosomal recessive
54
+ disorder of unregulated and high insulin secretion. Mutations have also been associated
55
+ with non-insulin-dependent diabetes mellitus type II, an autosomal dominant disease
56
+ of defective insulin secretion. Alternatively spliced transcript variants have
57
+ been found for this gene. [provided by RefSeq, Jul 2020]
58
+ - Predicted to enable GTPase activator activity and zinc ion binding activity. Predicted
59
+ to be involved in protein transport. Located in membrane. [provided by Alliance
60
+ of Genome Resources, Jul 2025]
61
+ - The protein encoded by this gene is a member of the superfamily of ATP-binding
62
+ cassette (ABC) transporters. ABC proteins transport various molecules across extra-
63
+ and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies
64
+ (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This ABC full transporter is a
65
+ member of the MRP subfamily which is involved in multi-drug resistance. The product
66
+ of this gene participates in physiological processes involving bile acids, conjugated
67
+ steroids, and cyclic nucleotides. In addition, a SNP in this gene is responsible
68
+ for determination of human earwax type. This gene and family member ABCC12 are
69
+ determined to be derived by duplication and are both localized to chromosome 16q12.1.
70
+ Multiple alternatively spliced transcript variants have been described for this
71
+ gene. [provided by RefSeq, Jul 2008]
72
+ - source_sentence: MALAT1 TMSB4X ACTB TPT1 EEF1A1 S100A10 LGALS1 VIM SH3BGRL3 S100A4
73
+ FTL PTMA SRGN TMSB10 CYBA GAPDH CD74 TAGLN2 FTH1 S100A6 UBA52 YBX1 MYL6 OAZ1 CST3
74
+ NACA FAU ARPC2 GSTP1 PFN1 HSP90AA1 COTL1 PPIA ARPC3 UQCRB MYL12A CD63 EIF1 NEAT1
75
+ RACK1 MACROH2A1 ATP6V0E1 ATP5F1E SRP14 ENO1 SLC25A3 CTSH PRDX1 VAMP8 COX4I1 CAP1
76
+ BTF3 DBI HNRNPA3 GNAS DDX5 H3-3B TPM3 LAPTM5 ZEB2 GNG5 FLNA CALM1 CD44
77
+ sentences:
78
+ - MALAT1 PTMA TMSB10 LGALS1 ACTB PRDX1 S100A4 H3-3B TMSB4X VIM TPT1 LMO4 HNRNPA2B1
79
+ SH3BGRL3 TAGLN2 HNRNPU DDIT4 PFN1 IGFBP7 HMGB1 FTH1 CFL1 CD74 SOX4 KLF2 BST2 S100A11
80
+ RACK1 PSMA4 DDX5 NCL RSRP1 IRF1 SERF2 EEF1A1 CALM1 UBA52 CYBA HSP90AA1 MYL12A
81
+ AHNAK ITM2B SRP14 EMP3 CALM2 TSC22D3 YWHAZ SELENOW PPIA S100A6 TSPO IRAG2 TPM3
82
+ UBC ARPC2 HNRNPA3 UBB EIF1 JUN IFITM2 PRR13 N4BP2L2 LAPTM4A CDC42
83
+ - This measurement was conducted with 10x 3' v3. This sample is derived from a 3-month-old
84
+ male patient with KMT2A-rearranged (KMT2A-r) infant acute lymphoblastic leukemia
85
+ (ALL) with a CD8_Cytotoxic T cell type, specifically T/NK cells, and a presumed
86
+ MLL-AF4 fusion.
87
+ - This measurement was conducted with 10x 3' v3. Blast cells derived from a 1-month-old
88
+ human with a presumed MLL-AF10 fusion, projected as cDC-like cells.
89
+ - source_sentence: MALAT1 CXCL14 EEF1A1 VIM IGFBP7 COL1A2 FTH1 TPT1 S100A6 TMSB4X
90
+ A2M APOE DCN PTGDS TMSB10 LGALS1 ACTB FBLN1 FTL RARRES2 CD81 CALD1 CD63 COL6A2
91
+ MYL6 SPARCL1 NEAT1 IGFBP5 PTMA CST3 FAU SERF2 SPARC IFITM3 EIF1 S100A4 NACA JUND
92
+ COL6A1 GSN C1S CFH HSP90AA1 PDLIM1 H3-3B EDIL3 UBA52 VCAN LTBP4 TIMP3 CTSC ITM2B
93
+ IGFBP4 UBC UBB RACK1 TIMP1 ACTA2 ZFP36L2 PLPP3 TUBA1A FILIP1L FOS S100A10
94
+ sentences:
95
+ - MALAT1 TMSB10 A2M FABP5 PTMA VIM ACTB CAV1 SPARCL1 CD74 EEF1A1 KLF2 IFITM3 CLDN5
96
+ TMSB4X TPT1 ENPP2 TM4SF1 FOS EIF1 S100A6 CALM1 CD81 HES1 SRGN ID1 GNG11 IGFBP4
97
+ STOM GSN TAGLN2 IGFBP7 CD320 FTH1 MCAM HSP90AA1 GNAS MYL6 TIMP3 EPAS1 TNFSF10
98
+ PODXL ITM2B SRP14 UBC TGFBR2 KCTD12 GIMAP7 UBA52 RHOA CD59 FTL PCSK5 MYH9 MYL12A
99
+ FLT1 CXCL12 LIFR TUBA1B DSTN ARPC1B JUND H3-3B TMBIM6
100
+ - This measurement was conducted with 10x 3' v3. Fibroblasts derived from the terminal
101
+ ileum of a female individual in her fourth decade, exhibiting Crohn's disease
102
+ (CD) related changes.
103
+ - This measurement was conducted with 10x 3' v3. Glial cells derived from the ileal
104
+ epithelium of a female in her fourth decade.
105
+ - source_sentence: MALAT1 DCN MGP APOD GSN LAMA2 CST3 SPARCL1 IGFBP7 TIMP1 VIM EEF1A1
106
+ ITM2B FBLN1 C3 IFITM3 FBN1 FTH1 TPT1 ABCA8 C1S TXNIP FTL TIMP3 FN1 CD63 RBMS3
107
+ ABCA6 ZBTB20 CEBPD NEAT1 CFH VCAN PTN PTGDS CD81 SERF2 COL6A1 COL6A2 ABI3BP ABCA10
108
+ EBF1 COL1A2 PRKG1 S100A6 MGST1 TMSB10 TIMP2 CELF2 LAPTM4A RORA ACTB LTBP4 MYL6
109
+ LGALS1 DDX5 SPTBN1 EFEMP1 BICC1 LRP1 H3-3B SCN7A IGFBP4 FAU
110
+ sentences:
111
+ - This measurement was conducted with 10x 3' v3. CD4+T naive lymphocyte cells derived
112
+ from the right cardiac atrium of a European male in his sixties.
113
+ - This measurement was conducted with 10x multiome. Fibroblast cell sample taken
114
+ from the right ventricle of a European female donor in her fifth decade, who is
115
+ a DCD donor. The sample is in nucleus form.
116
+ - MALAT1 NEAT1 LINC00486 SLC8A1 VMP1 SAT1 PIK3R5 DIRC3 FMN1 PMP22 RBM47 AGFG1 DIP2B
117
+ RBMS1 GNAQ TBC1D14 RAB1A ARHGAP24 DAPK1 SLC1A3 RHOQ SH3BGRL DOCK10 SLCO2B1 RUNX1
118
+ ENOX2 LDLRAD4 RNF150 PIAS1 DDX5 WSB1 TSHZ3 SBF2 DOCK2 LRP4 DENND4C FCHSD2 EXOC6B
119
+ AFF3 ARHGAP26 DIAPH2 MGAT5 TMEM163 NSMCE2 RBPJ ZEB2 TANC2 BPTF SH3RF3 MFSD14CP
120
+ TCF4 RORA-AS1 NOP58 MEF2A EPN2 PICALM ARHGAP15 MEF2C ANKRD12 FCGRT DOCK8 SETX
121
+ TBC1D9 KLHL2
122
+ datasets:
123
+ - jo-mengr/cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation
124
+ - jo-mengr/descriptions_genes
125
+ pipeline_tag: sentence-similarity
126
+ library_name: sentence-transformers
127
+ metrics:
128
+ - cosine_accuracy
129
+ model-index:
130
+ - name: SentenceTransformer based on Qwen/Qwen3-Embedding-0.6B
131
+ results:
132
+ - task:
133
+ type: triplet
134
+ name: Triplet
135
+ dataset:
136
+ name: cellxgene pseudo bulk 100k multiplets natural language annotation cell
137
+ sentence 2
138
+ type: cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation_cell_sentence_2
139
+ metrics:
140
+ - type: cosine_accuracy
141
+ value: 0.8204416632652283
142
+ name: Cosine Accuracy
143
+ - task:
144
+ type: triplet
145
+ name: Triplet
146
+ dataset:
147
+ name: gene description
148
+ type: gene_description
149
+ metrics:
150
+ - type: cosine_accuracy
151
+ value: 0.9559999704360962
152
+ name: Cosine Accuracy
153
+ ---
154
+
155
+ # SentenceTransformer based on Qwen/Qwen3-Embedding-0.6B
156
+
157
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) on the [cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation) and [gene_description](https://huggingface.co/datasets/jo-mengr/descriptions_genes) datasets. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
158
+
159
+ ## Model Details
160
+
161
+ ### Model Description
162
+ - **Model Type:** Sentence Transformer
163
+ - **Base model:** [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) <!-- at revision c54f2e6e80b2d7b7de06f51cec4959f6b3e03418 -->
164
+ - **Maximum Sequence Length:** 32768 tokens
165
+ - **Output Dimensionality:** 1024 dimensions
166
+ - **Similarity Function:** Cosine Similarity
167
+ - **Training Datasets:**
168
+ - [cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation)
169
+ - [gene_description](https://huggingface.co/datasets/jo-mengr/descriptions_genes)
170
+ - **Language:** code
171
+ <!-- - **License:** Unknown -->
172
+
173
+ ### Model Sources
174
+
175
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
176
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
177
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
178
+
179
+ ### Full Model Architecture
180
+
+ ```
+ SentenceTransformer(
+   (0): MMContextEncoder(
+     (text_encoder): Qwen3Model(
+       (embed_tokens): Embedding(151669, 1024)
+       (layers): ModuleList(
+         (0-27): 28 x Qwen3DecoderLayer(
+           (self_attn): Qwen3Attention(
+             (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
+             (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
+             (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
+             (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
+             (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
+             (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
+           )
+           (mlp): Qwen3MLP(
+             (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
+             (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
+             (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
+             (act_fn): SiLU()
+           )
+           (input_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
+           (post_attention_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
+         )
+       )
+       (norm): Qwen3RMSNorm((1024,), eps=1e-06)
+       (rotary_emb): Qwen3RotaryEmbedding()
+     )
+     (pooling): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   )
+ )
+ ```
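
The `Pooling` entry above enables only `pooling_mode_mean_tokens`, so a sentence vector is an average over token vectors. The following is a minimal pure-Python sketch of the usual masked mean. The function name `mean_pool` and the toy inputs are illustrative, and it is an assumption that `MMContextEncoder` handles padding in exactly this way:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    count = max(count, 1)  # guard against an all-padding sequence
    return [t / count for t in total]


# Three token vectors; the last one is padding and is masked out.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

In the real model each token vector has 1024 entries (`word_embedding_dimension`), and the pooled result has the same length.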
213
+
214
+ ## Usage
215
+
216
+ ### Direct Usage (Sentence Transformers)
217
+
218
+ First install the Sentence Transformers library:
219
+
220
+ ```bash
221
+ pip install -U sentence-transformers
222
+ ```
223
+
224
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("jo-mengr/mmcontext-qwen-scvi_fm")
+ # Run inference
+ sentences = [
+     'MALAT1 DCN MGP APOD GSN LAMA2 CST3 SPARCL1 IGFBP7 TIMP1 VIM EEF1A1 ITM2B FBLN1 C3 IFITM3 FBN1 FTH1 TPT1 ABCA8 C1S TXNIP FTL TIMP3 FN1 CD63 RBMS3 ABCA6 ZBTB20 CEBPD NEAT1 CFH VCAN PTN PTGDS CD81 SERF2 COL6A1 COL6A2 ABI3BP ABCA10 EBF1 COL1A2 PRKG1 S100A6 MGST1 TMSB10 TIMP2 CELF2 LAPTM4A RORA ACTB LTBP4 MYL6 LGALS1 DDX5 SPTBN1 EFEMP1 BICC1 LRP1 H3-3B SCN7A IGFBP4 FAU',
+     'This measurement was conducted with 10x multiome. Fibroblast cell sample taken from the right ventricle of a European female donor in her fifth decade, who is a DCD donor. The sample is in nucleus form.',
+     "This measurement was conducted with 10x 3' v3. CD4+T naive lymphocyte cells derived from the right cardiac atrium of a European male in his sixties.",
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # (3, 1024)
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities)
+ # tensor([[1.0000, 0.6280, 0.0951],
+ #         [0.6280, 1.0000, 0.2002],
+ #         [0.0951, 0.2002, 1.0000]])
+ ```
247
+
248
+ <!--
249
+ ### Direct Usage (Transformers)
250
+
251
+ <details><summary>Click to see the direct usage in Transformers</summary>
252
+
253
+ </details>
254
+ -->
255
+
256
+ <!--
257
+ ### Downstream Usage (Sentence Transformers)
258
+
259
+ You can finetune this model on your own dataset.
260
+
261
+ <details><summary>Click to expand</summary>
262
+
263
+ </details>
264
+ -->
265
+
266
+ <!--
267
+ ### Out-of-Scope Use
268
+
269
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
270
+ -->
271
+
272
+ ## Evaluation
273
+
274
+ ### Metrics
275
+
276
+ #### Triplet
277
+
278
+ * Datasets: `cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation_cell_sentence_2` and `gene_description`
279
+ * Evaluated with [<code>TripletEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)
280
+
281
+ | Metric | cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation_cell_sentence_2 | gene_description |
282
+ |:--------------------|:----------------------------------------------------------------------------------|:-----------------|
283
+ | **cosine_accuracy** | **0.8204** | **0.956** |
284
+
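
For reference, `cosine_accuracy` on a triplet task is the fraction of (anchor, positive, negative) triplets in which the anchor is more cosine-similar to the positive than to the negative. The sketch below is a pure-Python illustration on toy 2-D vectors, not the `TripletEvaluator` implementation. The function names and data are made up for the example:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_cosine_accuracy(triplets):
    """Share of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    hits = sum(1 for a, p, n in triplets if cosine(a, p) > cosine(a, n))
    return hits / len(triplets)


toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # positive is closer: counted as a hit
    ([1.0, 0.0], [0.0, 1.0], [1.0, 0.1]),  # negative is closer: counted as a miss
]
accuracy = triplet_cosine_accuracy(toy_triplets)
print(accuracy)
```

Read this way, the table reports that about 82% of the cell-sentence triplets and about 96% of the gene-description triplets were ranked correctly.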
285
+ <!--
286
+ ## Bias, Risks and Limitations
287
+
288
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
289
+ -->
290
+
291
+ <!--
292
+ ### Recommendations
293
+
294
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
295
+ -->
296
+
297
+ ## Training Details
298
+
299
+ ### Training Datasets
300
+
301
+ #### cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation
302
+
303
+ * Dataset: [cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation) at [d518eb2](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation/tree/d518eb24af305653b43acd9e26f9502632059e7c)
304
+ * Size: 81,143 training samples
305
+ * Columns: <code>anchor</code>, <code>positive</code>, <code>negative_1</code>, and <code>negative_2</code>
306
+ * Approximate statistics based on the first 1000 samples:
307
+ | | anchor | positive | negative_1 | negative_2 |
308
+ |:--------|:--------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------|
309
+ | type | string | string | string | string |
310
+ | details | <ul><li>min: 356 characters</li><li>mean: 385.24 characters</li><li>max: 450 characters</li></ul> | <ul><li>min: 92 characters</li><li>mean: 216.13 characters</li><li>max: 900 characters</li></ul> | <ul><li>min: 103 characters</li><li>mean: 212.72 characters</li><li>max: 1186 characters</li></ul> | <ul><li>min: 353 characters</li><li>mean: 384.82 characters</li><li>max: 433 characters</li></ul> |
311
+ * Samples:
312
+ | anchor | positive | negative_1 | negative_2 |
313
+ |:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
314
+ | <code>TMSB4X TMSB10 ACTB MALAT1 GNLY NKG7 IFITM2 LGALS1 GZMA EEF1A1 PFN1 HMGB2 FTH1 PTMA HSP90AA1 GZMB ARHGDIB HNRNPA2B1 PLAAT4 FAU CMC1 VIM MYL12A CBX3 ATP5F1E HCST IFI44L KLRF1 H3-3A COX6C ARL6IP1 CFL1 ISG15 HMGB1 S100A4 ATP5MF RORA MYL6 CORO1A OAZ1 KLRB1 ID2 HMGN3 CCNI RBM39 CAP1 SERF2 ELOC FCER1G S100A9 IFI16 YWHAZ EIF1 CALR HMGN2 SKAP2 SLC25A5 ZZZ3 YBX1 NUCB2 CDC42 GSTP1 FTL ATP5F1D</code> | <code>This measurement was conducted with 10x 3' v2. A proliferating lymphocyte cell sample, obtained from a 34-year-old female Asian individual, derived from peripheral blood mononuclear cells.</code> | <code>This measurement was conducted with 10x 3' v2. Sample is a CD8-positive, alpha-beta T cell derived from a 31-year-old Asian female's peripheral blood mononuclear cells.</code> | <code>MALAT1 TMSB4X EEF1A1 TMSB10 FAU TPT1 PTMA EIF1 UBA52 ACTB FTH1 RACK1 FTL H3-3B JUNB ATP5F1E BTG1 CD52 NACA MYL12A PFN1 COX7C COX4I1 SERF2 UQCRB TOMM7 IL32 YBX1 PABPC1 MYL6 EIF3E OAZ1 NOP53 ARHGDIB LDHB HCST SARAF ITM2B ATP6V1G1 SRP14 UBC H3-3A COX6C HINT1 UBB COMMD6 S100A4 S100A6 CALM1 VIM CYBA ENO1 HSP90AA1 FXYD5 HSP90AB1 CIRBP SRSF5 NFKBIA CORO1A LEPROTL1 TLE5 CHCHD2 DDX5 CD69</code> |
315
+ | <code>EEF1A1 MALAT1 FTH1 JUNB TPT1 FOS TMSB10 BTG1 TMSB4X ZFP36L2 NACA PABPC1 ACTB FAU VIM H3-3B EIF1 ZFP36 SARAF PTMA IL7R JUN RACK1 EEF2 UBA52 GAPDH FTL FXYD5 DUSP1 S100A4 CD69 CXCR4 UBC TSC22D3 CFL1 KLF6 ARHGDIB KLF2 BTG2 CITED2 IER2 TUBB4B CD3E EEF1G SLC2A3 NFKBIA PFN1 SRGN SNX9 COX4I1 DNAJB1 SERF2 CD8A PCBP2 IL32 BIRC3 SMAP2 FUS GADD45B MYL12A OAZ1 ATP5F1E TUBA4A PNRC1</code> | <code>This measurement was conducted with 10x 5' v1. Sample is a cell from the omentum tissue, specifically an effector memory CD4-positive, alpha-beta T cell, from a female in her sixth decade.</code> | <code>This measurement was conducted with 10x 5' v1. Sample is a CD4-positive helper T cell, specifically Trm_Th1/Th17 subset, derived from the duodenum tissue of a male individual in his sixth decade.</code> | <code>MALAT1 TPT1 EEF1A1 VIM JUND TMSB4X PTMA FTH1 CRIP1 ANXA1 EIF1 UBC H3-3B ACTB SRGN FTL FAU KLF6 IL7R CALM1 UBA52 BTG1 SARAF IL32 TMSB10 PABPC1 HSP90AB1 DDX5 GAPDH TAGLN2 NACA CD44 HSPA5 RORA HSP90AA1 KLRB1 TNFAIP3 ATP5F1E PNRC1 ZFP36L2 H3-3A UBB FOS RACK1 FYN FAM107B GNAS EZR MYL6 CREM NFKBIA PFN1 ARHGDIB SRSF7 CD2 CCNI HNRNPA2B1 COX7C ITM2B SERF2 SH3BGRL3 TSC22D3 LMNA YWHAZ</code> |
316
+ | <code>MALAT1 GRIK1 SYT1 PCDH9 RORA NRG1 CADPS ZFPM2 LRRC4C LINGO2 RALYL PTPRD SPHKAP CNTNAP5 SLC8A1 CCSER1 HDAC9 CELF2 R3HDM1 CNTN4 RBMS3 PCDH7 GALNT13 UNC5D ROBO1 SYNPR SNAP25 GPM6A ANK3 FRMPD4 CHRM2 RYR2 KHDRBS2 CADM1 CACNA1D RGS6 PDE4D DOCK4 UNC13C CDH18 FAT3 MEG3 NR2F2-AS1 HMCN1 GULP1 CAMK2D ZEB1 SYN2 DYNC1I1 OXR1 DPP10 OSBPL6 FRAS1 PPP3CA ZNF385D ZMAT4 PCBP3 HS6ST3 ERC2 PLEKHA5 CDK14 MAP2 NCOA1 ATP8A2</code> | <code>This measurement was conducted with 10x 3' v3. Neuron cell type from a 29-year-old male, specifically from the thalamic complex, specifically the thalamus (THM) - posterior nuclear complex of thalamus (PoN) - medial geniculate nuclei (MG).</code> | <code>This measurement was conducted with 10x 3' v3. Astrocyte cell type from the thalamic complex, specifically from the thalamus (THM) - posterior nuclear complex of thalamus (PoN) - medial geniculate nuclei (MG) region, of a 42-year-old male.</code> | <code>MALAT1 PCDH9 PLP1 MBP ST18 QKI PDE4B RNF220 PTPRD SEPTIN7 TTLL7 NCKAP5 GPM6B PIP4K2A MOBP SLC44A1 PTGDS PLCL1 MAP7 ELMO1 SIK3 FTH1 ZBTB20 MAN2A1 TMEM165 DOCK10 TCF12 EDIL3 ZEB2 DPYD MAP4K4 PHLPP1 TF GAB1 TRIM2 FRMD4B DNAJC6 MARCHF1 ANK3 DST AGAP1 TMEM144 NEAT1 PLEKHH1 DLG1 CRYAB ERBIN RTN4 SPP1 ATP8A1 DOCK4 SLAIN1 APP DOCK5 APBB2 SAMD12 SHTN1 ZNF536 ZFYVE16 ARAP2 LIMCH1 HIPK2 BCAS1 FAM107B</code> |
317
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
318
+ ```json
319
+ {
320
+ "scale": 20.0,
321
+ "similarity_fct": "cos_sim"
322
+ }
323
+ ```
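
The two parameters above fit the usual reading of `MultipleNegativesRankingLoss`: cosine similarities between every anchor and every candidate in the batch are multiplied by `scale`, then treated as logits in a cross-entropy whose correct class is the anchor's own positive. The following is a simplified pure-Python sketch under that reading. It uses only in-batch positives as candidates and leaves out the extra `negative_1` / `negative_2` columns, and the function names are illustrative:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each anchor matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each anchor matches the other positive

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

With `scale` set to 20.0, a perfectly aligned batch gives a loss near zero, while a fully mismatched one gives a loss near 20, which shows how strongly the scale sharpens the softmax.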
324
+
325
+ #### gene_description
326
+
327
+ * Dataset: [gene_description](https://huggingface.co/datasets/jo-mengr/descriptions_genes) at [dd22363](https://huggingface.co/datasets/jo-mengr/descriptions_genes/tree/dd22363de0a7c501f41ba324fb3b8d6ecdd14dc7)
328
+ * Size: 116,208 training samples
329
+ * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative_1</code>
330
+ * Approximate statistics based on the first 1000 samples:
331
+ | | anchor | positive | negative_1 |
332
+ |:--------|:---------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------|
333
+ | type | string | string | string |
334
+ | details | <ul><li>min: 3 characters</li><li>mean: 5.88 characters</li><li>max: 12 characters</li></ul> | <ul><li>min: 16 characters</li><li>mean: 367.09 characters</li><li>max: 1375 characters</li></ul> | <ul><li>min: 13 characters</li><li>mean: 167.33 characters</li><li>max: 1375 characters</li></ul> |
335
+ * Samples:
336
+ | anchor | positive | negative_1 |
337
+ |:------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------|
338
+ | <code>A1BG</code> | <code>The protein encoded by this gene is a plasma glycoprotein of unknown function. The protein shows sequence similarity to the variable regions of some immunoglobulin supergene family member proteins. [provided by RefSeq, Jul 2008]</code> | <code>A1BG antisense RNA 1</code> |
339
+ | <code>A1BG</code> | <code>The protein encoded by this gene is a plasma glycoprotein of unknown function. The protein shows sequence similarity to the variable regions of some immunoglobulin supergene family member proteins. [provided by RefSeq, Jul 2008]</code> | <code>G antigen 12D</code> |
340
+ | <code>A1BG</code> | <code>The protein encoded by this gene is a plasma glycoprotein of unknown function. The protein shows sequence similarity to the variable regions of some immunoglobulin supergene family member proteins. [provided by RefSeq, Jul 2008]</code> | <code>G antigen 12B</code> |
341
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
342
+ ```json
343
+ {
344
+ "scale": 20.0,
345
+ "similarity_fct": "cos_sim"
346
+ }
347
+ ```
348
+
349
+ ### Evaluation Datasets
350
+
351
+ #### cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation
352
+
353
+ * Dataset: [cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation) at [d518eb2](https://huggingface.co/datasets/jo-mengr/cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation/tree/d518eb24af305653b43acd9e26f9502632059e7c)
354
+ * Size: 9,011 evaluation samples
355
+ * Columns: <code>anchor</code>, <code>positive</code>, <code>negative_1</code>, and <code>negative_2</code>
356
+ * Approximate statistics based on the first 1000 samples:
357
+ | | anchor | positive | negative_1 | negative_2 |
358
+ |:--------|:-------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------|
359
+ | type | string | string | string | string |
360
+ | details | <ul><li>min: 347 characters</li><li>mean: 386.7 characters</li><li>max: 437 characters</li></ul> | <ul><li>min: 99 characters</li><li>mean: 209.99 characters</li><li>max: 941 characters</li></ul> | <ul><li>min: 101 characters</li><li>mean: 208.8 characters</li><li>max: 728 characters</li></ul> | <ul><li>min: 356 characters</li><li>mean: 386.56 characters</li><li>max: 434 characters</li></ul> |
361
+ * Samples:
362
+ | anchor | positive | negative_1 | negative_2 |
363
+ |:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
364
+ | <code>MALAT1 EEF1A1 FTH1 TMSB4X ACTB FTL RTN4 ATP6V0B TPT1 FAU S100A6 NDUFA4 ATP5F1E COX7C ITM2B IGFBP7 EIF1 C12orf75 CD9 COX7B SERF2 ATP1B1 COX8A TXNIP NDUFB2 MYL6 PPDPF COX6B1 UQCR11 APOE COX4I1 CALM2 UQCRB S100A11 UQCRQ COX6C ATP5MG BSG ATP6AP2 UQCR10 PTMA NACA UBL5 UBA52 TMSB10 ADGRF5 HSP90AA1 GSTP1 ATP5F1D CHCHD2 GAPDH COX7A2 SKP1 HSPE1 PRDX1 CYSTM1 LGALS3 CD63 ATP5MJ CKB NDUFS5 ATP5ME UBB MAL</code> | <code>This measurement was conducted with 10x 3' v3. Cell sample from the cortex of kidney, taken from a 43-year-old male of European ethnicity with a reported history of kidney cancer. The cell type is identified as a kidney collecting duct intercalated cell.</code> | <code>This measurement was conducted with 10x 3' v3. Cell sample from the cortex of kidney, taken from a 72-year-old male of European ethnicity, identified as a kidney collecting duct intercalated cell, and preserved through cryopreservation.</code> | <code>MALAT1 TMSB4X TMSB10 ACTB TXNIP EEF1A1 TPT1 PFN1 BTG1 FAU PTMA S100A4 ATP5F1E EIF1 FTL CFL1 CYBA MYL12A SRGN SERF2 SH3BGRL3 CALM1 TYROBP MYL6 ZFP36 KLRD1 UBB NACA S100A6 UBA52 HSP90AA1 H3-3B LCP1 FTH1 DDIT4 FOS PPIA CD247 RACK1 TMA7 CORO1A OAZ1 TLE5 ARPC3 GAPDH KLF2 UBC ZFP36L2 TSC22D3 ITGB2 ARPC2 ATP5MG HOPX IFITM2 HMGB1 OST4 EEF1G PRDM1 CDC42 GSTP1 NDUFB2 CIRBP LGALS1 CHCHD2</code> |
365
+ | <code>MALAT1 KCND2 NRXN1 CDH18 NRXN3 ZNF385D CADM2 RALYL NKAIN2 CADPS2 RIMS1 FSTL5 GRID2 TRPM3 CHN2 DPP6 JMJD1C RORA PDE1A UNC13C TIAM1 NRG1 SNAP25 ZFPM2 CALN1 LSAMP CNTN1 ABLIM1 SYNE1 ANK3 CA10 NFIA ZBTB20 NTM CADM1 OPCML RELN DNM3 NEBL ERC1 SCN2A PPP3CA CACNA1A GALNT13 LRRC4C GPM6A RABGAP1L RIT2 CAMK4 GRIA4 PTPRD RBFOX3 MCTP1 LHFPL6 PCLO MEG3 PDE10A NOVA1 RTN1 ZNF385B CNTN4 GABRB2 SPOCK1 OXR1</code> | <code>This measurement was conducted with 10x 3' v3. Neuron cell type from a 29-year-old male cerebellum, specifically from the Cerebellar Vermis - CBV region, with European self-reported ethnicity, analyzed at the nucleus level.</code> | <code>This measurement was conducted with 10x 3' v3. Sample is an oligodendrocyte precursor cell taken from the cerebellum tissue of a 42-year-old human male, specifically from the Cerebellum (CB) - Cerebellar Vermis - CBV dissection.</code> | <code>MALAT1 NRXN3 SNTG1 UNC5C GRIA4 NRG1 RORA INPP4B CLSTN2 NKAIN2 FRMD4A DPP6 GRID2 NRXN1 LSAMP JMJD1C HS6ST3 NXPH1 MIR99AHG LRRC4C NTM CCNH NFIA ZFPM2 AFF3 OPCML PTPRT CADM2 ZBTB20 OLFM3 SLC22A3 CNTNAP5 CACNA2D3 CNTN4 KCND2 ADARB2 XKR4 GPM6A IL1RAPL1 ALK ANKRD36C UBE2E2 SYN3 GARNL3 PTPRG DAB1 TCF4 LINC00461 PRANCR GRIN2B TNRC6B MAPK10 NOVA1 NFIB ANK3 KCNMA1 KCNQ5 SPON1 TRIM9 VWA8 GDAP1 GABRG2 AHI1 ATP1B1</code> |
366
+ | <code>EEF1A1 ACTB GAPDH HMGN2 PTMA SERF2 TMSB4X CD74 PABPC1 FTH1 TMSB10 FAU PFN1 HMGN1 OAZ1 HMGB1 TPT1 PPIA NACA BTF3 MALAT1 MYL6 ATP5MG CFL1 RACK1 ODC1 ATP5F1E TMA7 SLC25A5 ELOB ARPC3 NPM1 COX7C ANP32B C4orf3 EIF1 PCBP2 KLF6 LAPTM5 COX8A RHOA HSPA8 H3-3B PTP4A2 UBA52 OST4 CIRBP LGALS1 EIF3L STMN1 PPDPF COX4I1 RAN EIF3F PPP1CC COMMD6 NDUFA4 YBX1 PEBP1 COTL1 COX7A2 HSPE1 CCNI TRIR</code> | <code>This measurement was conducted with 10x 5' v1. Cell sample from the tonsil of a 9-year-old female with recurrent tonsillitis, characterized as a centroblast B cell with IGLC2, IGLV7-43, IGLJ3 immunoglobulin genes expressed.</code> | <code>This measurement was conducted with 10x 5' v1. Germinal center B cell derived from the tonsil tissue of a 3-year-old male with recurrent tonsillitis.</code> | <code>CD74 MALAT1 EEF1A1 SSR4 TPT1 UBC EEF2 SAT1 RACK1 SEC11C ATP5MG FAU TSC22D3 PPIB XBP1 FTL GAPDH HLA-DRB5 HERPUD1 RGS2 HSPA8 TMSB4X HSP90B1 EIF1 PTMA SERP1 SERF2 NACA SEC61B GSTP1 UBA52 HSPA5 BTF3 LAPTM5 HSPE1 H3-3B ATP5F1A SEC61G CD38 EDF1 FTH1 IL16 NPM1 OST4 CIRBP EIF3E OAZ1 CYTIP PCBP2 MYDGF COX6B1 ZFP36 CSDE1 PABPC1 REXO2 KDELR1 PFN1 PTP4A1 TMBIM6 H1-10 PSAP UBE2J1 VIM MYL6</code> |
367
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
368
+ ```json
369
+ {
370
+ "scale": 20.0,
371
+ "similarity_fct": "cos_sim"
372
+ }
373
+ ```
374
+
375
+ #### gene_description
376
+
377
+ * Dataset: [gene_description](https://huggingface.co/datasets/jo-mengr/descriptions_genes) at [dd22363](https://huggingface.co/datasets/jo-mengr/descriptions_genes/tree/dd22363de0a7c501f41ba324fb3b8d6ecdd14dc7)
378
+ * Size: 1,000 evaluation samples
379
+ * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative_1</code>
380
+ * Approximate statistics based on the first 1000 samples:
381
+ | | anchor | positive | negative_1 |
382
+ |:--------|:---------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------|
383
+ | type | string | string | string |
384
+ | details | <ul><li>min: 3 characters</li><li>mean: 5.88 characters</li><li>max: 12 characters</li></ul> | <ul><li>min: 16 characters</li><li>mean: 367.09 characters</li><li>max: 1375 characters</li></ul> | <ul><li>min: 13 characters</li><li>mean: 167.33 characters</li><li>max: 1375 characters</li></ul> |
385
+ * Samples:
386
+ | anchor | positive | negative_1 |
387
+ |:------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------|
388
+ | <code>A1BG</code> | <code>The protein encoded by this gene is a plasma glycoprotein of unknown function. The protein shows sequence similarity to the variable regions of some immunoglobulin supergene family member proteins. [provided by RefSeq, Jul 2008]</code> | <code>A1BG antisense RNA 1</code> |
389
+ | <code>A1BG</code> | <code>The protein encoded by this gene is a plasma glycoprotein of unknown function. The protein shows sequence similarity to the variable regions of some immunoglobulin supergene family member proteins. [provided by RefSeq, Jul 2008]</code> | <code>G antigen 12D</code> |
390
+ | <code>A1BG</code> | <code>The protein encoded by this gene is a plasma glycoprotein of unknown function. The protein shows sequence similarity to the variable regions of some immunoglobulin supergene family member proteins. [provided by RefSeq, Jul 2008]</code> | <code>G antigen 12B</code> |
391
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
392
+ ```json
393
+ {
394
+ "scale": 20.0,
395
+ "similarity_fct": "cos_sim"
396
+ }
397
+ ```
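
The loss parameters above (`scale` 20.0, `cos_sim`) can be illustrated with a minimal, dependency-free sketch of an in-batch-negatives ranking loss. It is an illustration under stated assumptions, not the library's implementation: the function names `cos_sim` and `mnr_loss` and the toy vectors are hypothetical, and real training operates on batched tensors. Each anchor is scored against every candidate in the batch, the scores are multiplied by `scale`, and cross-entropy is taken with the anchor's own positive as the target.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(anchors, positives, negatives=(), scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i].

    All other positives, plus any explicit negatives, serve as distractors.
    """
    candidates = list(positives) + list(negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)

# Toy 2-D embeddings (hypothetical values, for illustration only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])  # positives match anchors
swapped = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])  # positives mismatched
```

With well-aligned pairs the loss is close to zero; with mismatched pairs it approaches `scale`, since the cosine gap of 1.0 is multiplied by 20 before the softmax.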
398
+
399
+ ### Training Hyperparameters
400
+ #### Non-Default Hyperparameters
401
+
402
+ - `eval_strategy`: steps
403
+ - `per_device_train_batch_size`: 128
404
+ - `per_device_eval_batch_size`: 128
405
+ - `learning_rate`: 2e-05
406
+ - `num_train_epochs`: 4
407
+ - `warmup_ratio`: 0.1
408
+ - `bf16`: True
409
+ - `gradient_checkpointing`: True
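
As a rough worked example, the non-default values above can be combined with the step and epoch columns of the training log to estimate the schedule. This is a sketch under assumptions: it assumes a single device (the device count is not stated here), uses the logged pair step 6150 / epoch 3.9883, and models `lr_scheduler_type: linear` as linear warmup followed by linear decay to zero. The helper name `lr_at` is hypothetical.

```python
import math

LEARNING_RATE = 2e-05
NUM_EPOCHS = 4
WARMUP_RATIO = 0.1
BATCH_SIZE = 128

# Step 6150 is logged at epoch 3.9883, which implies the steps per epoch.
steps_per_epoch = round(6150 / 3.9883)                 # 1542
total_steps = steps_per_epoch * NUM_EPOCHS             # 6168
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)   # 617

# Upper bound on training samples per epoch, assuming one device
# and gradient_accumulation_steps = 1 (the last batch may be partial).
max_samples_per_epoch = steps_per_epoch * BATCH_SIZE   # 197376

def lr_at(step):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    remaining = max(0, total_steps - step)
    return LEARNING_RATE * remaining / (total_steps - warmup_steps)
```

Under these assumptions the learning rate peaks at 2e-05 around step 617 and falls linearly to zero at step 6168.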
410
+
411
+ #### All Hyperparameters
412
+ <details><summary>Click to expand</summary>
413
+
414
+ - `overwrite_output_dir`: False
415
+ - `do_predict`: False
416
+ - `eval_strategy`: steps
417
+ - `prediction_loss_only`: True
418
+ - `per_device_train_batch_size`: 128
419
+ - `per_device_eval_batch_size`: 128
420
+ - `per_gpu_train_batch_size`: None
421
+ - `per_gpu_eval_batch_size`: None
422
+ - `gradient_accumulation_steps`: 1
423
+ - `eval_accumulation_steps`: None
424
+ - `torch_empty_cache_steps`: None
425
+ - `learning_rate`: 2e-05
426
+ - `weight_decay`: 0.0
427
+ - `adam_beta1`: 0.9
428
+ - `adam_beta2`: 0.999
429
+ - `adam_epsilon`: 1e-08
430
+ - `max_grad_norm`: 1.0
431
+ - `num_train_epochs`: 4
432
+ - `max_steps`: -1
433
+ - `lr_scheduler_type`: linear
434
+ - `lr_scheduler_kwargs`: {}
435
+ - `warmup_ratio`: 0.1
436
+ - `warmup_steps`: 0
437
+ - `log_level`: passive
438
+ - `log_level_replica`: warning
439
+ - `log_on_each_node`: True
440
+ - `logging_nan_inf_filter`: True
441
+ - `save_safetensors`: True
442
+ - `save_on_each_node`: False
443
+ - `save_only_model`: False
444
+ - `restore_callback_states_from_checkpoint`: False
445
+ - `no_cuda`: False
446
+ - `use_cpu`: False
447
+ - `use_mps_device`: False
448
+ - `seed`: 42
449
+ - `data_seed`: None
450
+ - `jit_mode_eval`: False
451
+ - `use_ipex`: False
452
+ - `bf16`: True
453
+ - `fp16`: False
454
+ - `fp16_opt_level`: O1
455
+ - `half_precision_backend`: auto
456
+ - `bf16_full_eval`: False
457
+ - `fp16_full_eval`: False
458
+ - `tf32`: None
459
+ - `local_rank`: 0
460
+ - `ddp_backend`: None
461
+ - `tpu_num_cores`: None
462
+ - `tpu_metrics_debug`: False
463
+ - `debug`: []
464
+ - `dataloader_drop_last`: False
465
+ - `dataloader_num_workers`: 0
466
+ - `dataloader_prefetch_factor`: None
467
+ - `past_index`: -1
468
+ - `disable_tqdm`: False
469
+ - `remove_unused_columns`: True
470
+ - `label_names`: None
471
+ - `load_best_model_at_end`: False
472
+ - `ignore_data_skip`: False
473
+ - `fsdp`: []
474
+ - `fsdp_min_num_params`: 0
475
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
476
+ - `fsdp_transformer_layer_cls_to_wrap`: None
477
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
478
+ - `deepspeed`: None
479
+ - `label_smoothing_factor`: 0.0
480
+ - `optim`: adamw_torch
481
+ - `optim_args`: None
482
+ - `adafactor`: False
483
+ - `group_by_length`: False
484
+ - `length_column_name`: length
485
+ - `ddp_find_unused_parameters`: None
486
+ - `ddp_bucket_cap_mb`: None
487
+ - `ddp_broadcast_buffers`: False
488
+ - `dataloader_pin_memory`: True
489
+ - `dataloader_persistent_workers`: False
490
+ - `skip_memory_metrics`: True
491
+ - `use_legacy_prediction_loop`: False
492
+ - `push_to_hub`: False
493
+ - `resume_from_checkpoint`: None
494
+ - `hub_model_id`: None
495
+ - `hub_strategy`: every_save
496
+ - `hub_private_repo`: None
497
+ - `hub_always_push`: False
498
+ - `hub_revision`: None
499
+ - `gradient_checkpointing`: True
500
+ - `gradient_checkpointing_kwargs`: None
501
+ - `include_inputs_for_metrics`: False
502
+ - `include_for_metrics`: []
503
+ - `eval_do_concat_batches`: True
504
+ - `fp16_backend`: auto
505
+ - `push_to_hub_model_id`: None
506
+ - `push_to_hub_organization`: None
507
+ - `mp_parameters`:
508
+ - `auto_find_batch_size`: False
509
+ - `full_determinism`: False
510
+ - `torchdynamo`: None
511
+ - `ray_scope`: last
512
+ - `ddp_timeout`: 1800
513
+ - `torch_compile`: False
514
+ - `torch_compile_backend`: None
515
+ - `torch_compile_mode`: None
516
+ - `include_tokens_per_second`: False
517
+ - `include_num_input_tokens_seen`: False
518
+ - `neftune_noise_alpha`: None
519
+ - `optim_target_modules`: None
520
+ - `batch_eval_metrics`: False
521
+ - `eval_on_start`: False
522
+ - `use_liger_kernel`: False
523
+ - `liger_kernel_config`: None
524
+ - `eval_use_gather_object`: False
525
+ - `average_tokens_across_devices`: False
526
+ - `prompts`: None
527
+ - `batch_sampler`: batch_sampler
528
+ - `multi_dataset_batch_sampler`: proportional
529
+ - `router_mapping`: {}
530
+ - `learning_rate_mapping`: {}
531
+
532
+ </details>
533
+
534
+ ### Training Logs
535
+ <details><summary>Click to expand</summary>
536
+
537
+ | Epoch | Step | Training Loss | cellxgene pseudo bulk 100k multiplets natural language annotation loss | gene description loss | cellxgene_pseudo_bulk_100k_multiplets_natural_language_annotation_cell_sentence_2_cosine_accuracy | gene_description_cosine_accuracy |
538
+ |:------:|:----:|:-------------:|:----------------------------------------------------------------------:|:---------------------:|:-------------------------------------------------------------------------------------------------:|:--------------------------------:|
539
+ | 0.0324 | 50 | 9.3314 | 12.6479 | 6.6616 | 0.5052 | 0.2570 |
540
+ | 0.0649 | 100 | 7.9528 | 10.8869 | 6.0596 | 0.5078 | 0.2660 |
541
+ | 0.0973 | 150 | 7.0084 | 7.0423 | 5.4704 | 0.5075 | 0.3020 |
542
+ | 0.1297 | 200 | 5.6925 | 6.0263 | 5.2950 | 0.5024 | 0.5200 |
543
+ | 0.1621 | 250 | 5.381 | 5.8141 | 4.7323 | 0.5367 | 0.6520 |
544
+ | 0.1946 | 300 | 4.3736 | 5.4432 | 4.3565 | 0.5518 | 0.7060 |
545
+ | 0.2270 | 350 | 3.8184 | 5.1966 | 4.1283 | 0.5836 | 0.7690 |
546
+ | 0.2594 | 400 | 3.6181 | 5.0588 | 3.9594 | 0.6064 | 0.7650 |
547
+ | 0.2918 | 450 | 3.1076 | 4.9406 | 3.7824 | 0.6218 | 0.8030 |
548
+ | 0.3243 | 500 | 3.127 | 4.8376 | 3.6785 | 0.6369 | 0.8230 |
549
+ | 0.3567 | 550 | 3.1702 | 4.8230 | 3.6029 | 0.6532 | 0.8410 |
550
+ | 0.3891 | 600 | 2.992 | 5.1160 | 3.6091 | 0.6240 | 0.8310 |
551
+ | 0.4215 | 650 | 2.606 | 4.5652 | 3.5555 | 0.6679 | 0.8490 |
552
+ | 0.4540 | 700 | 2.9473 | 4.5831 | 3.5215 | 0.6846 | 0.8600 |
553
+ | 0.4864 | 750 | 2.369 | 4.4464 | 3.4824 | 0.6930 | 0.8800 |
554
+ | 0.5188 | 800 | 2.5923 | 4.4542 | 3.4372 | 0.6983 | 0.8820 |
555
+ | 0.5512 | 850 | 2.9167 | 4.4572 | 3.4915 | 0.6984 | 0.8730 |
556
+ | 0.5837 | 900 | 2.5716 | 4.2259 | 3.4390 | 0.7126 | 0.8630 |
557
+ | 0.6161 | 950 | 2.375 | 4.2200 | 3.4250 | 0.7143 | 0.8740 |
558
+ | 0.6485 | 1000 | 2.4105 | 4.2001 | 3.3524 | 0.7187 | 0.8890 |
559
+ | 0.6809 | 1050 | 2.4014 | 4.0744 | 3.2688 | 0.7243 | 0.8950 |
560
+ | 0.7134 | 1100 | 2.7474 | 4.1131 | 3.3046 | 0.7270 | 0.8850 |
561
+ | 0.7458 | 1150 | 2.1615 | 4.2206 | 3.2392 | 0.7202 | 0.8860 |
562
+ | 0.7782 | 1200 | 2.4409 | 4.4682 | 3.1664 | 0.7106 | 0.8870 |
563
+ | 0.8106 | 1250 | 2.5041 | 4.0881 | 3.1417 | 0.7277 | 0.9030 |
564
+ | 0.8431 | 1300 | 2.4221 | 3.8777 | 3.2302 | 0.7409 | 0.8940 |
565
+ | 0.8755 | 1350 | 2.189 | 3.8482 | 3.1316 | 0.7441 | 0.9050 |
566
+ | 0.9079 | 1400 | 2.3055 | 3.8571 | 3.1550 | 0.7451 | 0.9030 |
567
+ | 0.9403 | 1450 | 2.0945 | 3.8233 | 3.1269 | 0.7530 | 0.9020 |
568
+ | 0.9728 | 1500 | 2.0217 | 3.7722 | 3.0707 | 0.7527 | 0.9070 |
569
+ | 1.0052 | 1550 | 2.2443 | 3.8285 | 3.0799 | 0.7459 | 0.9190 |
570
+ | 1.0376 | 1600 | 1.9441 | 3.8292 | 3.0957 | 0.7470 | 0.9090 |
571
+ | 1.0700 | 1650 | 1.8771 | 3.6837 | 3.0190 | 0.7555 | 0.9290 |
572
+ | 1.1025 | 1700 | 1.9489 | 3.6946 | 3.0298 | 0.7570 | 0.9210 |
573
+ | 1.1349 | 1750 | 2.0622 | 3.7221 | 3.0001 | 0.7574 | 0.9140 |
574
+ | 1.1673 | 1800 | 1.7275 | 3.7806 | 2.9919 | 0.7530 | 0.9090 |
575
+ | 1.1997 | 1850 | 2.0068 | 3.6648 | 2.9490 | 0.7584 | 0.9230 |
576
+ | 1.2322 | 1900 | 1.9126 | 3.7416 | 2.9131 | 0.7603 | 0.9160 |
577
+ | 1.2646 | 1950 | 1.9513 | 3.5770 | 2.9362 | 0.7625 | 0.9230 |
578
+ | 1.2970 | 2000 | 1.8021 | 3.6660 | 2.8868 | 0.7670 | 0.9360 |
579
+ | 1.3294 | 2050 | 1.9685 | 3.7318 | 2.8669 | 0.7587 | 0.9390 |
580
+ | 1.3619 | 2100 | 1.7835 | 3.5471 | 2.8356 | 0.7712 | 0.9350 |
581
+ | 1.3943 | 2150 | 1.826 | 3.5666 | 2.7893 | 0.7707 | 0.9340 |
582
+ | 1.4267 | 2200 | 1.9708 | 3.5630 | 2.7570 | 0.7741 | 0.9290 |
583
+ | 1.4591 | 2250 | 2.0131 | 3.5586 | 2.8239 | 0.7742 | 0.9360 |
584
+ | 1.4916 | 2300 | 1.856 | 3.5155 | 2.7658 | 0.7779 | 0.9410 |
585
+ | 1.5240 | 2350 | 1.9354 | 3.7959 | 2.7921 | 0.7622 | 0.9380 |
586
+ | 1.5564 | 2400 | 1.8961 | 3.5166 | 2.7456 | 0.7790 | 0.9430 |
587
+ | 1.5888 | 2450 | 1.6347 | 3.4784 | 2.7911 | 0.7800 | 0.9470 |
588
+ | 1.6213 | 2500 | 1.9176 | 3.4388 | 2.7349 | 0.7829 | 0.9440 |
589
+ | 1.6537 | 2550 | 2.0475 | 3.6968 | 2.7456 | 0.7754 | 0.9390 |
590
+ | 1.6861 | 2600 | 1.7946 | 3.4758 | 2.7046 | 0.7848 | 0.9470 |
591
+ | 1.7185 | 2650 | 1.9581 | 3.3828 | 2.7022 | 0.7867 | 0.9430 |
592
+ | 1.7510 | 2700 | 1.8475 | 3.3631 | 2.6706 | 0.7903 | 0.9470 |
593
+ | 1.7834 | 2750 | 1.836 | 3.5622 | 2.6512 | 0.7857 | 0.9450 |
594
+ | 1.8158 | 2800 | 2.051 | 3.3523 | 2.6542 | 0.7926 | 0.9390 |
595
+ | 1.8482 | 2850 | 1.829 | 3.3676 | 2.6730 | 0.7925 | 0.9390 |
596
+ | 1.8807 | 2900 | 1.7557 | 3.3632 | 2.6536 | 0.7954 | 0.9470 |
597
+ | 1.9131 | 2950 | 1.7725 | 3.3448 | 2.6437 | 0.7946 | 0.9470 |
598
+ | 1.9455 | 3000 | 1.7373 | 3.2736 | 2.6562 | 0.7987 | 0.9440 |
599
+ | 1.9780 | 3050 | 1.886 | 3.3404 | 2.6456 | 0.7958 | 0.9450 |
600
+ | 2.0104 | 3100 | 1.7217 | 3.2570 | 2.6893 | 0.7988 | 0.9400 |
601
+ | 2.0428 | 3150 | 1.6235 | 3.2331 | 2.6132 | 0.8004 | 0.9430 |
602
+ | 2.0752 | 3200 | 1.6678 | 3.2466 | 2.5904 | 0.8030 | 0.9470 |
603
+ | 2.1077 | 3250 | 1.6784 | 3.2339 | 2.5956 | 0.8008 | 0.9480 |
604
+ | 2.1401 | 3300 | 1.8422 | 3.2286 | 2.5997 | 0.8039 | 0.9480 |
605
+ | 2.1725 | 3350 | 1.4859 | 3.2163 | 2.5924 | 0.8049 | 0.9470 |
606
+ | 2.2049 | 3400 | 1.6165 | 3.3246 | 2.6167 | 0.7989 | 0.9440 |
607
+ | 2.2374 | 3450 | 1.65 | 3.2184 | 2.5864 | 0.8039 | 0.9460 |
608
+ | 2.2698 | 3500 | 1.5071 | 3.2274 | 2.5788 | 0.8019 | 0.9460 |
609
+ | 2.3022 | 3550 | 1.5238 | 3.2032 | 2.5608 | 0.8075 | 0.9480 |
610
+ | 2.3346 | 3600 | 1.568 | 3.2409 | 2.5649 | 0.8081 | 0.9470 |
611
+ | 2.3671 | 3650 | 1.4644 | 3.1937 | 2.5841 | 0.8079 | 0.9430 |
612
+ | 2.3995 | 3700 | 1.5782 | 3.2033 | 2.5909 | 0.8065 | 0.9450 |
613
+ | 2.4319 | 3750 | 1.6976 | 3.1905 | 2.5690 | 0.8073 | 0.9470 |
614
+ | 2.4643 | 3800 | 1.4682 | 3.2078 | 2.5610 | 0.8052 | 0.9490 |
615
+ | 2.4968 | 3850 | 1.7414 | 3.1822 | 2.5650 | 0.8072 | 0.9500 |
616
+ | 2.5292 | 3900 | 1.654 | 3.1890 | 2.5566 | 0.8110 | 0.9490 |
617
+ | 2.5616 | 3950 | 1.5187 | 3.1843 | 2.5508 | 0.8090 | 0.9470 |
618
+ | 2.5940 | 4000 | 1.4893 | 3.1855 | 2.5527 | 0.8067 | 0.9470 |
619
+ | 2.6265 | 4050 | 1.6716 | 3.1520 | 2.5432 | 0.8093 | 0.9480 |
620
+ | 2.6589 | 4100 | 1.4914 | 3.1868 | 2.5466 | 0.8099 | 0.9500 |
621
+ | 2.6913 | 4150 | 1.6231 | 3.1702 | 2.5235 | 0.8112 | 0.9500 |
622
+ | 2.7237 | 4200 | 1.6058 | 3.1561 | 2.5171 | 0.8096 | 0.9520 |
623
+ | 2.7562 | 4250 | 1.5753 | 3.1660 | 2.5068 | 0.8111 | 0.9530 |
624
+ | 2.7886 | 4300 | 1.4654 | 3.1507 | 2.5156 | 0.8138 | 0.9510 |
625
+ | 2.8210 | 4350 | 1.5901 | 3.1960 | 2.4917 | 0.8115 | 0.9540 |
626
+ | 2.8534 | 4400 | 1.5034 | 3.1491 | 2.4960 | 0.8116 | 0.9550 |
627
+ | 2.8859 | 4450 | 1.4088 | 3.1505 | 2.5086 | 0.8133 | 0.9530 |
628
+ | 2.9183 | 4500 | 1.5527 | 3.1671 | 2.5154 | 0.8112 | 0.9540 |
629
+ | 2.9507 | 4550 | 1.5344 | 3.1329 | 2.5016 | 0.8141 | 0.9530 |
630
+ | 2.9831 | 4600 | 1.4156 | 3.1439 | 2.4858 | 0.8146 | 0.9550 |
631
+ | 3.0156 | 4650 | 1.8602 | 3.1056 | 2.4799 | 0.8163 | 0.9550 |
632
+ | 3.0480 | 4700 | 1.4472 | 3.1387 | 2.4539 | 0.8126 | 0.9540 |
633
+ | 3.0804 | 4750 | 1.3582 | 3.1220 | 2.4676 | 0.8159 | 0.9530 |
634
+ | 3.1128 | 4800 | 1.5408 | 3.1309 | 2.4722 | 0.8142 | 0.9540 |
635
+ | 3.1453 | 4850 | 1.3755 | 3.1227 | 2.4624 | 0.8171 | 0.9530 |
636
+ | 3.1777 | 4900 | 1.4571 | 3.1284 | 2.4410 | 0.8162 | 0.9560 |
637
+ | 3.2101 | 4950 | 1.5657 | 3.0882 | 2.4486 | 0.8167 | 0.9550 |
638
+ | 3.2425 | 5000 | 1.5325 | 3.0980 | 2.4339 | 0.8178 | 0.9540 |
639
+ | 3.2750 | 5050 | 1.4671 | 3.0961 | 2.4625 | 0.8169 | 0.9550 |
640
+ | 3.3074 | 5100 | 1.4808 | 3.1176 | 2.4578 | 0.8180 | 0.9550 |
641
+ | 3.3398 | 5150 | 1.4172 | 3.1338 | 2.4515 | 0.8168 | 0.9550 |
642
+ | 3.3722 | 5200 | 1.4953 | 3.1047 | 2.4425 | 0.8174 | 0.9540 |
643
+ | 3.4047 | 5250 | 1.6419 | 3.1081 | 2.4317 | 0.8180 | 0.9540 |
644
+ | 3.4371 | 5300 | 1.5425 | 3.0910 | 2.4481 | 0.8210 | 0.9560 |
645
+ | 3.4695 | 5350 | 1.5598 | 3.1049 | 2.4365 | 0.8198 | 0.9560 |
646
+ | 3.5019 | 5400 | 1.4086 | 3.1036 | 2.4352 | 0.8198 | 0.9550 |
647
+ | 3.5344 | 5450 | 1.6057 | 3.1076 | 2.4269 | 0.8197 | 0.9560 |
648
+ | 3.5668 | 5500 | 1.6735 | 3.0792 | 2.4291 | 0.8200 | 0.9550 |
649
+ | 3.5992 | 5550 | 1.401 | 3.0959 | 2.4364 | 0.8211 | 0.9550 |
650
+ | 3.6316 | 5600 | 1.2475 | 3.0909 | 2.4324 | 0.8202 | 0.9570 |
651
+ | 3.6641 | 5650 | 1.2495 | 3.0686 | 2.4148 | 0.8210 | 0.9550 |
652
+ | 3.6965 | 5700 | 1.4457 | 3.0837 | 2.4123 | 0.8197 | 0.9570 |
653
+ | 3.7289 | 5750 | 1.5794 | 3.0877 | 2.4171 | 0.8191 | 0.9560 |
654
+ | 3.7613 | 5800 | 1.5696 | 3.0936 | 2.4153 | 0.8186 | 0.9560 |
655
+ | 3.7938 | 5850 | 1.5947 | 3.0778 | 2.4173 | 0.8190 | 0.9560 |
656
+ | 3.8262 | 5900 | 1.4517 | 3.0760 | 2.4242 | 0.8202 | 0.9560 |
657
+ | 3.8586 | 5950 | 1.553 | 3.0897 | 2.4222 | 0.8188 | 0.9580 |
658
+ | 3.8911 | 6000 | 1.2109 | 3.0683 | 2.4233 | 0.8211 | 0.9550 |
659
+ | 3.9235 | 6050 | 1.4384 | 3.0756 | 2.4221 | 0.8208 | 0.9560 |
660
+ | 3.9559 | 6100 | 1.4945 | 3.0755 | 2.4179 | 0.8202 | 0.9560 |
661
+ | 3.9883 | 6150 | 1.4597 | 3.0686 | 2.4183 | 0.8204 | 0.9560 |
662
+
663
+ </details>
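
The two `*_cosine_accuracy` columns read like triplet accuracies: the fraction of (anchor, positive, negative) triplets for which the anchor is closer to the positive than to the negative under cosine similarity. That interpretation is an assumption based on the column names and the `anchor` / `positive` / `negative_1` dataset columns. The sketch below shows the computation on toy vectors; the function names and data are hypothetical.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_cosine_accuracy(triplets):
    """Share of triplets where cos(anchor, positive) > cos(anchor, negative)."""
    hits = sum(1 for a, p, n in triplets if cos_sim(a, p) > cos_sim(a, n))
    return hits / len(triplets)

# Toy embeddings (hypothetical values, for illustration only).
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # positive is closer: counted as a hit
    ([0.0, 1.0], [1.0, 0.2], [0.1, 1.0]),  # negative is closer: counted as a miss
]
accuracy = triplet_cosine_accuracy(triplets)
```

Under this definition, 0.5 is the score a model would get by ranking the positive and the negative at random.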
664
+
665
+ ### Framework Versions
666
+ - Python: 3.11.6
667
+ - Sentence Transformers: 5.0.0
668
+ - Transformers: 4.55.0.dev0
669
+ - PyTorch: 2.5.1+cu121
670
+ - Accelerate: 1.9.0
671
+ - Datasets: 2.19.1
672
+ - Tokenizers: 0.21.4
673
+
674
+ ## Citation
675
+
676
+ ### BibTeX
677
+
678
+ #### Sentence Transformers
679
+ ```bibtex
680
+ @inproceedings{reimers-2019-sentence-bert,
681
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
682
+ author = "Reimers, Nils and Gurevych, Iryna",
683
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
684
+ month = "11",
685
+ year = "2019",
686
+ publisher = "Association for Computational Linguistics",
687
+ url = "https://arxiv.org/abs/1908.10084",
688
+ }
689
+ ```
690
+
691
+ #### MultipleNegativesRankingLoss
692
+ ```bibtex
693
+ @misc{henderson2017efficient,
694
+ title={Efficient Natural Language Response Suggestion for Smart Reply},
695
+ author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
696
+ year={2017},
697
+ eprint={1705.00652},
698
+ archivePrefix={arXiv},
699
+ primaryClass={cs.CL}
700
+ }
701
+ ```
702
+
703
+ <!--
704
+ ## Glossary
705
+
706
+ *Clearly define terms in order to be accessible across audiences.*
707
+ -->
708
+
709
+ <!--
710
+ ## Model Card Authors
711
+
712
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
713
+ -->
714
+
715
+ <!--
716
+ ## Model Card Contact
717
+
718
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
719
+ -->
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "model_type": "SentenceTransformer",
3
+ "__version__": {
4
+ "sentence_transformers": "5.0.0",
5
+ "transformers": "4.55.0.dev0",
6
+ "pytorch": "2.5.1+cu121"
7
+ },
8
+ "prompts": {
9
+ "query": "",
10
+ "document": ""
11
+ },
12
+ "default_prompt_name": null,
13
+ "similarity_fn_name": "cosine"
14
+ }
modules.json ADDED
@@ -0,0 +1,8 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "0_MMContextEncoder",
6
+ "type": "mmcontext.models.mmcontextencoder.MMContextEncoder"
7
+ }
8
+ ]
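
The `type` field in `modules.json` is a dotted import path, so loading this model presumably requires the `mmcontext` package to be importable in the environment. The sketch below only parses the entry shown above and splits the path into a module and a class name; it does not import anything, and the variable names are illustrative.

```python
import json

# The single entry from modules.json, reproduced as a string.
MODULES_JSON = """
[
  {
    "idx": 0,
    "name": "0",
    "path": "0_MMContextEncoder",
    "type": "mmcontext.models.mmcontextencoder.MMContextEncoder"
  }
]
"""

modules = json.loads(MODULES_JSON)
entry = modules[0]

# Everything before the last dot is the module; the remainder is the class.
module_path, class_name = entry["type"].rsplit(".", 1)
weights_dir = entry["path"]  # sub-folder holding this module's config and weights
```

Here `module_path` is `mmcontext.models.mmcontextencoder`, `class_name` is `MMContextEncoder`, and `weights_dir` points at the `0_MMContextEncoder` folder added in this commit.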