emdann committed on
Commit
37b6d41
·
verified ·
1 Parent(s): 032c0ea

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +54 -4
README.md CHANGED
@@ -1,14 +1,64 @@
1
- # clinOracle team - tahoe-hack-2025
2
 
3
- Predicting clinical outcomes from _in vitro_ transcriptional responses.
4
 
5
- ## Code
6
  - `representations/` - scripts to train transcriptome effect representations on Tahoe-100M
7
  - `clinical_data_curation/` - scripts to curate clinical trial data
8
  - `approval_prediction_benchmark.ipynb` - benchmark on clinical approval prediction
9
  - `classifier.py` - Benchmarking classifier implementation
10
 
11
- ## Data
12
  - `clinical_evidence_data/` - Curated clinical evidence data on Tahoe drugs
13
  - `data_for_classifier/` - input data for benchmarks
14
  - `data/` - misc processed data
 
1
+ ---
2
+ tags:
3
+ - tahoe-deepdive
4
+ license: "gpl-3.0"
5
+ datasets:
6
+ - tahoebio/Tahoe-100M
7
+ ---
8
 
9
+ # ClinOracle
10
 
11
+ ## Contents
12
+
13
+ #### Code
14
  - `representations/` - scripts to train transcriptome effect representations on Tahoe-100M
15
  - `clinical_data_curation/` - scripts to curate clinical trial data
16
  - `approval_prediction_benchmark.ipynb` - benchmark on clinical approval prediction
17
  - `classifier.py` - Benchmarking classifier implementation
18
 
19
+ #### Data
20
  - `clinical_evidence_data/` - Curated clinical evidence data on Tahoe drugs
21
  - `data_for_classifier/` - input data for benchmarks
22
  - `data/` - misc processed data
23
+
24
+
25
+ ## Team Members
26
+ - Emma Dann
27
+ - Tony Zen
28
+ - Ross Giglio
29
+ - Kevin Hoffer-Hawlik
30
+ - Meer Mustafa
31
+
32
+ ## Project
33
+ ### Pharmacotranscriptomic representations to predict clinical trial success
34
+
35
+ ### Overview
36
+ Large _in vitro_ perturbation screens such as Tahoe-100M make it possible to test whether transcriptional responses to drugs predict measures of clinical success, such as drug approval.
37
+
38
+ ### Motivation
39
+ Despite rigorous research efforts, clinical success and drug approval remain difficult to predict during early drug development.
40
+
41
+ ### Methods
42
+
43
+ #### Clinical trial information
44
+ We used LLMs to collect clinical trial and adverse effect data associated with the chemical agents screened in Tahoe-100M, and annotated which drugs were tested or reached approval for a condition affecting one of the screened organs.
45
+
46
+ #### Transcriptome effects representations
47
+
48
+ - E-distance: overall transcriptional shift from DMSO for each drug in each cell line. We selected the dose with the maximum E-distance for each drug-cell line pair.
49
+ - LDVAE: VAE with a linear decoder for gene program interpretability (trained on plates 1-4, then used to generate embeddings for the full dataset).
50
+ - MrVI: sample-aware VAE representation. Using the pseudobulked Tahoe-100M data, we trained a MrVI model with each sample defined as `cell_drug` and the union of highly variable genes within each cell line as features. We generated two latent embeddings, the 10-dimensional u-space and the 30-dimensional z-space, which were used as input to the classifier.
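
The E-distance bullet above can be illustrated with a minimal sketch. It assumes the classical energy distance with plain Euclidean distances; the README does not say which variant or feature space was used (some perturbation toolkits use squared distances instead). The function names, the dose labels, and the synthetic profiles are all hypothetical and exist only for illustration.

```python
import numpy as np


def pairwise_mean_distance(a, b):
    """Mean Euclidean distance over all pairs (row of a, row of b)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).mean()


def energy_distance(treated, control):
    """Energy distance: 2*E|X-Y| - E|X-X'| - E|Y-Y'| (V-statistic form)."""
    return (
        2.0 * pairwise_mean_distance(treated, control)
        - pairwise_mean_distance(treated, treated)
        - pairwise_mean_distance(control, control)
    )


def max_edistance_dose(profiles_by_dose, control):
    """Return the dose whose profiles are farthest from control, with its score."""
    scores = {
        dose: energy_distance(profiles, control)
        for dose, profiles in profiles_by_dose.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Synthetic stand-in for one drug-cell line pair: rows are cells (or
# pseudobulk samples), columns are features such as principal components.
rng = np.random.default_rng(0)
control = rng.normal(0.0, 1.0, size=(200, 5))  # plays the role of DMSO
profiles_by_dose = {
    "low": rng.normal(0.1, 1.0, size=(200, 5)),
    "mid": rng.normal(0.5, 1.0, size=(200, 5)),
    "high": rng.normal(2.0, 1.0, size=(200, 5)),
}

best_dose, best_score = max_edistance_dose(profiles_by_dose, control)
print(best_dose, round(float(best_score), 3))
```

Keeping one score per drug-cell line pair, as in `max_edistance_dose`, mirrors the dose selection described in the bullet; whether the original scripts computed distances on raw expression or on a reduced space is not stated.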
51
+
52
+ #### Benchmark set-up
53
+
54
+ We use logistic regression on the transcriptome-effect representations to predict whether a drug was approved for a tissue of interest, splitting drugs into train and test sets and evaluating the precision-recall curve on the test drugs. We treat the approval rate per organ as a technical confounder that must be accounted for.
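
As a rough sketch of this set-up, assuming scikit-learn and entirely synthetic data: the example splits by drug so that no drug appears in both train and test, fits a logistic regression, and compares its average precision with a baseline that scores each test pair by the training approval rate of its organ. The variable names, array shapes, and approval rates are hypothetical and are not taken from `classifier.py`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_drugs, n_organs, n_dims = 120, 4, 10

# One row per (drug, organ) pair; features stand in for a drug representation.
drug_ids = np.repeat(np.arange(n_drugs), n_organs)
organ_ids = np.tile(np.arange(n_organs), n_drugs)
X = rng.normal(size=(n_drugs * n_organs, n_dims))

# Synthetic labels: approval depends only on the organ, not on the features.
organ_rates = np.array([0.05, 0.10, 0.20, 0.40])
y = rng.binomial(1, organ_rates[organ_ids])

# Split by drug so that test drugs are never seen during training.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=drug_ids))

model = LogisticRegression(max_iter=1000)
model.fit(X[train_idx], y[train_idx])
model_scores = model.predict_proba(X[test_idx])[:, 1]

# Baseline: score each test pair with its organ's approval rate in training.
train_rate = np.array(
    [y[train_idx][organ_ids[train_idx] == organ].mean() for organ in range(n_organs)]
)
baseline_scores = train_rate[organ_ids[test_idx]]

ap_model = average_precision_score(y[test_idx], model_scores)
ap_baseline = average_precision_score(y[test_idx], baseline_scores)
print(f"model AP={ap_model:.3f}  baseline AP={ap_baseline:.3f}")
```

Because the synthetic labels here depend only on the organ, the baseline is a meaningful reference point: a representation is informative only to the extent that it improves on it.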
55
+
56
+ ### Results
57
+
58
+ None of the unsupervised multi-dimensional representations outperformed the approval-rate baseline. However, E-distance was consistently negatively associated with approval for conditions affecting the target tissue.
59
+
60
+ ### Discussion and Future Work
61
+
62
+ With the concept established, we propose testing additional representations of the data, including MrVI single-cell sample-sample distances, differential gene expression or program expression, and cell counts. The framework is set up to test further prediction targets, such as clinical trial phase success and adverse effect (AE) rate or severity.
63
+
64
+