emdann/clin-oracle-tahoe-deepdive

tahoe-deepdive

emdann committed on May 11

Commit

37b6d41

verified ·

1 Parent(s): 032c0ea

Upload folder using huggingface_hub

README.md +54 -4

@@ -1,14 +1,64 @@
-# clinOracle team - tahoe-hack-2025
-Predicting clinical outcomes from _in vitro_ transcriptional responses.
-## Code
 - `representations/` - scripts to train transcriptome effect representations on Tahoe-100M
 - `clinical_data_curation/` - scripts to curate clinical trial data
 - `approval_prediction_benchmark.ipynb` - benchmark on clinical approval prediction
 - `classifier.py` - Benchmarking classifier implementation
-## Data
 - `clinical_evidence_data/` - Curated clinical evidence data on Tahoe drugs
 - `data_for_classifier/` - input data for benchmarks
 - `data/` - misc processed data

+---
+tags:
+- tahoe-deepdive
+license: "gpl-3.0"
+datasets:
+- tahoebio/Tahoe-100M
+---
+# ClinOracle
+## Contents
+#### Code
 - `representations/` - scripts to train transcriptome effect representations on Tahoe-100M
 - `clinical_data_curation/` - scripts to curate clinical trial data
 - `approval_prediction_benchmark.ipynb` - benchmark on clinical approval prediction
 - `classifier.py` - Benchmarking classifier implementation
+#### Data
 - `clinical_evidence_data/` - Curated clinical evidence data on Tahoe drugs
 - `data_for_classifier/` - input data for benchmarks
 - `data/` - misc processed data
+## Team Members
+- Emma Dann
+- Tony Zen
+- Ross Giglio
+- Kevin Hoffer-Hawlik
+- Meer Mustafa
+## Project
+### Pharmacotranscriptomic representations to predict clinical trial success
+### Overview
+Large _in vitro_ perturbation screens like Tahoe-100M make it possible to assess whether transcriptional responses predict measures of clinical success such as drug approval.
+### Motivation
+Despite rigorous research efforts, clinical success and drug approval remain difficult to predict in early drug development.
+### Methods
+#### Clinical trial information
+We used LLMs to collect clinical trial and adverse effects data associated with the chemical agents screened in Tahoe-100M, and annotated which drugs were tested or reached approval for a condition affecting one of the screened organs.
+#### Transcriptome effects representations
+- E-distance: overall transcriptional shift from the DMSO control for each drug in each cell line. We selected the dose with the maximum e-distance for each drug-cell line pair.
+- LDVAE: VAE with a linear decoder for gene program interpretability (trained on plates 1-4, then used to generate embeddings for the full dataset).
+- MrVI: sample-aware VAE representation. Using the pseudobulked Tahoe-100M data, we trained a MrVI model with each sample defined as cell_drug, using the union of the highly variable genes identified within each cell line as features. We generated two latent embeddings, the 10-dimensional u-space and the 30-dimensional z-space, both of which were used as input to the classifier.
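The e-distance bullet above can be illustrated with a minimal sketch. It assumes the classic energy distance with Euclidean distances between cells in some low-dimensional embedding; the README does not state which distance or embedding was used, so both are assumptions. The function names, toy coordinates, and dose values are hypothetical, and the code is standard-library Python for portability rather than the project's own implementation.

```python
import math
from itertools import product


def mean_pairwise_distance(a, b):
    """Mean Euclidean distance over all pairs (x, y) with x in a and y in b."""
    total = sum(math.dist(x, y) for x, y in product(a, b))
    return total / (len(a) * len(b))


def e_distance(treated, control):
    """Energy distance: 2*E|X-Y| - E|X-X'| - E|Y-Y'| (assumed definition)."""
    return (
        2 * mean_pairwise_distance(treated, control)
        - mean_pairwise_distance(treated, treated)
        - mean_pairwise_distance(control, control)
    )


def max_e_distance_dose(cells_by_dose, control):
    """Return (dose, e-distance) for the dose with the largest shift from control."""
    scores = {dose: e_distance(cells, control) for dose, cells in cells_by_dose.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical 2-D embeddings for one drug-cell line pair (toy values).
dmso = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
cells_by_dose = {
    0.05: [(0.5, 0.5), (0.5, 1.5), (1.5, 0.5), (1.5, 1.5)],
    5.0: [(3.0, 3.0), (3.0, 4.0), (4.0, 3.0), (4.0, 4.0)],
}
best_dose, best_score = max_e_distance_dose(cells_by_dose, dmso)
```

On real data the pairwise loops would normally be vectorized, since the number of cell pairs grows quadratically with population size.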
+#### Benchmark set-up
+We use logistic regression on the transcriptome-effect representations to predict whether a drug was approved for a tissue of interest, splitting drugs into train and test sets and evaluating the precision-recall curve for the test drugs. We treat the rate of approvals per organ as a technical confounder that must be accounted for.
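The benchmark described above might be sketched as follows. This is an illustration under stated assumptions, not the contents of `classifier.py`: the data are random toy values, the drug and organ names are hypothetical, the logistic regression is a plain gradient-descent version written with the standard library only, and the baseline (scoring each test pair by its organ's training approval rate) is one possible reading of "approval rate baseline".

```python
import math
import random


def split_by_drug(drugs, test_fraction=0.3, seed=0):
    """Assign whole drugs to train or test so that no drug appears in both."""
    unique = sorted(set(drugs))
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    return set(unique[n_test:]), set(unique[:n_test])


def predict_proba(x, w, b):
    """Logistic model: sigmoid of the linear score."""
    z = sum(wj * v for wj, v in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Unregularized logistic regression fitted by batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            grad_w = [g + err * v for g, v in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def average_precision(y_true, scores):
    """Sum over score thresholds of (recall gain) * precision; ties share a threshold."""
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, y_true), key=lambda p: -p[0])
    ap, tp, prev_recall, i = 0.0, 0, 0.0, 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            j += 1
        recall = tp / n_pos
        ap += (recall - prev_recall) * (tp / j)  # j = number of items scored so far
        prev_recall = recall
        i = j
    return ap


def organ_rate_baseline(train_rows):
    """Approval rate per organ, computed on training rows only."""
    counts = {}
    for r in train_rows:
        n, k = counts.get(r["organ"], (0, 0))
        counts[r["organ"]] = (n + 1, k + r["approved"])
    return {organ: k / n for organ, (n, k) in counts.items()}


# Hypothetical toy data: one row per drug-organ pair with random features and labels.
rng = random.Random(0)
rows = [
    {"drug": f"drug_{i}", "organ": organ,
     "features": [rng.gauss(0, 1), rng.gauss(0, 1)],
     "approved": int(rng.random() < 0.2)}
    for i in range(40) for organ in ("lung", "colon")
]

train_drugs, test_drugs = split_by_drug([r["drug"] for r in rows])
train = [r for r in rows if r["drug"] in train_drugs]
test = [r for r in rows if r["drug"] in test_drugs]

w, b = fit_logistic([r["features"] for r in train], [r["approved"] for r in train])
y_test = [r["approved"] for r in test]
model_ap = average_precision(y_test, [predict_proba(r["features"], w, b) for r in test])

rates = organ_rate_baseline(train)
baseline_ap = average_precision(y_test, [rates.get(r["organ"], 0.0) for r in test])
```

Comparing `model_ap` against `baseline_ap` rather than against chance is what accounts for the per-organ approval rate: a representation only adds value if it ranks test drugs better than the organ's approval frequency alone.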
+### Results
+None of the unsupervised multi-dimensional representations outperformed the approval rate baseline. However, we found that e-distance is consistently negatively associated with approval for conditions affecting the target tissue.
+### Discussion and Future Work
+With the concept established, we propose testing additional representations of the data, including MrVI single-cell sample-sample distances, differential gene expression or program expression, and cell counts. The framework is set up to test additional, more advanced prediction targets such as clinical trial phase success and adverse effect (AE) rate or severity.