| | --- |
| | license: cc-by-4.0 |
| | language: |
| | - en |
| | tags: |
| | - proteomics |
| | - mass-spectrometry |
| | - peptide-sequencing |
| | - de-novo |
| | - calibration |
| | - fdr |
| | --- |
| | |
| | ## Winnow HeLa Single Shot Probability Calibrator |
| |
|
| | [**Winnow**](https://github.com/instadeepai/winnow) recalibrates confidence scores and provides FDR control for *de novo* peptide sequencing (DNS) workflows. |
| | This repository contains the calibrator trained on HeLa Single Shot data as referenced in our paper: [De novo peptide sequencing rescoring and FDR estimation with Winnow](https://arxiv.org/abs/2509.24952). |
| |
|
| | - Intended inputs: spectrum data and the corresponding PSM predictions produced by [InstaNovo](https://github.com/instadeepai/instanovo). |
| | - Outputs: calibrated per-PSM probabilities in `calibrated_confidence`. |
| |
|
| | ### What’s inside |
| | - `calibrator.pkl`: trained classifier |
| | - `scaler.pkl`: feature standardiser |
| | - `irt_predictor.pkl`: Prosit iRT regressor used by RT features |
| |
|
| | --- |
| |
|
| | ## How to use |
| |
|
| | ### Python |
| | ```python |
| | from pathlib import Path |
| | from huggingface_hub import snapshot_download |
| | from winnow.calibration.calibrator import ProbabilityCalibrator |
| | from winnow.datasets.data_loaders import InstaNovoDatasetLoader |
| | from winnow.scripts.main import filter_dataset |
| | from winnow.fdr.nonparametric import NonParametricFDRControl |
| | |
| | # 1) Download model files |
| | helaqc_model = Path("helaqc_model") |
| | snapshot_download( |
| |     repo_id="InstaDeepAI/winnow-helaqc-model", |
| |     allow_patterns=["*.pkl"], |
| |     repo_type="model", |
| |     local_dir=helaqc_model, |
| | ) |
| | |
| | # 2) Load calibrator |
| | calibrator = ProbabilityCalibrator.load(helaqc_model) |
| | |
| | # 3) Load your dataset (spectrum data plus InstaNovo predictions) |
| | dataset = InstaNovoDatasetLoader().load( |
| |     data_path="path_to_spectrum_data.parquet", |
| |     predictions_path="path_to_instanovo_predictions.csv", |
| | ) |
| | dataset = filter_dataset(dataset)  # standard Winnow filtering |
| | |
| | # 4) Predict calibrated confidences |
| | calibrator.predict(dataset) # adds dataset.metadata["calibrated_confidence"] |
| | |
| | # 5) Optional: FDR control on calibrated confidence |
| | fdr = NonParametricFDRControl() |
| | fdr.fit(dataset.metadata["calibrated_confidence"]) |
| | cutoff = fdr.get_confidence_cutoff(0.05) # 5% FDR cutoff |
| | dataset.metadata["keep@5%"] = dataset.metadata["calibrated_confidence"] >= cutoff |
| | ``` |
| |
|
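Step 5 only makes sense because the confidences are calibrated: if each value is the probability that a PSM is correct, the expected error rate of an accepted set is the mean of `1 - confidence` over that set. The plain-Python sketch below illustrates that idea. It is not Winnow's `NonParametricFDRControl` implementation, and `confidence_cutoff` is a hypothetical helper introduced only for this example.

```python
def confidence_cutoff(confidences, fdr_threshold):
    """Return the lowest confidence whose accepted set stays within the FDR threshold.

    Assumption: each confidence is a calibrated probability of being correct,
    so the expected error contributed by one PSM is (1 - confidence).
    Returns None when no cutoff satisfies the threshold.
    """
    ranked = sorted(confidences, reverse=True)
    cutoff = None
    expected_errors = 0.0
    for i, conf in enumerate(ranked):
        expected_errors += 1.0 - conf
        # Only evaluate at the end of a run of tied values, because
        # `confidence >= cutoff` would accept every tied PSM at once.
        is_last_of_ties = i + 1 == len(ranked) or ranked[i + 1] < conf
        if is_last_of_ties and expected_errors / (i + 1) <= fdr_threshold:
            cutoff = conf
    return cutoff


if __name__ == "__main__":
    example = [0.99, 0.98, 0.95, 0.60, 0.30]
    print(confidence_cutoff(example, 0.05))
```

With the example values, accepting the top three PSMs gives an expected error rate of about 2.7%, while adding the fourth pushes it to 12%, so the cutoff sits at the third value.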
| | ### CLI |
| | ```bash |
| | # After `pip install winnow` |
| | winnow predict \ |
| |     --data-source instanovo \ |
| |     --dataset-config-path config_with_dataset_paths.yaml \ |
| |     --model-folder helaqc_model \ |
| |     --method winnow \ |
| |     --fdr-threshold 0.05 \ |
| |     --confidence-column calibrated_confidence \ |
| |     --output-path outputs/winnow_predictions.csv |
| | ``` |
| |
|
| | --- |
| |
|
| | ## Inputs and outputs |
| | **Required columns for calibration:** |
| | - Spectrum data (`*.parquet`) |
| |   - `spectrum_id` (string): unique spectrum identifier |
| |   - `sequence` (string): ground truth peptide sequence from database search (optional) |
| |   - `retention_time` (float): retention time (seconds) |
| |   - `precursor_mass` (float): mass of the precursor ion (from MS1) |
| |   - `mz_array` (list[float]): mass-to-charge values of the MS2 spectrum |
| |   - `intensity_array` (list[float]): intensity values of the MS2 spectrum |
| |   - `precursor_charge` (int): charge of the precursor (from MS1) |
| | |
| | - Beam predictions (`*_beams.csv`) |
| |   - `spectrum_id` (string) |
| |   - `sequence` (string): ground truth peptide sequence from database search (optional) |
| |   - `preds` (string): top prediction, untokenised sequence |
| |   - `preds_tokenised` (string): comma-separated tokens for the top prediction |
| |   - `log_probs` (float): top prediction log probability |
| |   - `preds_beam_k` (string): untokenised sequence for beam k (k≥0) |
| |   - `log_probs_beam_k` (float) |
| |   - `token_log_probs_k` (string/list-encoded): per-token log probabilities for beam k |
| |
|
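Before running calibration, it can help to check that the required columns are present and that the log-probability fields decode as expected. The sketch below uses only the standard library. The helper names are hypothetical, and it assumes list-encoded values are stored as Python-style list literals such as `"[-0.1, -0.25]"`; adjust the parsing if your files use a different encoding.

```python
import ast
import math

# Non-optional spectrum columns from the list above (`sequence` is optional).
REQUIRED_SPECTRUM_COLUMNS = {
    "spectrum_id",
    "retention_time",
    "precursor_mass",
    "mz_array",
    "intensity_array",
    "precursor_charge",
}


def missing_columns(columns, required=REQUIRED_SPECTRUM_COLUMNS):
    """Return the required column names that are absent, sorted alphabetically."""
    return sorted(set(required) - set(columns))


def parse_token_log_probs(value):
    """Decode per-token log probabilities given as a list or a list-encoded string."""
    if isinstance(value, str):
        value = ast.literal_eval(value)  # assumes a Python-style list literal
    return [float(v) for v in value]


def raw_confidence(log_prob):
    """Convert a log probability into a probability in [0, 1]."""
    return math.exp(log_prob)


if __name__ == "__main__":
    print(missing_columns(["spectrum_id", "mz_array"]))
    print(parse_token_log_probs("[-0.1, -0.25]"))
    print(raw_confidence(0.0))
```

With pandas, `missing_columns(df.columns)` gives the same check on a loaded data frame.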
| | **Output columns (added by Winnow's calibrator on `predict`):** |
| | - `calibrated_confidence`: calibrated probability |
| | - Optional (if requested): `psm_pep`, `psm_fdr`, `psm_qvalue` |
| | - All input columns are retained in-place |
| |
|
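To make the optional columns more concrete, the sketch below shows one common way such quantities relate to a calibrated probability. It assumes the posterior error probability (PEP) of a PSM is `1 - calibrated_confidence`, and that a q-value is the smallest expected FDR at which that PSM would still be accepted. This is an illustration under those assumptions, not a description of how Winnow computes `psm_pep`, `psm_fdr`, or `psm_qvalue`; `expected_q_values` is a hypothetical helper and does not treat tied confidences specially.

```python
def expected_q_values(confidences):
    """Return one q-value per PSM, in the same order as the input."""
    # Rank PSM indices from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)

    # Running expected FDR: mean PEP of everything accepted so far.
    running_fdr = []
    expected_errors = 0.0
    for rank, idx in enumerate(order, start=1):
        expected_errors += 1.0 - confidences[idx]  # PEP under the stated assumption
        running_fdr.append(expected_errors / rank)

    # q-value: minimum running FDR at this rank or any lower-confidence rank.
    q_values = [0.0] * len(confidences)
    best = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        best = min(best, running_fdr[rank])
        q_values[order[rank]] = best
    return q_values


if __name__ == "__main__":
    print(expected_q_values([0.9, 1.0, 0.5]))
```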
| | --- |
| |
|
| | ## Training data |
| |
|
| | - This calibrator was trained on the HeLa single-shot dataset (PXD044934). |
| | - All default features were enabled when training this model. |
| | - Predictions were obtained using InstaNovo v1.1.1 with knapsack beam search set to 50 beams. |
| |
|
| | --- |
| |
|
| | ## Citation |
| |
|
| | If you use `winnow` in your research, please cite our preprint: [De novo peptide sequencing rescoring and FDR estimation with Winnow](https://arxiv.org/abs/2509.24952) |
| |
|
| | ```bibtex |
| | @article{mabona2025novopeptidesequencingrescoring, |
| | title = {De novo peptide sequencing rescoring and FDR estimation with Winnow}, |
| | author = {Amandla Mabona and Jemma Daniel and Henrik Servais Janssen Knudsen and |
| | Rachel Catzel and Kevin Michael Eloff and Erwin M. Schoof and Nicolas |
| | Lopez Carranza and Timothy P. Jenkins and Jeroen Van Goey and |
| | Konstantinos Kalogeropoulos}, |
| | year = {2025}, |
| | eprint = {2509.24952}, |
| | archivePrefix = {arXiv}, |
| | primaryClass = {q-bio.QM}, |
| | url = {https://arxiv.org/abs/2509.24952}, |
| | } |
| | ``` |
| |
|
| | If you use this calibrator trained on HeLa Single Shot data, please cite: |
| |
|
| | ```bibtex |
| | @misc{instadeep_ltd_2025, |
| | author = { InstaDeep Ltd }, |
| | title = { winnow-helaqc-model (Revision b826cbb) }, |
| | year = 2025, |
| | url = { https://huggingface.co/InstaDeepAI/winnow-helaqc-model }, |
| | doi = { 10.57967/hf/6612 }, |
| | publisher = { Hugging Face } |
| | } |
| | ``` |
| |
|
| | If you use the `InstaNovo` model to generate predictions, please also cite: [InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments](https://doi.org/10.1038/s42256-025-01019-5) |
| |
|
| | ```bibtex |
| | @article{eloff_kalogeropoulos_2025_instanovo, |
| | title = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale |
| | proteomics experiments}, |
| | author = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell, |
| | Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen, |
| | Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J. |
| | and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars, |
| | Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and |
| | Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.}, |
| | year = 2025, |
| | month = {Mar}, |
| | day = 31, |
| | journal = {Nature Machine Intelligence}, |
| | doi = {10.1038/s42256-025-01019-5}, |
| | issn = {2522-5839}, |
| | url = {https://doi.org/10.1038/s42256-025-01019-5} |
| | } |
| | ``` |
| |
|
| | ## Contact |
| | For issues with input dataset structure or with using this model in Winnow, please open an issue on the Winnow GitHub repository: https://github.com/instadeepai/winnow |