---
license: apache-2.0
language:
- en
base_model:
- microsoft/codebert-base
---
# RocqStar Ranker Embedder
A self‑attentive embedding model for premise / proof selection in Rocq‑based interactive theorem proving (ITP). RocqStar is fine‑tuned from CodeBERT so that distances in the embedding space correlate with proof similarity rather than with surface‑level statement overlap. It replaces BM25/Jaccard similarity in retrieval‑augmented generation (RAG) pipelines such as CoqPilot, leading to higher proof success rates on the IMM‑300 benchmark.
## Model Details

### Model Description
RocqStar is a 125 M‑parameter Transformer encoder (768‑dim hidden size) with multi‑head self‑attention and a learned self‑attentive pooling head. It is trained with an InfoNCE contrastive objective so that the cosine similarity of two statement embeddings approximates the similarity of their proofs, measured by a hybrid Levenshtein + Jaccard distance. The model takes tokenised Rocq (Gallina) theorem statements as input and outputs a 768‑d embedding.
- Model type: Transformer encoder with self‑attentive pooling
- Language(s): Rocq / Coq (Gallina) syntax (tokens)
- License: Apache-2.0
- Fine-tuned from: [microsoft/codebert-base](https://huggingface.co/microsoft/codebert-base)
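
The card does not pin down how the hybrid Levenshtein + Jaccard proof distance mentioned above is weighted. The sketch below shows one plausible combination; the weight `alpha` and the whitespace tokenisation are illustrative assumptions, not the training implementation.

```python
# Illustrative sketch only: one way to combine normalised Levenshtein and
# Jaccard distances over proof scripts. The weight `alpha` and the
# whitespace tokenisation are assumptions, not the released recipe.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def hybrid_proof_distance(proof_a: str, proof_b: str, alpha: float = 0.5) -> float:
    # Normalised Levenshtein distance on the raw proof text.
    lev = levenshtein(proof_a, proof_b) / max(len(proof_a), len(proof_b), 1)
    # Jaccard distance on the sets of proof tokens.
    ta, tb = set(proof_a.split()), set(proof_b.split())
    jac = 1.0 - len(ta & tb) / max(len(ta | tb), 1)
    return alpha * lev + (1.0 - alpha) * jac
```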
### Model Sources
- Repository: https://github.com/JetBrains-Research/big-rocq
- Paper: (coming soon)
## Bias, Risks, and Limitations
- Domain limitation: Trained only on Rocq projects; performance degrades on other languages and on proofs that rely heavily on tactic macros.
- Data leakage: May memorise proofs present in training data and surface them verbatim.
- Proof similarity assumption: While improved over BM25, cosine distance may still fail on very long or highly creative proofs.
## How to Get Started
The `ranker-server` subdirectory of https://github.com/JetBrains-Research/big-rocq contains the server that must be running for the CoqPilot plugin to query the model; it also doubles as a reference implementation of model inference. A minimal standalone sketch is shown below.
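
The following sketch assumes the checkpoint can be loaded through the standard `transformers` API and uses CLS-token pooling as a stand-in for the learned self-attentive pooling head (which lives in `ranker-server`). The model id shown is a placeholder.

```python
# Minimal retrieval sketch. Assumptions: the checkpoint loads with the
# standard transformers API; CLS pooling approximates the learned
# self-attentive pooling head served by ranker-server.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "JetBrains-Research/rocqstar-ranker"  # hypothetical id; use the actual checkpoint path

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed(statements: list[str]) -> torch.Tensor:
    batch = tokenizer(statements, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state      # (B, T, 768)
    emb = hidden[:, 0]                             # CLS pooling as an approximation
    return torch.nn.functional.normalize(emb, dim=-1)

# Rank candidate premises for a goal statement by cosine similarity.
goal = "Lemma app_nil_r : forall (A : Type) (l : list A), l ++ nil = l."
premises = ["Lemma app_assoc : ...", "Lemma rev_involutive : ...", "Lemma app_length : ..."]
scores = embed([goal]) @ embed(premises).T         # cosine similarity (embeddings are normalised)
top7 = scores.squeeze(0).topk(k=min(7, len(premises))).indices.tolist()
print([premises[i] for i in top7])
```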
## Training Details

### Training Data
- Corpus: 76,524 statement/proof pairs mined from four large Rocq projects.
- Positive/negative mining: Proof distance < τ₊ (0.35) → positive; > τ₋ (0.6) → negative; 4 hard negatives per anchor.
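
The thresholds and hard-negative count above come from this card; the selection heuristic sketched below (treating proof-dissimilar statements that look most similar on the surface as the "hard" negatives) is an assumption about how mining might be done, not a description of the released pipeline.

```python
# Illustrative mining sketch. Thresholds and the hard-negative count come
# from the card; the "hard negative" heuristic (surface-similar statements
# with dissimilar proofs) is an assumption.
TAU_POS, TAU_NEG = 0.35, 0.6
N_HARD_NEGATIVES = 4

def jaccard_sim(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def mine_for_anchor(anchor_stmt, anchor_proof, corpus, proof_distance):
    # corpus: iterable of (statement, proof) pairs, anchor excluded.
    positives, candidates = [], []
    for stmt, proof in corpus:
        d = proof_distance(anchor_proof, proof)
        if d < TAU_POS:
            positives.append(stmt)
        elif d > TAU_NEG:
            candidates.append(stmt)
    # Hard negatives: proof-dissimilar statements that look most similar
    # to the anchor statement on the token level (assumed heuristic).
    candidates.sort(key=lambda s: jaccard_sim(anchor_stmt, s), reverse=True)
    return positives, candidates[:N_HARD_NEGATIVES]
```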
### Training Procedure
- Objective: InfoNCE with temperature T = 0.07 (see the sketch after this list)
- Batch size: 512 pairs effective (gradient accumulation: 8 × 64)
- Optimizer / LR: AdamW, lr = 4e-6, 10 % linear warm-up, 22k steps
- Hardware: 1× NVIDIA H100 GPU, 160 GB RAM, 14 h wall‑clock
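
For reference, a minimal version of the contrastive step described above. Only the temperature value is taken from this card; the exact way the 4 hard negatives enter the softmax is an assumption.

```python
# Minimal InfoNCE step with explicit hard negatives (assumed formulation;
# only temperature = 0.07 comes from the card).
import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  negatives: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # anchor, positive: (B, D); negatives: (B, K, D) hard negatives per anchor.
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1).unsqueeze(1)        # (B, 1, D)
    n = F.normalize(negatives, dim=-1)                    # (B, K, D)
    candidates = torch.cat([p, n], dim=1)                 # (B, 1+K, D)
    logits = (candidates @ a.unsqueeze(-1)).squeeze(-1) / temperature  # (B, 1+K)
    targets = torch.zeros(a.size(0), dtype=torch.long, device=a.device)  # positive sits at index 0
    return F.cross_entropy(logits, targets)
```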
## Evaluation

### Testing Data, Factors & Metrics
- Dataset: IMM‑300 (300 Rocq theorems) from the IMM project
- Metrics: Downstream proof success rate of CoqPilot when given top‑7 retrieved premises; averaged over 12 generations.
### Results
| Model (back-end) | Proof length bucket | Jaccard baseline | RocqStar |
|---|---|---|---|
| GPT-4o | ≤ 4 | 48 ± 5 % | 51 ± 5 % |
| GPT-4o | 5–8 | 18 ± 4 % | 25 ± 3 % |
| GPT-4o | 9–20 | 11 ± 4 % | 11 ± 5 % |
| Claude 3.5 | ≤ 4 | 58 ± 5 % | 61 ± 4 % |
| Claude 3.5 | 5–8 | 28 ± 5 % | 36 ± 5 % |
| Claude 3.5 | 9–20 | 16 ± 5 % | 21 ± 5 % |
#### Summary
RocqStar delivers consistent gains over token-overlap retrieval (up to a 28 % relative improvement), with the largest effect on medium-length theorems, where proof similarity diverges most from statement similarity.