---
license: gemma
datasets:
- hks350d/commit-message-generation
- Maxscha/commitbench
language:
- en
base_model:
- google/gemma-3-270m-it
---

# Git Diff -> Commit Message (Gemma 3 270M IT + LoRA)

A small, fast model specialized to turn a git diff into a concise, English commit message. Built on top of `google/gemma-3-270m-it` and fine-tuned with LoRA using MLX on macOS.

## Requirements

- macOS with Apple Silicon (for MLX)
- Python 3.8+
- Required packages:

```bash
pip install mlx-lm transformers
```

## What this model expects (most important)

- Input type: a unified git diff as plain text.
- Wrap the diff in a Markdown code fence labeled `diff` for best results.
- The diff should look like the output of `git diff --no-color` (hunk headers like `@@`, `+`/`-` line prefixes, file headers, etc.).
- Keep diffs reasonably sized. The training/CLI path truncates diffs to ~3,000 characters and trains/infers with a context window of ~2,048 tokens. Extremely large diffs should be summarized or sampled.
- Language of response: English only. The system prompt enforces English output.
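
The size limit above can be applied before the prompt is built. The sketch below is one way to do it, not the repository's actual preprocessing: `truncate_diff` and its `max_chars=3000` default are assumptions for illustration, and cutting at a line boundary is a choice made here so the model never sees half a diff line.

```python
def truncate_diff(diff_text: str, max_chars: int = 3000) -> str:
    """Trim a diff to at most max_chars, cutting at the last complete line."""
    if len(diff_text) <= max_chars:
        return diff_text
    clipped = diff_text[:max_chars]
    # If the cut landed mid-line, drop the trailing partial line.
    if diff_text[max_chars] != "\n":
        last_newline = clipped.rfind("\n")
        if last_newline > 0:
            clipped = clipped[:last_newline]
    return clipped
```

A character limit is only a rough proxy for the ~2,048-token window; counting tokens with the tokenizer would be more precise, at the cost of loading it first.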

### Training Data Format

This model was trained on the `data/train_gpt-oss-20b.jsonl` dataset in this repository. The training data uses Gemma's chat template format with the following exact structure:

**User prompt format (as seen in training data):**

````
Generate a concise and descriptive commit message for this git diff:

```diff
diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
index <HASH>..<HASH> 100644
--- a/src/ossos-pipeline/scripts/update_astrometry.py
+++ b/src/ossos-pipeline/scripts/update_astrometry.py
@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
     cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
     cutout.zmag = new_zp
 
+    if math.fabs(new_zp - old_zp) > 0.3:
+        logging.warning("Large change in zeropoint detected: {} -> {}".format(old_zp, new_zp))
+
     try:
-        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
+        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
         (x, y) = cutout.get_observed_coordinates((x, y))
     except:
         logging.warn("Failed to do photometry.")
```
````

**Important:** To get the best results, match this exact format, including:

- The instruction text: "Generate a concise and descriptive commit message for this git diff:"
- The double newline (one blank line) after the instruction
- The diff wrapped in triple backticks with the `diff` language tag
- Hash placeholders shown as `<HASH>..<HASH>` in the diff `index` lines
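
A minimal sketch of assembling the user message in this format follows. The helper names `normalize_hashes` and `build_user_message` are hypothetical, and the regular expression assumes that the `index` header lines are the only place where hashes were replaced by `<HASH>` in the training data; check a few training records before relying on it.

```python
import re

INSTRUCTION = "Generate a concise and descriptive commit message for this git diff:"

# Matches lines such as "index e69de29..f4c3b4a 100644" (the trailing mode is left untouched).
INDEX_LINE = re.compile(r"^index [0-9a-f]+\.\.[0-9a-f]+", re.MULTILINE)


def normalize_hashes(diff_text: str) -> str:
    """Replace blob hashes on `index` lines with the <HASH> placeholder."""
    return INDEX_LINE.sub("index <HASH>..<HASH>", diff_text)


def build_user_message(diff_text: str) -> str:
    """Return the instruction, a blank line, then the diff inside a diff-tagged fence."""
    body = normalize_hashes(diff_text.strip())
    return f"{INSTRUCTION}\n\n```diff\n{body}\n```"
```

The returned string is meant to be used as the `content` of the user turn; the chat template itself is still applied by the tokenizer.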

### Chat template (Gemma 3)

The model was trained using Gemma's chat template, with a system prompt that enforces English-only responses. The same template is used at inference. The conceptual structure is:

- system: "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
- user: The exact format shown above
- assistant: single-line commit message (target)

Training data (chat format) examples were stored like:

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."},
    {"role": "user", "content": "Generate a concise and descriptive commit message for this git diff:\n\n```diff\n<diff text>\n```"},
    {"role": "assistant", "content": "<single-line commit message>"}
  ]
}
```
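
For anyone preparing additional data in the same layout, a record can be serialized one per line with the standard `json` module. This is a sketch under assumptions: `make_record` is a hypothetical helper, not a function from this repository, and reducing the target to its first line is a choice made here to keep the single-line property.

```python
import json

SYSTEM_PROMPT = (
    "You are a helpful assistant that generates git commit messages. "
    "Always respond in English only. Do not use any other language."
)
INSTRUCTION = "Generate a concise and descriptive commit message for this git diff:"


def make_record(diff_text: str, commit_message: str) -> str:
    """Return one JSONL line in the chat format shown above."""
    stripped = commit_message.strip()
    subject = stripped.splitlines()[0] if stripped else ""
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{INSTRUCTION}\n\n```diff\n{diff_text}\n```"},
            {"role": "assistant", "content": subject},
        ]
    }
    # ensure_ascii=False keeps non-ASCII characters in diffs readable.
    return json.dumps(record, ensure_ascii=False)
```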

## Output

- A single-line commit subject, in English.
- The CLI post-processes the generation and returns the first non-empty line.
- Keep it concise and descriptive; optionally target ~50–72 characters where possible.
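
The post-processing described above is small enough to sketch directly. `first_nonempty_line` and `within_subject_limit` are illustrative names, not the CLI's actual functions, and the 72-character default is simply the upper end of the range mentioned above.

```python
from typing import Optional


def first_nonempty_line(generated: str) -> Optional[str]:
    """Return the first line with visible content, or None if there is none."""
    for line in generated.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return None


def within_subject_limit(subject: str, limit: int = 72) -> bool:
    """Check a commit subject against a soft length limit."""
    return len(subject) <= limit
```

A caller might regenerate or warn when `within_subject_limit` returns `False` rather than truncating, since a cut-off subject is usually worse than a slightly long one.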

## Quick usage

### Python Script (MLX)

Here's a complete standalone script to generate commit messages using this model. The usage examples that follow it assume the script is saved as `generate_commit.py`:

```python
#!/usr/bin/env python3
"""
Standalone script to generate git commit messages using the fine-tuned Gemma model.
Requires: mlx-lm, transformers
Install with: pip install mlx-lm transformers
"""

import subprocess
import sys

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
from transformers import AutoTokenizer


def get_staged_diff():
    """Get the staged git diff from the current repository."""
    try:
        result = subprocess.run(
            ['git', 'diff', '--staged', '--no-color'],
            capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        print("Error: Could not get git diff. Make sure you're in a git repository with staged changes.")
        return None


def format_prompt(diff_text, tokenizer):
    """Format the diff into the exact training data format."""
    system_prompt = "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
    user_message = f"Generate a concise and descriptive commit message for this git diff:\n\n```diff\n{diff_text}\n```"

    # Format using Gemma chat template
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    return prompt


def generate_commit_message(diff_text, model_path="your-username/git-diff-to-commit-gemma-3-270m"):
    """Generate a commit message from a git diff."""
    # Load model and tokenizer
    print("Loading model...")
    model, mlx_tokenizer = load(model_path)
    hf_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")

    # Format the prompt
    prompt = format_prompt(diff_text, hf_tokenizer)

    # Recent mlx-lm releases take sampling settings through a sampler object
    # instead of temp/top_p keyword arguments on generate().
    sampler = make_sampler(temp=0.7, top_p=0.9)

    # Generate response
    print("Generating commit message...")
    response = generate(
        model,
        mlx_tokenizer,
        prompt=prompt,
        max_tokens=100,
        sampler=sampler,
        verbose=False
    )

    # generate() returns only the newly generated text, so the prompt
    # does not need to be sliced off.
    generated_text = response.strip()

    # Return the first non-empty line
    lines = [line.strip() for line in generated_text.split('\n') if line.strip()]
    return lines[0] if lines else "Unable to generate commit message"


def main():
    """Main function - can be used with staged diff or provided diff text."""
    if len(sys.argv) > 1:
        # Use provided diff file
        diff_file = sys.argv[1]
        try:
            with open(diff_file, 'r') as f:
                diff_text = f.read().strip()
        except FileNotFoundError:
            print(f"Error: File {diff_file} not found.")
            return
    else:
        # Get staged diff from git
        diff_text = get_staged_diff()
        if diff_text is None:
            # get_staged_diff() already reported the error
            return
        if not diff_text:
            print("No staged changes found. Stage some changes with 'git add' first.")
            return

    if not diff_text:
        print("No diff content to process.")
        return

    # Generate and print commit message
    commit_message = generate_commit_message(diff_text)
    print("\nSuggested commit message:")
    print(f"  {commit_message}")


if __name__ == "__main__":
    main()
```

### Usage Examples

1. **Generate from staged git changes:**

```bash
python generate_commit.py
```

2. **Generate from a diff file:**

```bash
python generate_commit.py my_changes.diff
```

3. **Use in your own code:**

```python
from generate_commit import generate_commit_message

diff = """diff --git a/app.py b/app.py
index e69de29..f4c3b4a 100644
--- a/app.py
+++ b/app.py
@@ -0,0 +1,2 @@
+def add(a, b):
+    return a + b
"""

message = generate_commit_message(diff)
print(message)
```

## Examples

Input (user message content as formatted in training data):

````
Generate a concise and descriptive commit message for this git diff:

```diff
diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
index <HASH>..<HASH> 100644
--- a/src/ossos-pipeline/scripts/update_astrometry.py
+++ b/src/ossos-pipeline/scripts/update_astrometry.py
@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
     cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
     cutout.zmag = new_zp
 
+    if math.fabs(new_zp - old_zp) > 0.3:
+        logging.warning("Large change in zeropoint detected: {} -> {}".format(old_zp, new_zp))
+
     try:
-        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
+        (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
         (x, y) = cutout.get_observed_coordinates((x, y))
     except:
         logging.warn("Failed to do photometry.")
```
````

Possible outputs:

- fix: use new_zp instead of old_zp for magnitude calculation and add zeropoint change warning
- fix: correct zeropoint usage in photometry and add warning for large zeropoint changes
- refactor: update magnitude calculation to use new zeropoint and add change detection

## Training summary

- Base model: `google/gemma-3-270m-it` (Gemma 3, 270M, instruction-tuned).
- Method: LoRA fine-tuning with MLX (`mlx_lm lora`). Prompt masking was enabled, so the loss is computed only on the assistant response.
- **Training data**: `data/train_gpt-oss-20b.jsonl` in this repository, a dataset converted to chat format with diffs in code fences tagged `diff` and English, single-line commit messages as targets. This dataset is Python-focused.
- Data format: Each training example uses the exact user prompt format shown above in the chat template structure.
- Context/config highlights: max sequence length ~2,048 tokens; diffs truncated to ~3,000 characters during preprocessing/inference to fit the context window.
- **Important**: To achieve best results, match the exact input format used in the training data.

## Evaluation

- The repo includes a lightweight evaluation that compares generated messages to a reference using a simple string similarity (`SequenceMatcher`) across multiple runs with different RNG seeds. Results and artifacts are saved under `evaluation_results/`.
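
The similarity measure is not specified beyond `SequenceMatcher`; a minimal sketch of such a comparison with Python's `difflib` might look like the following. The lowercasing and whitespace stripping, and the `mean_similarity` helper, are assumptions made here and not the repository's evaluation code.

```python
from difflib import SequenceMatcher
from statistics import mean


def similarity(generated: str, reference: str) -> float:
    """Ratio in [0, 1]; 1.0 means the two normalized strings are identical."""
    return SequenceMatcher(None, generated.lower().strip(), reference.lower().strip()).ratio()


def mean_similarity(candidates: list, reference: str) -> float:
    """Average similarity over several generations (for example, one per seed)."""
    return mean(similarity(c, reference) for c in candidates)
```

Character-level similarity rewards shared wording rather than shared meaning, so two equally valid commit messages can score low against each other; the numbers are best read as a rough regression signal.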

## Limitations and risks

- Diff size sensitivity: Very large diffs may be truncated; consider summarizing large changes.
- Domain bias: The training set emphasized Python diffs, so results may be weaker on repositories in other languages.
- Hallucinations: As with any LLM, the model may produce generic or mismatched messages if the diff is ambiguous.
- Security: Do not include secrets in diffs passed to the model; generated text may inadvertently paraphrase sensitive context.
- Language: The system prompt enforces English responses.

## Intended use

- Assist developers by proposing a concise commit subject from a given git diff.
- Not a replacement for human judgment; review messages before committing.

## How to format inputs yourself

If you’re not using the CLI helpers, follow this structure with the Gemma chat template:

- system: English-only instruction for commit message generation (see above)
- user: instruction + the diff in a code fence tagged `diff`
- assistant: the target single-line subject (for training), or omitted (for inference, where the template's generation prompt takes its place)

The repository’s `format_commit_message_prompt` builds the correct prompt for Gemma 3.

## License and credits

- Base model: Google Gemma 3 (`google/gemma-3-270m-it`). Use is subject to the Gemma license terms.
- Fine-tuning code: MLX and utilities in this repository. See the repository license for details.