	---
	license: gemma
	datasets:
	- hks350d/commit-message-generation
	- Maxscha/commitbench
	language:
	- en
	base_model:
	- google/gemma-3-270m-it
	---
	# Git Diff -> Commit Message (Gemma 3 270M IT + LoRA)

	A small, fast model specialized to turn a git diff into a concise, English commit message. Built on top of `google/gemma-3-270m-it` and fine-tuned with LoRA using MLX on macOS.

	## Requirements

	- macOS with Apple Silicon (for MLX)
	- Python 3.8+
	- Required packages:
	```bash
	pip install mlx-lm transformers
	```

	## What this model expects (most important)

	- Input type: a unified git diff as plain text.
	- Wrap the diff in a Markdown code fence labeled `diff` for best results.
	- The diff should look like the output of `git diff --no-color` (hunk headers like `@@`, `+`/`-` line prefixes, file headers, etc.).
- Keep diffs reasonably sized. Both training and the CLI truncate diffs to ~3,000 characters and use a context window of ~2,048 tokens. Summarize or sample extremely large diffs before passing them in.
	- Language of response: English only. The system prompt enforces English output.
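
The size limit and fencing described above can be sketched as follows. This is a minimal illustration, not the repository's own preprocessing: the helper names `truncate_diff` and `fence_diff` are hypothetical, and cutting at a line boundary (rather than at exactly 3,000 characters) is an assumption made to avoid ending on a partial diff line.

```python
MAX_DIFF_CHARS = 3000  # approximate limit from this README; the exact cutoff is an assumption


def truncate_diff(diff_text: str, max_chars: int = MAX_DIFF_CHARS) -> str:
    """Shorten a diff to at most max_chars, preferring to cut at a line boundary."""
    if len(diff_text) <= max_chars:
        return diff_text
    # Look for the last newline inside the allowed window.
    cut = diff_text.rfind("\n", 0, max_chars)
    # Fall back to a hard cut if the window holds a single very long line.
    return diff_text[:cut] if cut > 0 else diff_text[:max_chars]


def fence_diff(diff_text: str) -> str:
    """Wrap a diff in a Markdown code fence labeled `diff`."""
    return f"```diff\n{diff_text}\n```"
```

Calling `fence_diff(truncate_diff(raw_diff))` would then give the fenced block that goes into the user message.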

	### Training Data Format

	This model was trained on the `data/train_gpt-oss-20b.jsonl` dataset in this repository. The training data uses Gemma's chat template format with the following exact structure:

	User prompt format (as seen in training data):
````
Generate a concise and descriptive commit message for this git diff:

```diff
diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
index <HASH>..<HASH> 100644
--- a/src/ossos-pipeline/scripts/update_astrometry.py
+++ b/src/ossos-pipeline/scripts/update_astrometry.py
@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
cutout.zmag = new_zp

+ if math.fabs(new_zp - old_zp) > 0.3:
+ logging.warning("Large change in zeropoint detected: {} -> {}".format(old_zp, new_zp))
+
try:
- (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
+ (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
(x, y) = cutout.get_observed_coordinates((x, y))
except:
logging.warn("Failed to do photometry.")
```
````

	Important: To get the best results, match this exact format including:
	- The instruction text: "Generate a concise and descriptive commit message for this git diff:"
	- The double newline after the instruction
	- The diff wrapped in triple backticks with `diff` language tag
	- Hash placeholders shown as `<HASH>..<HASH>` in the diff headers
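
Since a raw `git diff` contains real blob hashes rather than `<HASH>` placeholders, one way to match the training format is to mask them before building the prompt. The sketch below is an assumption about how this could be done, not the repository's own preprocessing; `mask_hashes` is a hypothetical helper and only rewrites `index` lines.

```python
import re

# Matches git index lines such as "index e69de29..f4c3b4a 100644".
INDEX_LINE = re.compile(r"^index [0-9a-f]+\.\.[0-9a-f]+", re.MULTILINE)


def mask_hashes(diff_text: str) -> str:
    """Replace the blob hashes on `index` lines with <HASH> placeholders."""
    return INDEX_LINE.sub("index <HASH>..<HASH>", diff_text)
```

The file mode after the hashes (for example `100644`) is left untouched, matching the training example above.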

	### Chat template (Gemma 3)
	The model was trained using Gemma's chat template with the system prompt enforcing English-only responses. The conceptual structure is:

	- system: "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
	- user: The exact format shown above
	- assistant: single-line commit message (target)

	Training data (chat format) examples were stored like:

	```json
	{
	"messages": [
	{"role": "system", "content": "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."},
	{"role": "user", "content": "Generate a concise and descriptive commit message for this git diff:\n\n```diff\n<diff text>\n```"},
	{"role": "assistant", "content": "<single-line commit message>"}
	]
	}
	```
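
For readers preparing their own data in the same shape, a record like the one above could be produced roughly as follows. The function names `make_record` and `to_jsonl_line` are hypothetical; only the prompt strings and the `messages` layout come from this README.

```python
import json

SYSTEM_PROMPT = (
    "You are a helpful assistant that generates git commit messages. "
    "Always respond in English only. Do not use any other language."
)
INSTRUCTION = "Generate a concise and descriptive commit message for this git diff:"


def make_record(diff_text: str, commit_message: str) -> dict:
    """Build one chat-format training example (system, user, assistant)."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{INSTRUCTION}\n\n```diff\n{diff_text}\n```"},
            {"role": "assistant", "content": commit_message},
        ]
    }


def to_jsonl_line(record: dict) -> str:
    """Serialize a record as a single JSONL line (no trailing newline)."""
    return json.dumps(record, ensure_ascii=False)
```

Writing one `to_jsonl_line(...)` result per line yields a JSONL file with the structure shown above.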

	## Output

	- A single-line commit subject, in English.
	- The CLI post-processes the generation and returns the first non-empty line.
	- Keep it concise and descriptive; optionally target ~50–72 characters where possible.
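
The post-processing step described above (take the first non-empty line) can be sketched like this. The helper names are hypothetical, and the 72-character check is only a convenience based on the length target mentioned above, not something the CLI is known to enforce.

```python
def first_nonempty_line(text: str, fallback: str = "") -> str:
    """Return the first line that is not blank, stripped of surrounding whitespace."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return fallback


def within_subject_length(subject: str, limit: int = 72) -> bool:
    """Check whether a commit subject fits the suggested length limit."""
    return len(subject) <= limit
```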

	## Quick usage

	### Python Script (MLX)

	Here's a complete standalone script to generate commit messages using this model:

```python
#!/usr/bin/env python3
"""
Standalone script to generate git commit messages using the fine-tuned Gemma model.
Requires: mlx-lm, transformers
Install with: pip install mlx-lm transformers
"""

import subprocess
import sys

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
from transformers import AutoTokenizer


def get_staged_diff():
    """Get the staged git diff from the current repository."""
    try:
        result = subprocess.run(
            ['git', 'diff', '--staged', '--no-color'],
            capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    except subprocess.CalledProcessError:
        print("Error: Could not get git diff. Make sure you're in a git repository with staged changes.")
        return None


def format_prompt(diff_text, tokenizer):
    """Format the diff into the exact training data format."""
    system_prompt = "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
    user_message = f"Generate a concise and descriptive commit message for this git diff:\n\n```diff\n{diff_text}\n```"

    # Format using Gemma chat template
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    return prompt


def generate_commit_message(diff_text, model_path="your-username/git-diff-to-commit-gemma-3-270m"):
    """Generate a commit message from a git diff."""

    # Load model and tokenizer
    print("Loading model...")
    model, mlx_tokenizer = load(model_path)
    hf_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")

    # Format the prompt
    prompt = format_prompt(diff_text, hf_tokenizer)

    # Recent mlx-lm releases take sampling settings through a sampler object;
    # older releases accepted temp/top_p keyword arguments on generate() directly.
    sampler = make_sampler(temp=0.7, top_p=0.9)

    # Generate response
    print("Generating commit message...")
    response = generate(
        model,
        mlx_tokenizer,
        prompt=prompt,
        max_tokens=100,
        sampler=sampler,
        verbose=False
    )

    # generate() returns only the newly generated text, so the prompt
    # does not need to be sliced off.
    generated_text = response.strip()

    # Return the first non-empty line
    lines = [line.strip() for line in generated_text.split('\n') if line.strip()]
    return lines[0] if lines else "Unable to generate commit message"


def main():
    """Main function - can be used with staged diff or provided diff text."""

    if len(sys.argv) > 1:
        # Use provided diff file
        diff_file = sys.argv[1]
        try:
            with open(diff_file, 'r', encoding='utf-8') as f:
                diff_text = f.read().strip()
        except FileNotFoundError:
            print(f"Error: File {diff_file} not found.")
            return
    else:
        # Get staged diff from git
        diff_text = get_staged_diff()
        if not diff_text:
            print("No staged changes found. Stage some changes with 'git add' first.")
            return

    # A diff file can exist but be empty
    if not diff_text:
        print("No diff content to process.")
        return

    # Generate and print commit message
    commit_message = generate_commit_message(diff_text)
    print("\nSuggested commit message:")
    print(f" {commit_message}")


if __name__ == "__main__":
    main()
```

	### Usage Examples

	1. Generate from staged git changes:
	```bash
	python generate_commit.py
	```

	2. Generate from a diff file:
	```bash
	python generate_commit.py my_changes.diff
	```

	3. Use in your own code:
	```python
	from generate_commit import generate_commit_message

diff = """diff --git a/app.py b/app.py
index e69de29..f4c3b4a 100644
--- a/app.py
+++ b/app.py
@@ -0,0 +1,2 @@
+def add(a, b):
+    return a + b
"""

	message = generate_commit_message(diff)
	print(message)
	```

	## Examples

	Input (user message content as formatted in training data):

````
Generate a concise and descriptive commit message for this git diff:

```diff
diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
index <HASH>..<HASH> 100644
--- a/src/ossos-pipeline/scripts/update_astrometry.py
+++ b/src/ossos-pipeline/scripts/update_astrometry.py
@@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
cutout.zmag = new_zp

+ if math.fabs(new_zp - old_zp) > 0.3:
+ logging.warning("Large change in zeropoint detected: {} -> {}".format(old_zp, new_zp))
+
try:
- (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
+ (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
(x, y) = cutout.get_observed_coordinates((x, y))
except:
logging.warn("Failed to do photometry.")
```
````


	Possible outputs:
	- fix: use new_zp instead of old_zp for magnitude calculation and add zeropoint change warning
	- fix: correct zeropoint usage in photometry and add warning for large zeropoint changes
	- refactor: update magnitude calculation to use new zeropoint and add change detection

	## Training summary

	- Base model: `google/gemma-3-270m-it` (Gemma 3, 270M, instruction-tuned).
- Method: LoRA fine-tuning with MLX (`mlx_lm lora`). Prompt masking was enabled, so the loss is computed only on the assistant response and not on the system or user turns.
	- Training data: `data/train_gpt-oss-20b.jsonl` in this repository - a dataset converted to chat format with diffs fenced as ```diff and English, single-line commit messages as targets. This dataset is Python-focused.
	- Data format: Each training example uses the exact user prompt format shown above in the chat template structure.
	- Context/config highlights: max sequence length ~2048 tokens; diffs truncated to ~3,000 characters during preprocessing/inference to be model-friendly.
	- Important: To achieve best results, match the exact input format used in the training data.
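
To illustrate what prompt masking means in the summary above, here is a small framework-free sketch. It is not the MLX training code: the function names are hypothetical, and per-token losses are assumed to be already computed as plain floats.

```python
from typing import List


def build_loss_mask(prompt_len: int, total_len: int) -> List[int]:
    """Return 0 for prompt tokens and 1 for assistant (response) tokens."""
    if not 0 <= prompt_len <= total_len:
        raise ValueError("prompt_len must be between 0 and total_len")
    return [0] * prompt_len + [1] * (total_len - prompt_len)


def masked_mean_loss(token_losses: List[float], mask: List[int]) -> float:
    """Average the per-token losses over the unmasked (response) positions only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(token_losses, mask)) / kept
```

With a mask like this, the diff and instruction tokens still provide context but contribute nothing to the training loss.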

	## Evaluation

	- The repo includes a lightweight evaluation that compares generated messages to a reference using a simple string similarity (SequenceMatcher) across multiple runs (varying the RNG seed). Results and artifacts are saved under `evaluation_results/`.
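
A similarity check of this kind can be sketched with the standard library's `difflib.SequenceMatcher`. The function names and the averaging over runs are assumptions for illustration; the repository's evaluation script may normalize text or aggregate scores differently.

```python
from difflib import SequenceMatcher
from statistics import mean
from typing import Sequence


def similarity(generated: str, reference: str) -> float:
    """String similarity in [0, 1] between a generated and a reference message."""
    return SequenceMatcher(None, generated, reference).ratio()


def mean_similarity(generations: Sequence[str], reference: str) -> float:
    """Average similarity over several generations (for example, one per seed)."""
    if not generations:
        raise ValueError("need at least one generated message")
    return mean(similarity(g, reference) for g in generations)
```

A score near 1.0 means the generated subject is almost character-identical to the reference, so this metric rewards wording overlap rather than semantic equivalence.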

	## Limitations and risks

	- Diff size sensitivity: Very large diffs may be truncated; consider summarizing large changes.
	- Domain bias: Training set emphasized Python diffs; behavior may be better for Python-heavy repos.
	- Hallucinations: As with any LLM, may produce generic or mismatched messages if the diff is ambiguous.
	- Security: Do not feed secrets; generated text may inadvertently paraphrase sensitive context.
	- Language: System prompt enforces English responses.

	## Intended use

	- Assist developers by proposing a concise commit subject from a given git diff.
	- Not a replacement for human judgment; review messages before committing.

	## How to format inputs yourself

	If you’re not using the CLI helpers, follow this structure with the Gemma chat template:

	- system: English-only instruction for commit message generation (see above)
	- user: instruction + the diff in ```diff code fences
	- assistant: the target single-line subject (for training) or left empty (for inference)

	The repository’s `format_commit_message_prompt` builds the correct prompt for Gemma 3.

	## License and credits

	- Base model: Google Gemma 3 (`google/gemma-3-270m-it`). Use subject to the Gemma license terms.
	- Fine-tuning code: MLX and utilities in this repository. See repository license for details.