hks350d committed · Commit e4c847b · verified · 1 Parent(s): 71a54db

Update README.md

Files changed (1)
  1. README.md +213 -65
README.md CHANGED
@@ -12,19 +12,65 @@ base_model:
12
 
13
  A small, fast model specialized to turn a git diff into a concise, English commit message. Built on top of `google/gemma-3-270m-it` and fine-tuned with LoRA using MLX on macOS.
14
 
15
  ## What this model expects (most important)
16
 
17
  - Input type: a unified git diff as plain text.
18
- - Wrap the diff in a Markdown code fence labeled `diff` for best results:
19
- ```
20
- ```diff
21
- <your unified git diff here>
22
- ```
23
- ```
24
  - The diff should look like the output of `git diff --no-color` (hunk headers like `@@`, `+`/`-` line prefixes, file headers, etc.).
25
  - Keep diffs reasonably sized. The training/CLI path truncates diffs to ~3,000 characters, and both training and inference use a context window of ~2,048 tokens. Extremely large diffs should be summarized or sampled.
26
  - Language of response: English only. The system prompt enforces English output.
27
 
28
  ### Chat template (Gemma 3)
29
  Both training and inference use Gemma’s chat template. Conceptually:
30
 
@@ -52,83 +98,185 @@ Training data (chat format) examples were stored like:
52
 
53
  ## Quick usage
54
 
55
- ### CLI (included in this repo)
56
- - From a staged diff in your current repo:
57
-
58
- ```bash
59
- python commit_msg_cli.py run --from-git --staged --adapter \
60
- --model google/gemma-3-270m-it \
61
- --adapter-path ./adapters
62
- ```
63
 
64
- - From a diff file:
65
-
66
- ```bash
67
- python commit_msg_cli.py run --diff path/to/diff.txt --adapter \
68
- --model google/gemma-3-270m-it \
69
- --adapter-path ./adapters
70
- ```
71
-
72
- The CLI will wrap your diff with the expected prompt/template and return a single-line message.
73
-
74
- ### Programmatic (MLX)
75
 
76
  ```python
77
- from mlx_lm.utils import load as mlx_load
78
- from mlx_lm.generate import generate
79
- from chat_template_utils import get_gemma_tokenizer, format_commit_message_prompt
80
- from mlx_lm import sample_utils
81
-
82
- model_name = "google/gemma-3-270m-it"
83
- adapter_path = "./adapters" # or a specific run dir
84
-
85
- diff_text = """diff --git a/app.py b/app.py
86
- index e69de29..f4c3b4a 100644
87
- --- a/app.py
88
- +++ b/app.py
89
- @@ -0,0 +1,3 @@
90
- +def add(a, b):
91
- + return a + b
92
- +"""
93
-
94
- # Load with adapter if available
95
- model, tok = mlx_load(model_name, adapter_path=adapter_path)
96
-
97
- # Use Gemma chat template for the prompt
98
- tokenizer = get_gemma_tokenizer(model_name)
99
- prompt = format_commit_message_prompt(diff_text, tokenizer, include_generation_prompt=True)
100
-
101
- sampler = sample_utils.make_sampler(temp=0.7, top_p=0.9, top_k=64)
102
- out = generate(model, tok, prompt=prompt, max_tokens=100, verbose=False, sampler=sampler)
103
- print(out)
104
  ```
105
 
106
  ## Examples
107
 
108
- Input (user message content):
 
 
 
109
 
110
  ```diff
111
- diff --git a/app.py b/app.py
112
- index e69de29..f4c3b4a 100644
113
- --- a/app.py
114
- +++ b/app.py
115
- @@ -0,0 +1,3 @@
116
- +def add(a, b):
117
- + return a + b
 
 
 
118
  +
 
 
 
 
 
 
 
119
  ```
120
 
121
  Possible outputs:
122
- - Add simple add() helper
123
- - Implement add function
124
- - Introduce add utility for two-number sum
125
 
126
  ## Training summary
127
 
128
  - Base model: `google/gemma-3-270m-it` (Gemma 3, 270M, instruction-tuned).
129
  - Method: LoRA fine-tuning with MLX (`mlx_lm lora`). Prompt masking was enabled so the model learns from the assistant response.
130
- - Data: Local JSONL converted to chat format with diffs fenced as ```diff and English, single-line commit messages as targets. In this repo, the dataset used (`data/train_gpt-oss-20b.jsonl`) was parsed and converted to a chat messages format. This particular set is Python-focused.
 
131
  - Context/config highlights: max sequence length ~2048 tokens; diffs truncated to ~3,000 characters during preprocessing/inference to be model-friendly.
 
132
 
133
  ## Evaluation
134
 
@@ -160,4 +308,4 @@ The repository’s `format_commit_message_prompt` builds the correct prompt for
160
  ## License and credits
161
 
162
  - Base model: Google Gemma 3 (`google/gemma-3-270m-it`). Use subject to the Gemma license terms.
163
- - Fine-tuning code: MLX and utilities in this repository. See repository license for details.
 
12
 
13
  A small, fast model specialized to turn a git diff into a concise, English commit message. Built on top of `google/gemma-3-270m-it` and fine-tuned with LoRA using MLX on macOS.
14
 
15
+ ## Requirements
16
+
17
+ - macOS with Apple Silicon (for MLX)
18
+ - Python 3.8+
19
+ - Required packages:
20
+ ```bash
21
+ pip install mlx-lm transformers
22
+ ```
23
+
24
  ## What this model expects (most important)
25
 
26
  - Input type: a unified git diff as plain text.
27
+ - Wrap the diff in a Markdown code fence labeled `diff` for best results.
28
  - The diff should look like the output of `git diff --no-color` (hunk headers like `@@`, `+`/`-` line prefixes, file headers, etc.).
29
  - Keep diffs reasonably sized. The training/CLI path truncates diffs to ~3,000 characters, and both training and inference use a context window of ~2,048 tokens. Extremely large diffs should be summarized or sampled.
30
  - Language of response: English only. The system prompt enforces English output.
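
The size limit above can be applied before building the prompt. The following is a minimal sketch, not the repository's actual truncation code (which is not shown here): the name `truncate_diff`, the constant `MAX_DIFF_CHARS`, and the choice to cut at a line boundary are all assumptions made for illustration.

```python
MAX_DIFF_CHARS = 3000  # assumption: mirrors the ~3,000-character limit described above


def truncate_diff(diff_text: str, max_chars: int = MAX_DIFF_CHARS) -> str:
    """Return diff_text cut to at most max_chars, ending on a complete line."""
    if len(diff_text) <= max_chars:
        return diff_text
    cut = diff_text[:max_chars]
    # Drop the last line of the cut, which may be incomplete, so the model
    # never sees half of a diff line.
    last_newline = cut.rfind("\n")
    if last_newline > 0:
        cut = cut[:last_newline]
    return cut
```

A character budget is only a rough proxy for the ~2,048-token context window, so very dense diffs may still need a smaller `max_chars`.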
31
 
32
+ ### Training Data Format
33
+
34
+ This model was trained on the `data/train_gpt-oss-20b.jsonl` dataset in this repository. The training data uses Gemma's chat template format with the following exact structure:
35
+
36
+ **User prompt format (as seen in training data):**
37
+ ```
38
+ Generate a concise and descriptive commit message for this git diff:
39
+
40
+ ```diff
41
+ diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
42
+ index <HASH>..<HASH> 100644
43
+ --- a/src/ossos-pipeline/scripts/update_astrometry.py
44
+ +++ b/src/ossos-pipeline/scripts/update_astrometry.py
45
+ @@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
46
+ cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
47
+ cutout.zmag = new_zp
48
+
49
+ + if math.fabs(new_zp - old_zp) > 0.3:
50
+ + logging.warning("Large change in zeropoint detected: {} -> {}".format(old_zp, new_zp))
51
+ +
52
+ try:
53
+ - (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
54
+ + (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
55
+ (x, y) = cutout.get_observed_coordinates((x, y))
56
+ except:
57
+ logging.warn("Failed to do photometry.")
58
+ ```
59
+ ```
60
+
61
+ **Important:** To get the best results, match this exact format including:
62
+ - The instruction text: "Generate a concise and descriptive commit message for this git diff:"
63
+ - The double newline after the instruction
64
+ - The diff wrapped in triple backticks with `diff` language tag
65
+ - Hash placeholders shown as `<HASH>..<HASH>` in the diff headers
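
One way to reproduce this format from a raw `git diff` is sketched below. This is an illustration under assumptions: the helper names `mask_hashes` and `build_user_message` are hypothetical, and replacing the `index` line hashes with a regular expression is a guess at how the `<HASH>..<HASH>` placeholders were produced, since the preprocessing code is not shown here.

```python
import re

INSTRUCTION = "Generate a concise and descriptive commit message for this git diff:"

# Matches the hashes on git's "index <old>..<new> [mode]" header line.
_INDEX_HASHES = re.compile(r"^index [0-9a-f]+\.\.[0-9a-f]+", re.MULTILINE)


def mask_hashes(diff_text: str) -> str:
    """Replace blob hashes on index lines with the <HASH>..<HASH> placeholder."""
    return _INDEX_HASHES.sub("index <HASH>..<HASH>", diff_text)


def build_user_message(diff_text: str) -> str:
    """Instruction, a blank line, then the diff in a fence labeled `diff`."""
    fence = "`" * 3  # built programmatically to avoid nesting fences in this example
    return f"{INSTRUCTION}\n\n{fence}diff\n{mask_hashes(diff_text)}\n{fence}"
```

For a diff whose header reads `index e69de29..f4c3b4a 100644`, the message would contain `index <HASH>..<HASH> 100644` and keep the file mode untouched.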
66
+
67
+ ### Chat template (Gemma 3)
68
+ The model was trained using Gemma's chat template with the system prompt enforcing English-only responses. The conceptual structure is:
69
+
70
+ - system: "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
71
+ - user: The exact format shown above
72
+ - assistant: single-line commit message (target)
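
The three roles above map onto a plain list of message dictionaries. The sketch below uses the system prompt quoted in this section; the function name `build_messages` and the optional `commit_message` argument are hypothetical conveniences, not part of this repository.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant that generates git commit messages. "
    "Always respond in English only. Do not use any other language."
)


def build_messages(user_message, commit_message=None):
    """Build the system/user(/assistant) message list described above."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]
    # The assistant turn is present only in training examples; at inference
    # time the model is expected to produce it.
    if commit_message is not None:
        messages.append({"role": "assistant", "content": commit_message})
    return messages
```

At inference time the two-message form can be passed to the tokenizer's `apply_chat_template(..., tokenize=False, add_generation_prompt=True)`, which is what the standalone script in this README does.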
73
+
74
  ### Chat template (Gemma 3)
75
  Both training and inference use Gemma’s chat template. Conceptually:
76
 
 
98
 
99
  ## Quick usage
100
 
101
+ ### Python Script (MLX)
102
 
103
+ Here's a complete standalone script to generate commit messages using this model (the usage examples below assume it is saved as `generate_commit.py`):
104
 
105
  ```python
106
+ #!/usr/bin/env python3
+ """
+ Standalone script to generate git commit messages using the fine-tuned Gemma model.
+ Requires: mlx-lm, transformers
+ Install with: pip install mlx-lm transformers
+ """
+
+ import subprocess
+ import sys
+ from mlx_lm import load, generate
+ from mlx_lm.sample_utils import make_sampler
+ from transformers import AutoTokenizer
+
+ def get_staged_diff():
+     """Get the staged git diff from the current repository."""
+     try:
+         result = subprocess.run(
+             ['git', 'diff', '--staged', '--no-color'],
+             capture_output=True, text=True, check=True
+         )
+         return result.stdout.strip()
+     except subprocess.CalledProcessError:
+         print("Error: Could not get git diff. Make sure you're in a git repository with staged changes.")
+         return None
+
+ def format_prompt(diff_text, tokenizer):
+     """Format the diff into the exact training data format."""
+     system_prompt = "You are a helpful assistant that generates git commit messages. Always respond in English only. Do not use any other language."
+     user_message = f"Generate a concise and descriptive commit message for this git diff:\n\n```diff\n{diff_text}\n```"
+
+     # Format using Gemma chat template
+     messages = [
+         {"role": "system", "content": system_prompt},
+         {"role": "user", "content": user_message}
+     ]
+
+     prompt = tokenizer.apply_chat_template(
+         messages,
+         tokenize=False,
+         add_generation_prompt=True
+     )
+     return prompt
+
+ def generate_commit_message(diff_text, model_path="your-username/git-diff-to-commit-gemma-3-270m"):
+     """Generate a commit message from a git diff."""
+
+     # Load model and tokenizer
+     print("Loading model...")
+     model, mlx_tokenizer = load(model_path)
+     hf_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m-it")
+
+     # Format the prompt
+     prompt = format_prompt(diff_text, hf_tokenizer)
+
+     # Generate response
+     print("Generating commit message...")
+     response = generate(
+         model,
+         mlx_tokenizer,
+         prompt=prompt,
+         max_tokens=100,
+         # Recent mlx-lm releases take sampling settings through a sampler object
+         # rather than temp/top_p keyword arguments
+         sampler=make_sampler(temp=0.7, top_p=0.9),
+         verbose=False
+     )
+
+     # generate() returns only the newly generated text, so the prompt does not need to be stripped
+     generated_text = response.strip()
+
+     # Return the first non-empty line
+     lines = [line.strip() for line in generated_text.split('\n') if line.strip()]
+     return lines[0] if lines else "Unable to generate commit message"
+
+ def main():
+     """Main function - can be used with staged diff or provided diff text."""
+
+     if len(sys.argv) > 1:
+         # Use provided diff file
+         diff_file = sys.argv[1]
+         try:
+             with open(diff_file, 'r') as f:
+                 diff_text = f.read().strip()
+         except FileNotFoundError:
+             print(f"Error: File {diff_file} not found.")
+             return
+     else:
+         # Get staged diff from git
+         diff_text = get_staged_diff()
+         if not diff_text:
+             print("No staged changes found. Stage some changes with 'git add' first.")
+             return
+
+     if not diff_text:
+         print("No diff content to process.")
+         return
+
+     # Generate and print commit message
+     commit_message = generate_commit_message(diff_text)
+     print("\nSuggested commit message:")
+     print(f" {commit_message}")
+
+ if __name__ == "__main__":
+     main()
208
  ```
209
 
210
+ ### Usage Examples
211
+
212
+ 1. **Generate from staged git changes:**
213
+ ```bash
214
+ python generate_commit.py
215
+ ```
216
+
217
+ 2. **Generate from a diff file:**
218
+ ```bash
219
+ python generate_commit.py my_changes.diff
220
+ ```
221
+
222
+ 3. **Use in your own code:**
223
+ ```python
224
+ from generate_commit import generate_commit_message
225
+
226
+ diff = """diff --git a/app.py b/app.py
227
+ index e69de29..f4c3b4a 100644
228
+ --- a/app.py
229
+ +++ b/app.py
230
+ @@ -0,0 +1,3 @@
231
+ +def add(a, b):
232
+ + return a + b
233
+ """
234
+
235
+ message = generate_commit_message(diff)
236
+ print(message)
237
+ ```
238
+
239
  ## Examples
240
 
241
+ Input (user message content as formatted in training data):
242
+
243
+ ```
244
+ Generate a concise and descriptive commit message for this git diff:
245
 
246
  ```diff
247
+ diff --git a/src/ossos-pipeline/scripts/update_astrometry.py b/src/ossos-pipeline/scripts/update_astrometry.py
248
+ index <HASH>..<HASH> 100644
249
+ --- a/src/ossos-pipeline/scripts/update_astrometry.py
250
+ +++ b/src/ossos-pipeline/scripts/update_astrometry.py
251
+ @@ -159,8 +159,11 @@ def recompute_mag(mpc_in):
252
+ cutout = image_slice_downloader.download_cutout(reading, needs_apcor=True)
253
+ cutout.zmag = new_zp
254
+
255
+ + if math.fabs(new_zp - old_zp) > 0.3:
256
+ + logging.warning("Large change in zeropoint detected: {} -> {}".format(old_zp, new_zp))
257
  +
258
+ try:
259
+ - (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=old_zp)
260
+ + (x, y, mag, merr) = cutout.get_observed_magnitude(zmag=new_zp)
261
+ (x, y) = cutout.get_observed_coordinates((x, y))
262
+ except:
263
+ logging.warn("Failed to do photometry.")
264
+ ```
265
  ```
266
 
267
  Possible outputs:
268
+ - fix: use new_zp instead of old_zp for magnitude calculation and add zeropoint change warning
269
+ - fix: correct zeropoint usage in photometry and add warning for large zeropoint changes
270
+ - refactor: update magnitude calculation to use new zeropoint and add change detection
271
 
272
  ## Training summary
273
 
274
  - Base model: `google/gemma-3-270m-it` (Gemma 3, 270M, instruction-tuned).
275
  - Method: LoRA fine-tuning with MLX (`mlx_lm lora`). Prompt masking was enabled so the model learns from the assistant response.
276
+ - **Training data**: `data/train_gpt-oss-20b.jsonl` in this repository, a dataset converted to chat format, with each diff wrapped in a code fence labeled `diff` and an English, single-line commit message as the target. This dataset is Python-focused.
277
+ - Data format: Each training example uses the exact user prompt format shown above in the chat template structure.
278
  - Context/config highlights: max sequence length ~2048 tokens; diffs truncated to ~3,000 characters during preprocessing/inference to be model-friendly.
279
+ - **Important**: To achieve best results, match the exact input format used in the training data.
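
Prompt masking, mentioned in the method above, can be pictured as a per-token loss mask. The sketch below is a conceptual illustration only, not how `mlx_lm lora` implements it; the function names `response_loss_mask` and `masked_mean` are hypothetical.

```python
def response_loss_mask(prompt_len: int, total_len: int) -> list:
    """Return 0 for each prompt token and 1 for each assistant-response token."""
    if not 0 <= prompt_len <= total_len:
        raise ValueError("prompt_len must be between 0 and total_len")
    return [0] * prompt_len + [1] * (total_len - prompt_len)


def masked_mean(token_losses, mask):
    """Average per-token losses over the positions where the mask is 1."""
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0
```

With a three-token prompt and a two-token response, only the last two per-token losses contribute to the average, so the system prompt and the diff do not affect the training loss.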
280
 
281
  ## Evaluation
282
 
 
308
  ## License and credits
309
 
310
  - Base model: Google Gemma 3 (`google/gemma-3-270m-it`). Use subject to the Gemma license terms.
311
+ - Fine-tuning code: MLX and utilities in this repository. See repository license for details.