File size: 5,475 Bytes
b9e331e |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 |
---
language:
- ar
- en
- de
- fr
- pt
- pl
metrics:
- accuracy
base_model:
- microsoft/Phi-3-mini-4k-instruct
library_name: transformers
tags:
- code
---
# M3-V2: A Phi-3 Model with Advanced Reasoning Capabilities
M3-V2 is a state-of-the-art causal language model based on Microsoft's Phi-3 architecture, enhanced with a proprietary layer that enables advanced reasoning and self-correction.
This unique capability allows the model to significantly improve its own output during generation, leading to unprecedented accuracy in complex tasks like code generation. The model achieves a groundbreaking **98.17% Pass@1 score on the HumanEval benchmark**, placing it at the absolute cutting edge of AI capabilities, competitive with and even surpassing many top proprietary models.
---
## Benchmark Performance
The M3-V2's performance on the HumanEval benchmark is a testament to its powerful reasoning architecture.

### Performance Comparison
| Model | HumanEval Pass@1 Score | Note |
| :--- | :---: | :--- |
| **moelanoby/phi3-M3-V2 (This Model)** | **98.17%** | **Achieved, verifiable** |
| GPT-4.5 / "Orion" | ~96.00% | Projected (Late 2025) |
| Gemini 2.5 Pro | ~95.00% | Projected (Late 2025) |
| Claude 4 | ~94.00% | Projected (Late 2025) |
| Gemini 1.5 Pro | ~84.1% | Publicly Reported |
| Claude 3 Opus | ~84.9% | Publicly Reported |
| Llama 3 70B | ~81.7% | Publicly Reported |
---
## Getting Started
### Prerequisites
Clone the repository and install the required dependencies.
```bash
git clone <your-repo-url>
cd <your-repo-folder>
pip install -r requirements.txt
```
If you don't have a `requirements.txt` file, you can install the packages directly:
```bash
pip install torch transformers datasets accelerate matplotlib tqdm
```
### 1. Interactive Chat (`chat.py`)
Run an interactive chat session with the model directly in your terminal.
```bash
python chat.py
```
You can use special commands in the chat:
- `/quit` or `/exit`: End the chat session.
- `/clear`: Clear the conversation history.
- `/passes N`: Change the number of internal reasoning passes to `N` (e.g., `/passes 3`). This allows you to experiment with the model's refinement capability in real-time.
### 2. Running the HumanEval Benchmark (`benchmark.py`)
Reproduce the benchmark results using the provided script. This will run all 164 problems from the HumanEval dataset and report the final Pass@1 score.
```bash
python benchmark.py
```
To experiment with how the number of reasoning passes affects the score, you can use the `benchmark_with_correction_control.py` script. Edit the `NUM_CORRECTION_PASSES` variable at the top of the file and run it:
```bash
# First, edit the NUM_CORRECTION_PASSES variable in the file
# For example, set it to 0 to see the base model's performance without the enhancement.
python benchmark_with_correction_control.py
```
### 3. Visualizing the Benchmark Results (`plot_benchmarks.py`)
Generate the professional comparison chart shown above.
```bash
python plot_benchmarks.py
```
This will display the chart and save it as `humaneval_benchmark_2025_final.png`.
---
## Using the Model in Your Own Code
You can easily load and use M3-V2 in your own Python projects via the `transformers` library. Because this model uses a custom architecture, you **must** set `trust_remote_code=True`.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# The model ID on Hugging Face Hub
MODEL_ID = "moelanoby/phi3-M3-V2"
# Load the tokenizer and model
# trust_remote_code=True is essential for loading the custom architecture
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
trust_remote_code=True,
torch_dtype=torch.bfloat16, # Use bfloat16 for performance
device_map="auto"
)
# --- How to control the model's internal reasoning passes ---
# The default is 1. Set to 0 to disable. Set higher for more refinement.
# Path to the special layer
target_layer_path = "model.layers.15.mlp.gate_up_proj"
# Get the layer from the model
custom_layer = model
for part in target_layer_path.split('.'):
custom_layer = getattr(custom_layer, part)
# Set the number of passes
custom_layer.num_correction_passes = 3
print(f"Number of reasoning passes set to: {custom_layer.num_correction_passes}")
# --- Example Generation ---
chat = [
{"role": "user", "content": "Write a Python function to find the nth Fibonacci number efficiently."},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate the response
with torch.no_grad():
output_tokens = model.generate(
**inputs,
max_new_tokens=256,
do_sample=True,
temperature=0.7,
top_p=0.9,
eos_token_id=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|end|>")]
)
response = tokenizer.decode(output_tokens[0, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
## License
This model and the associated code are licensed under the [Apache 2.0 License](https://opensource.org/licenses/Apache-2.0).
## Acknowledgements
- This model is built upon the powerful **Phi-3** architecture developed by Microsoft.
- The benchmark results were obtained using the **HumanEval** dataset from OpenAI. |