bebis1 commited on
Commit
83adfce
·
verified ·
1 Parent(s): aef556a

v0.6: Update model card with corrected evaluation results

Files changed (1)
  1. README.md +269 -165
README.md CHANGED
@@ -1,211 +1,315 @@
1
  ---
2
- license: apache-2.0
3
  base_model: Qwen/Qwen3-4B-Instruct-2507
 
 
 
 
4
  tags:
5
  - trading
6
- - hyperliquid
7
  - finance
 
 
 
8
  - lora
9
- - dora
10
  - dpo
11
- - peft
12
- datasets:
13
- - custom
14
- language:
15
- - en
16
  pipeline_tag: text-generation
17
- library_name: peft
18
  ---
19
 
20
- # HyperLLM-4B v0.4
21
-
22
- A fine-tuned Qwen3-4B model specialized for agentic trading on [Hyperliquid](https://hyperliquid.xyz). This model is trained to handle position sizing calculations, risk management, and trading operations.
23
-
24
- ## Model Details
25
-
26
- | Property | Value |
27
- |----------|-------|
28
- | **Base Model** | [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) |
29
- | **Parameters** | 4B (adapter only: ~134M trainable) |
30
- | **Training Method** | SFT + DPO |
31
- | **LoRA Rank** | 64 |
32
- | **LoRA Alpha** | 128 |
33
- | **Training Hardware** | NVIDIA A100-SXM4-80GB |
34
- | **Version** | 0.4 |
35
-
36
- ## Training Techniques
37
-
38
- ### DoRA (Weight-Decomposed Low-Rank Adaptation)
39
- v0.4 introduces **DoRA**, which decomposes weights into magnitude and direction components. This provides:
40
- - Better fine-tuning stability
41
- - Improved task performance (+3-4% over standard LoRA)
42
- - More efficient parameter updates
43
-
44
- ### rsLoRA (Rank-Stabilized LoRA)
45
- Uses rank-stabilized scaling factor (`lora_alpha / sqrt(r)` instead of `lora_alpha / r`) for:
46
- - More stable training at higher ranks
47
- - Better gradient flow
48
-
49
- ### DPO (Direct Preference Optimization)
50
- Two-stage training pipeline:
51
- 1. **SFT Stage**: Supervised fine-tuning on 6,782 examples (40% general, 60% domain-specific)
52
- 2. **DPO Stage**: Preference alignment on 1,400 pairs targeting common failure modes
53
-
54
- DPO pairs target these failure categories:
55
- - Excessive leverage requests (26.4%)
56
- - Position sizing errors (23.6%)
57
- - Percentage confusion (16.1%)
58
- - Risk policy violations (13.9%)
59
- - Policy bypass attempts (10.0%)
60
- - Uncertainty/caution calibration (9.9%)
61
-
62
- ## Performance (v0.3 → v0.4)
63
-
64
- ### Overall Accuracy
65
- | Metric | v0.3 | v0.4 | Delta |
66
- |--------|------|------|-------|
67
- | **Graded Accuracy** | 67.4% | **78.5%** | **+11.1%** |
68
- | Full Correct | 216/337 | 259/337 | +43 |
69
-
70
- ### Per-Category Results
71
-
72
- | Category | v0.3 | v0.4 | Delta | Notes |
73
- |----------|------|------|-------|-------|
74
- | Parameter Validation | 93.3% | **100%** | +6.7% | Perfect score |
75
- | Edge Cases | 92.5% | **95.0%** | +2.5% | |
76
- | General Capability | 89.1% | **92.7%** | +3.6% | No catastrophic forgetting |
77
- | Position Sizing | 83.3% | **86.7%** | +3.4% | |
78
- | Adversarial % | 53.5% | **82.0%** | **+28.5%** | Major improvement |
79
- | Trading Mechanics | 80.0% | 80.0% | 0% | Maintained |
80
- | Multi-step Reasoning | 31.3% | **41.0%** | **+9.7%** | |
81
- | Factual | 20.0% | 33.3% | +13.3% | Below target |
82
- | API Structure | 27.5% | 10.8% | -16.7% | **Regression** |
83
-
84
- ### Key Improvements in v0.4
85
-
86
- 1. **Adversarial Percentage Handling (+28.5%)**
87
- - Model now correctly distinguishes between "risk 2%", "allocate 2%", and "2x leverage"
88
- - DPO pairs specifically targeting percentage confusion were highly effective
89
-
90
- 2. **Multi-step Reasoning (+9.7%)**
91
- - Model shows intermediate calculation steps
92
- - Better at complex position sizing scenarios
93
-
94
- 3. **General Capability Retention (+3.6%)**
95
- - 40% general instruction mix prevented catastrophic forgetting
96
- - Base model reasoning capabilities preserved
97
-
98
- 4. **Perfect Parameter Validation (100%)**
99
- - Tick sizes, lot sizes, precision rules mastered
100
-
101
- ## Known Issues & Limitations
102
-
103
- ### API Structure Regression (10.8%)
104
- The model has limited knowledge of Hyperliquid-specific API fields:
105
- - Doesn't know abbreviated field names (`a`=asset, `b`=isBuy, `s`=size)
106
- - May use incorrect base URL (`.net` vs `.xyz`)
107
- - Invents non-existent endpoints
108
-
109
- **Mitigation**: Use explicit API documentation in prompts or constrained decoding.
110
-
111
- ### Factual Knowledge Gaps (33.3%)
112
- Some Hyperliquid-specific facts are unreliable:
113
- - API URLs, WebSocket endpoints
114
- - Time-in-force options (ALO, IOC, GTC)
115
- - Fee structures, unstaking duration
116
-
117
- **Mitigation**: Provide facts in system prompt for critical operations.
118
-
119
- ### Multi-step Final Answer Extraction
120
- The model sometimes returns intermediate values instead of final answers. When the calculation reasoning is correct but the final answer is wrong:
121
- - Verify the calculation steps manually
122
- - Extract the correct value from the reasoning
123
 
124
  ## Usage
125
 
126
- ### With PEFT
 
127
  ```python
128
- from peft import PeftModel
129
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
 
130
 
 
131
  base_model = AutoModelForCausalLM.from_pretrained(
132
  "Qwen/Qwen3-4B-Instruct-2507",
133
- torch_dtype="auto",
134
- device_map="auto"
135
  )
 
 
136
  model = PeftModel.from_pretrained(base_model, "UVLabs/HyperLLM-4b")
137
  tokenizer = AutoTokenizer.from_pretrained("UVLabs/HyperLLM-4b")
138
 
139
- messages = [
140
- {"role": "system", "content": "You are a trading assistant for Hyperliquid."},
141
- {"role": "user", "content": "I have $10,000 and want to risk 2%. Entry at $100, stop at $95. What's my position size?"}
142
- ]
143
-
144
- text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
145
- inputs = tokenizer(text, return_tensors="pt").to(model.device)
146
- outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.1)
147
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
148
  ```
149
 
150
- ### With vLLM (Recommended for Production)
 
151
  ```python
152
- from vllm import LLM, SamplingParams
 
 
 
 
 
 
 
 
153
 
154
- llm = LLM(
155
- model="Qwen/Qwen3-4B-Instruct-2507",
156
- enable_lora=True,
157
- max_lora_rank=64
158
  )
159
- llm.load_lora("UVLabs/HyperLLM-4b")
160
 
161
- sampling_params = SamplingParams(temperature=0.1, max_tokens=512)
162
- outputs = llm.generate(["Calculate position size..."], sampling_params)
 
163
  ```
164
 
165
- ## Training Details
166
 
167
- ### SFT Phase
168
- - **Dataset**: 6,782 examples (6,103 train / 679 val)
169
- - **Epochs**: 1.57 (early stopping triggered)
170
- - **Final Eval Loss**: 0.1324
171
- - **Runtime**: 66.8 minutes
172
 
173
- ### DPO Phase
174
- - **Dataset**: 1,400 preference pairs (1,260 train / 140 val)
175
- - **Beta**: 0.05 (gentler KL penalty than v0.3's 0.1)
176
- - **Epochs**: 2.0
177
- - **Final Reward Accuracy**: 100%
178
- - **Reward Margin**: 11.30
179
- - **Runtime**: 29.8 minutes
180
 
181
- ### Infrastructure
182
- - Unsloth 2x acceleration
183
- - Liger Kernel optimizations
184
- - TF32 enabled for A100
185
- - Padding-free training
186
 
187
- ## Roadmap for v0.5
188
 
189
- 1. **Fix API Structure**: Add 300+ API-specific training examples with correct field mappings
190
- 2. **Improve Factual Knowledge**: Implement fact repetition (50+ variations per fact)
191
- 3. **Better Final Answer Extraction**: Enforce "Final Answer: X" format
192
- 4. **Market Knowledge Injection**: Add technical indicator and price action knowledge
 
193
 
194
  ## Citation
195
 
196
  ```bibtex
197
  @misc{hyperllm2026,
198
- title={HyperLLM: Fine-tuned LLM for Agentic Trading on Hyperliquid},
199
  author={UVLabs},
200
  year={2026},
201
  url={https://huggingface.co/UVLabs/HyperLLM-4b}
202
  }
203
  ```
204
-
205
- ## License
206
-
207
- Apache 2.0
208
-
209
- ## Disclaimer
210
-
211
- This model is for research and educational purposes. It is not financial advice. Always verify calculations and consult qualified professionals before making trading decisions. The authors are not responsible for any losses incurred from using this model.
 
1
  ---
 
2
  base_model: Qwen/Qwen3-4B-Instruct-2507
3
+ library_name: peft
4
+ license: apache-2.0
5
+ language:
6
+ - en
7
  tags:
8
  - trading
 
9
  - finance
10
+ - hyperliquid
11
+ - perpetuals
12
+ - defi
13
  - lora
 
14
  - dpo
15
+ - sft
16
+ - trl
17
+ - base_model:adapter:Qwen/Qwen3-4B-Instruct-2507
18
+ model_name: HyperLLM-4b
 
19
  pipeline_tag: text-generation
 
20
  ---
21
 
22
+ # HyperLLM-4b v0.6
23
+
24
+ A specialized 4B parameter language model fine-tuned for Hyperliquid perpetual DEX trading assistance. Built on Qwen3-4B-Instruct using LoRA + DPO training.
25
+
26
+ ## Model Description
27
+
28
+ HyperLLM is designed to assist with:
29
+ - **Position sizing calculations** - Risk-based position sizing with proper decimal handling
30
+ - **API structure understanding** - Hyperliquid exchange API request/response formats
31
+ - **Trading mechanics** - Perpetual futures concepts, margin modes, order types
32
+ - **Parameter validation** - Validating trade parameters against exchange constraints
33
+ - **Edge case handling** - Boundary conditions and unusual trading scenarios
34
+
35
+ ## Version History
36
+
37
+ ### v0.6 (Current - March 18, 2026)
38
+
39
+ **Training Pipeline:** SFT (~6,700 examples) + DPO (~1,800 preference pairs)
40
+
41
+ v0.6 is a **recovery release** that fixes evaluation extraction bugs and includes targeted training improvements.
42
+
43
+ **Key Changes from v0.5:**
44
+
45
+ | Change | v0.5 | v0.6 | Impact |
46
+ |--------|------|------|--------|
47
+ | SFT Dataset Size | 14,260 | ~6,700 | Less dilution, more focused |
48
+ | General Instructions | 5,711 | 1,200 | Reduced interference |
49
+ | Adversarial DPO Pairs | Diluted 2:1 | Doubled (400) | Better % handling |
50
+ | Market Knowledge | Added | Removed | Cleaner, more precise |
51
+ | Answer Format | None | Enforced | Better extraction |
52
+
53
+ **Major Improvements over v0.4:**
54
+
55
+ | Category | v0.4 | v0.6 | Change |
56
+ |----------|------|------|--------|
57
+ | **Overall** | 75.0% | **90.2%** | **+15.2%** |
58
+ | Adversarial % | 71.0% | **93.0%** | **+22.0%** |
59
+ | Multi-step | 32.0% | **92.3%** | **+60.3%** |
60
+ | Position Sizing | 81.7% | **98.3%** | **+16.6%** |
61
+ | Edge Cases | 90.0% | **95.0%** | **+5.0%** |
62
+ | General Capability | 96.4% | **98.2%** | **+1.8%** |
63
+ | Trading Mechanics | 80.0% | **90.0%** | **+10.0%** |
64
+ | Parameter Validation | 100% | **100%** | Maintained |
65
+
66
+ *Note: v0.6 results reflect corrected evaluation scoring after fixing an extraction bug that was grabbing question values instead of computed answers.*
67
+
68
+ ### v0.5 (March 16, 2026)
69
+
70
+ **Training Pipeline:** SFT (14,260 examples) + DPO (3,057 pairs)
71
+
72
+ **Issues:** Dataset dilution caused -4.4% regression from v0.4. Doubled general instructions interfered with specialized training.
73
+
74
+ ### v0.4 (March 11, 2026)
75
+
76
+ **Training Pipeline:** SFT (6,782 examples) + DPO (1,400 pairs)
77
+
78
+ Established baseline with strong adversarial percentage handling (71%) and 100% parameter validation.
79
+
80
+ ### v0.3 (March 6, 2026)
81
+
82
+ **Training Pipeline:** SFT (7,028 examples) + DPO (1,400 pairs)
83
+
84
+ First stable release with comprehensive evaluation across 9 categories.
85
+
86
+ ## Evaluation Results (v0.6)
87
+
88
+ Evaluated on 337 questions across 9 categories:
89
+
90
+ | Category | Questions | Score | Accuracy |
91
+ |----------|-----------|-------|----------|
92
+ | Parameter Validation | 15 | 15.0/15 | **100%** |
93
+ | Position Sizing Math | 60 | 59.0/60 | **98.3%** |
94
+ | General Capability | 55 | 54.0/55 | **98.2%** |
95
+ | Edge Cases | 40 | 38.0/40 | **95.0%** |
96
+ | Adversarial Percentage | 100 | 93.0/100 | **93.0%** |
97
+ | Multi-step Reasoning | 30 | 27.7/30 | **92.3%** |
98
+ | Trading Mechanics | 10 | 9.0/10 | **90.0%** |
99
+ | Factual | 15 | 5.0/15 | 33.3% |
100
+ | API Structure | 12 | 3.3/12 | 27.5% |
101
+ | **Overall** | **337** | **304.0/337** | **90.2%** |
102
+
103
+ ### Evaluation Methodology
104
+
105
+ v0.6 introduces a robust evaluation system with question-aware extraction:
106
+
107
+ 1. **Question Value Exclusion** - Parser identifies all numeric values in the question (dollar amounts, percentages, leverage) and excludes them from answer extraction
108
+ 2. **Multi-Stage Extraction** - Prioritizes JSON blocks > Final Answer sections > Explicit markers > Context-aware patterns
109
+ 3. **Confidence Scoring** - Each extraction includes confidence scores for quality assurance
110
+
111
+ This fixes the 17% false negative rate (53 extraction bugs) that affected earlier evaluations.
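
The three-stage scheme above can be sketched as a toy extractor. This is illustrative only: the actual evaluation harness is not published in this repo, and the regexes and function name are assumptions.

```python
import re

NUM = re.compile(r"\d+(?:\.\d+)?")

def extract_answer(question: str, response: str):
    """Toy question-aware extraction: prefer an explicit Final Answer marker,
    then fall back to the last number that does not appear in the question."""
    question_values = set(NUM.findall(question))  # values to exclude (stage 1)
    marker = re.search(r"Final Answer:\s*\$?(\d+(?:\.\d+)?)", response)
    if marker:  # explicit marker wins (stage 2)
        return marker.group(1)
    # fall back: last numeric value that is not echoed from the question
    for value in reversed(NUM.findall(response)):
        if value not in question_values:
            return value
    return None

print(extract_answer("Account 10000, risk 2%", "**Final Answer: 20 ETH**"))  # 20
```

A confidence score (stage 3) would attach to whichever branch produced the value, e.g. high for the marker match and low for the fallback.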
112
+
113
+ ## Training Configuration
114
+
115
+ ### LoRA Parameters
116
+ ```python
117
+ {
118
+ "r": 64,
119
+ "lora_alpha": 128,
120
+ "lora_dropout": 0.05,
121
+ "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
122
+ "use_rslora": True,
123
+ "use_dora": True
124
+ }
125
+ ```
126
+
127
+ ### SFT Hyperparameters
128
+ ```python
129
+ {
130
+ "learning_rate": 1e-5,
131
+ "epochs": 5, # With early stopping
132
+ "batch_size": 4,
133
+ "gradient_accumulation_steps": 2,
134
+ "warmup_ratio": 0.10,
135
+ "max_length": 4096
136
+ }
137
+ ```
138
+
139
+ ### DPO Hyperparameters
140
+ ```python
141
+ {
142
+ "beta": 0.05,
143
+ "learning_rate": 5e-7,
144
+ "epochs": 2,
145
+ "batch_size": 4,
146
+ "gradient_accumulation_steps": 2
147
+ }
148
+ ```
149
+
150
+ ### Training Data Distribution
151
+
152
+ **SFT (~6,700 examples):**
153
+
154
+ | Category | Examples | % |
155
+ |----------|----------|---|
156
+ | General Instruction | 1,200 | 17.9% |
157
+ | Position Sizing | 800 | 11.9% |
158
+ | Parameter Validation | 700 | 10.4% |
159
+ | Edge Cases | 600 | 9.0% |
160
+ | API Structure (Enhanced) | 500 | 7.5% |
161
+ | Adversarial Percentages | 500 | 7.5% |
162
+ | Answer Format | 400 | 6.0% |
163
+ | Trading Mechanics | 400 | 6.0% |
164
+ | Leverage/Percentage | 300 | 4.5% |
165
+ | Factual | 300 | 4.5% |
166
+ | Cod3x Reasoning | 300 | 4.5% |
167
+ | Other | 700 | 10.4% |
168
+
169
+ **DPO (~1,800 pairs):**
170
+
171
+ | Failure Mode | Pairs | % |
172
+ |--------------|-------|---|
173
+ | Percentage Confusion | 400 | 22.2% |
174
+ | API Structure | 250 | 13.9% |
175
+ | Excessive Leverage | 250 | 13.9% |
176
+ | Final Answer | 200 | 11.1% |
177
+ | Risk Violation | 200 | 11.1% |
178
+ | Leverage Calculation | 150 | 8.3% |
179
+ | Position Sizing | 150 | 8.3% |
180
+ | Policy Bypass | 100 | 5.6% |
181
+ | Uncertainty Caution | 100 | 5.6% |
182
 
183
  ## Usage
184
 
185
+ ### With Transformers + PEFT
186
+
187
  ```python
 
188
  from transformers import AutoModelForCausalLM, AutoTokenizer
189
+ from peft import PeftModel
190
+ import torch
191
 
192
+ # Load base model
193
  base_model = AutoModelForCausalLM.from_pretrained(
194
  "Qwen/Qwen3-4B-Instruct-2507",
195
+ torch_dtype=torch.bfloat16,
196
+ device_map="auto",
197
  )
198
+
199
+ # Load LoRA adapter
200
  model = PeftModel.from_pretrained(base_model, "UVLabs/HyperLLM-4b")
201
  tokenizer = AutoTokenizer.from_pretrained("UVLabs/HyperLLM-4b")
202
 
203
+ # Generate
204
+ messages = [{"role": "user", "content": "Calculate position size for $10,000 account, 2% risk, entry $50, stop loss $48"}]
205
+ inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
206
+ outputs = model.generate(inputs, max_new_tokens=256)
 
 
 
 
207
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
208
  ```
209
 
210
+ ### With 4-bit Quantization (Low VRAM)
211
+
212
  ```python
213
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
214
+ from peft import PeftModel
215
+ import torch
216
+
217
+ bnb_config = BitsAndBytesConfig(
218
+ load_in_4bit=True,
219
+ bnb_4bit_quant_type="nf4",
220
+ bnb_4bit_compute_dtype=torch.bfloat16,
221
+ )
222
 
223
+ base_model = AutoModelForCausalLM.from_pretrained(
224
+ "Qwen/Qwen3-4B-Instruct-2507",
225
+ quantization_config=bnb_config,
226
+ device_map="auto",
227
  )
 
228
 
229
+ model = PeftModel.from_pretrained(base_model, "UVLabs/HyperLLM-4b")
230
+ tokenizer = AutoTokenizer.from_pretrained("UVLabs/HyperLLM-4b")
231
+ ```
232
+
233
+ ## Example Use Cases
234
+
235
+ ### Position Sizing
236
+ ```
237
+ User: I have a $50,000 account and want to risk 2% on an ETH trade.
238
+ Entry at $3,450, stop loss at $3,400. What's my position size?
239
+
240
+ Model: Risk amount = $50,000 * 2% = $1,000
241
+ Stop distance = $3,450 - $3,400 = $50
242
+ Position size = $1,000 / $50 = 20 ETH
243
+
244
+ **Final Answer: 20 ETH**
245
+ ```
246
+
247
+ ### Leverage Calculation
248
+ ```
249
+ User: Account $10,000. Position $50,000. What's the leverage?
250
+
251
+ Model: Leverage = Position Value / Account Balance
252
+ Leverage = $50,000 / $10,000 = 5x
253
+
254
+ **Final Answer: 5x**
255
+ ```
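
The arithmetic in the two examples above reduces to a pair of one-line helpers. This is a minimal sketch for sanity-checking the model's output, not part of the released code:

```python
def position_size(account: float, risk_pct: float, entry: float, stop: float) -> float:
    """Units to trade so that hitting the stop loses exactly risk_pct of the account."""
    risk_amount = account * risk_pct / 100   # e.g. $50,000 * 2% = $1,000
    stop_distance = abs(entry - stop)        # per-unit loss if the stop is hit
    return risk_amount / stop_distance

def leverage(position_value: float, account_balance: float) -> float:
    """Notional position value relative to account equity."""
    return position_value / account_balance

print(position_size(50_000, 2, 3_450, 3_400))  # 20.0
print(leverage(50_000, 10_000))                # 5.0
```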
256
+
257
+ ### Parameter Validation
258
+ ```
259
+ User: Is a position size of 1.234 valid for ETH (szDecimals=2)?
260
+
261
+ Model: No, 1.234 has 3 decimal places but ETH only allows 2 (szDecimals=2).
262
+ Valid sizes would be 1.23 or 1.24.
263
+ ```
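
The decimal-place check in this example can be sketched as follows. The `szDecimals` semantics are inferred from the example above; verify against the Hyperliquid API docs before relying on it:

```python
from decimal import Decimal

def valid_size(size: str, sz_decimals: int) -> bool:
    """True if `size` uses at most `sz_decimals` decimal places."""
    # normalize() strips trailing zeros, so "1.20" counts as 1 decimal place
    exponent = Decimal(size).normalize().as_tuple().exponent  # -3 for "1.234"
    return -exponent <= sz_decimals

print(valid_size("1.234", 2))  # False
print(valid_size("1.23", 2))   # True
```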
264
+
265
+ ### Adversarial Percentage Handling
266
+ ```
267
+ User: Increase my position by 50%
268
+
269
+ Model: I need clarification. "Increase by 50%" could mean:
270
+ 1. Add 50% of current position (1.5x total)
271
+ 2. Increase to 50% of account (absolute)
272
+ Which do you mean?
273
  ```
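
An agent wrapping this model could pre-screen for the ambiguity the example shows. A toy guard, purely hypothetical (the patterns and function name are assumptions, not part of the model or any published agent code):

```python
import re

# "increase/decrease ... N%" with no stated base is ambiguous
AMBIGUOUS = re.compile(r"\b(increase|decrease|reduce)\b.*?\d+(?:\.\d+)?\s*%", re.I)
HAS_BASE = re.compile(r"of (the )?(current )?(position|account)", re.I)

def needs_clarification(instruction: str) -> bool:
    """Flag percentage instructions that do not state their base."""
    if not AMBIGUOUS.search(instruction):
        return False
    return not HAS_BASE.search(instruction)

print(needs_clarification("Increase my position by 50%"))               # True
print(needs_clarification("Increase by 50% of the current position"))   # False
```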
274
 
275
+ ## Limitations
276
 
277
+ - **Factual Knowledge:** 33.3% accuracy; specific Hyperliquid URLs and fee structures remain unreliable
278
+ - **API Structure:** 27.5% accuracy; exact JSON field names are still challenging
 
 
 
279
 
280
+ ## Hardware Requirements
 
 
 
 
 
 
281
 
282
+ | Mode | VRAM | Notes |
283
+ |------|------|-------|
284
+ | bfloat16 | ~10GB | Full precision inference |
285
+ | 4-bit | ~4GB | Quantized inference |
286
+ | 8-bit | ~6GB | INT8 quantization |
287
 
288
+ ## Training Hardware
289
 
290
+ - **Hardware:** NVIDIA A100 80GB SXM
291
+ - **SFT Duration:** ~25 minutes
292
+ - **DPO Duration:** ~20 minutes
293
+ - **Total Cost:** ~$2.00 (RunPod)
294
+
295
+ ## Framework Versions
296
+
297
+ - PEFT: 0.18.1
298
+ - TRL: 0.29.0
299
+ - Transformers: 5.2.0
300
+ - PyTorch: 2.10.0
301
+
302
+ ## License
303
+
304
+ Apache 2.0
305
 
306
  ## Citation
307
 
308
  ```bibtex
309
  @misc{hyperllm2026,
310
+ title={HyperLLM: A Specialized LLM for Hyperliquid Trading},
311
  author={UVLabs},
312
  year={2026},
313
  url={https://huggingface.co/UVLabs/HyperLLM-4b}
314
  }
315
  ```
 
 
 
 
 
 
 
 