Thanks!
This is actually pretty useful, thanks!
Any chance to distill Qwen 3 235B into Qwen 3 30B A3B too? 🤔
I may do this in the future; currently I'm seeing what other optimizations can be done in the distillation process to transfer more knowledge into the student model.
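For anyone curious what that looks like mechanically: the core of logit-based knowledge distillation is usually a KL-divergence loss between temperature-softened teacher and student token distributions. A minimal PyTorch sketch (not the exact training code used for this model; the shapes and names are assumptions):

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student
    token distributions, flattened over batch and sequence positions.
    Logits are assumed to have shape (batch, seq_len, vocab)."""
    t = temperature
    vocab = student_logits.size(-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1).reshape(-1, vocab)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1).reshape(-1, vocab)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t**2

# e.g. loss = distill_loss(student(batch), teacher(batch).detach())
```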
Fair enough. This was surprisingly useful. Honestly? Some things it nailed better than bigger proprietary models. Not sure how much of that intelligence comes from the distillation / mixing of the models and how much is just base model quality, but I always felt like the base model was lacking. This one seems to push it a bit further, and I'm honestly loving it so far. It's not always perfect, but when using this one I feel like I have a much bigger model in my hands than just 30B A3B, and it feels so good lol. That's why I'd love to see what you could do with the 235B model too. I feel like you have a good model recipe there! Heck, maybe you could even combine them both (the big coder + 235B, both distilled into 30B A3B)? That would be fun! 😉
I am happy you are finding success with the model! That has been the goal: get as close as possible to the large models' performance in a small base model, so you get the speed and don't have to spend $5k+ on a rig to run the larger models at 20 tk/s. A double distill would be interesting, but I think it would require distilling the 480B into the 235B and then distilling that into the 30B. I'm not sure how well that would work, but I may look into a double distill.
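In pseudocode, that double distill chains two KD runs, with the stage-1 student becoming the stage-2 teacher. A hypothetical sketch, where `run_distillation` stands in for a full KD training loop (e.g. one built around the loss sketched above):

```python
def double_distill(run_distillation, m480b, m235b, m30b, data):
    # Stage 1: distill the 480B teacher into the 235B student.
    mid_model = run_distillation(teacher=m480b, student=m235b, data=data)
    # Stage 2: reuse the distilled 235B as the teacher for the 30B A3B.
    return run_distillation(teacher=mid_model, student=m30b, data=data)
```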
If you downloaded the Q8 quant, please re-download; the original one I uploaded was the wrong model. I have uploaded the correct one, which will perform much better.
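For anyone who grabbed the bad file: one way to force a fresh copy past the local cache is huggingface_hub's `force_download` flag (the repo id and filename below are placeholders, not the real ones):

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="someone/Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2",  # placeholder repo id
    filename="Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2-Q8_0.gguf",  # placeholder filename
    force_download=True,  # bypass the cached copy of the old (wrong) file
)
print(path)
```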
Thanks for the heads up, but I'm only running Q4_K_S; I think that's as high as I can go with my current hardware. I could probably do Q4_K_M, but I guess going higher would significantly slow it down for very little extra quality.
Still, the Q4_K_S was able to one-shot an actually playable Pac-Man clone. That's something you can't do even with some proprietary models!
The funny thing is that the game had a couple of small issues, mostly visual ones, but those couldn't be properly fixed even by Gemini 2.5 Pro, Claude 4.1 Opus, or GPT-5 High, which are currently considered the top proprietary models. This made me believe that this little 30B A3B model is REALLY pushing to be the best it could possibly be, even at Q4_K_S. Still, I wonder if the Q8 could fix it, but again, I can't run that high a quant level myself at the moment.
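For a rough sense of that trade-off, weight memory scales with bits per weight. A back-of-the-envelope estimate, using approximate llama.cpp bits-per-weight figures rather than exact sizes for this model:

```python
# Approximate llama.cpp bits-per-weight per quant type; real GGUF
# sizes vary somewhat with the tensor mix.
BPW = {"Q4_K_S": 4.6, "Q4_K_M": 4.9, "Q6_K": 6.6, "Q8_0": 8.5}
PARAMS = 30.5e9  # Qwen3 30B A3B total parameter count (roughly)

for name, bpw in BPW.items():
    gib = PARAMS * bpw / 8 / 2**30
    print(f"{name}: ~{gib:.0f} GiB of weights, plus KV cache and overhead")
```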
Another thing that helps is the quant process. I haven't looked into the 'why' of it, but MXFP4 really helps with coherence and memory usage.
@lovedheart made an MXFP4 version of Qwen3 32B that I can run at decent speeds on an 8GB VRAM + 32GB RAM system.
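For context, MXFP4 is the OCP microscaling FP4 format: blocks of 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 code, which works out to roughly 4.25 bits per weight. A rough NumPy simulation of the round-trip (an illustration of the format, not any engine's actual implementation):

```python
import numpy as np

# The eight non-negative magnitudes representable in FP4 E2M1.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_block(x):
    """Fake-quantize one block of 32 values to MXFP4 and back."""
    amax = np.abs(x).max()
    if amax == 0.0:
        return np.zeros_like(x)
    # Power-of-two shared scale so the largest magnitude lands at or just
    # below 6.0, the top of the E2M1 range (the OCP spec derives the
    # exponent slightly differently and tolerates some clipping).
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
    scaled = x / scale
    # Snap each magnitude to the nearest FP4 level, keeping the sign.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_LEVELS).argmin(axis=1)
    return np.sign(scaled) * FP4_LEVELS[idx] * scale

block = np.random.randn(32)
print(np.abs(block - mxfp4_block(block)).max())  # worst-case error in this block
```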
Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2: I am running your Q8 and it is a beast. It surprised me so much I came looking for this comments section so I could tell you. I am not just talking about coding but writing and other tasks. The model performs like a much larger model. I have the full Qwen3 Coder 480B at Q6 on another machine, so I am able to compare. It has a can-do attitude, excellent composition, and a low code error rate. It exhibits a startling degree of "initiative" and intelligence in the way it tackles problems, accepts suggestions, or creates pitch deck slides. Very impressive!
Thank you, I appreciate the kind words!