The Qwen3-235B-A22B model is not as effective as the Qwen3-32B model.

#35
by czqqq - opened

I evaluated the models in the Qwen3 series on HumanEval and found that Qwen3-32B performs better than the much larger 235B model, with the non-thinking results even beating the thinking results, so I suspect benchmark contamination.
I then ran evaluations with LiveCodeBench, and again Qwen3-32B comes out on top, while Qwen3-235B-A22B falls well short of expectations.
The evaluation results from LiveCodeBench are as follows:
Evaluation Period: 2025-02-01 ~ 2025-05-01 (131 data points)

| model | enable_thinking | Pass@1 |
| --- | --- | --- |
| qwen3-235B-A22B | False | 0.274809160305 |
| qwen3-30B-A3B | False | 0.320610687022 |
| qwen3-32B | False | 0.358778625954 |
| qwen3-235B-A22B | True | 0.297709923664 |
| qwen3-30B-A3B | True | 0.335877862595 |
| qwen3-32B | True | 0.335877862595 |
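For reference, with one generated sample per problem Pass@1 is simply the fraction of the 131 problems whose sample passes, and the table's values are consistent with exact fractions out of 131 (e.g. 47/131 ≈ 0.3588 for qwen3-32B non-thinking). A minimal sketch of the standard unbiased pass@k estimator from the HumanEval paper, which reduces to that fraction when n = 1:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., HumanEval):
    n generated samples per problem, c of which pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = 1 sample per problem, pass@1 is just the pass fraction:
# 47 of 131 problems solved gives 47/131 = 0.358778625954...
scores = [pass_at_k(1, c, 1) for c in [1] * 47 + [0] * 84]
print(sum(scores) / len(scores))
```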

I strongly suspect there is an issue with the weights on Hugging Face: when I query the official Qwen API directly, Qwen3-235B-A22B clearly performs well. Another possibility is that the latest vLLM release, which I used to deploy the model on 8 × A100, does not yet support it properly. However, I deployed it strictly with the parameters recommended by the Qwen team.
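To make the two rows of the table reproducible, here is a minimal client-side sketch of how I understand the per-request thinking toggle reaches a vLLM OpenAI-compatible endpoint. The `chat_template_kwargs` passthrough and the mode-specific sampling values follow the Qwen3 model card's recommendations, but both are assumptions to verify against your vLLM version:

```python
def build_request(prompt: str, enable_thinking: bool) -> dict:
    """Build a /v1/chat/completions payload for a Qwen3 model served
    by vLLM, switching thinking mode per request (assumed API shape)."""
    return {
        "model": "Qwen/Qwen3-235B-A22B",
        "messages": [{"role": "user", "content": prompt}],
        # Recommended Qwen3 sampling differs by mode (per the model card):
        # thinking: temperature 0.6 / top_p 0.95; non-thinking: 0.7 / 0.8.
        "temperature": 0.6 if enable_thinking else 0.7,
        "top_p": 0.95 if enable_thinking else 0.8,
        # vLLM forwards this dict into the chat template, where Qwen3's
        # template reads the enable_thinking flag.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
```

Posting the payload to the server (e.g. via the `openai` client's `extra_body`) then yields one row of the table per `enable_thinking` setting.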
