The Qwen3-235B-A22B model is not as effective as the Qwen3-32B model.
I evaluated the Qwen3 series on HumanEval and found that Qwen3-32B outperforms the much larger 235B model, with the non-thinking results even beating the thinking results, so I suspected the HumanEval dataset might be contaminated.
I therefore re-ran the evaluation on LiveCodeBench. The results again show Qwen3-32B ahead, while Qwen3-235B-A22B performs far below expectations.
The LiveCodeBench results are as follows:
Evaluation period: 2025-02-01 ~ 2025-05-01 (131 problems)
| Model | enable_thinking | Pass@1 |
|---|---|---|
| qwen3-235B-A22B | False | 0.274809160305 |
| qwen3-30B-A3B | False | 0.320610687022 |
| qwen3-32B | False | 0.358778625954 |
| qwen3-235B-A22B | True | 0.297709923664 |
| qwen3-30B-A3B | True | 0.335877862595 |
| qwen3-32B | True | 0.335877862595 |
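For clarity on how these numbers can be read: they all look like exact multiples of 1/131, which suggests a single generated sample per problem, so Pass@1 reduces to solved problems divided by 131. A minimal sketch under that assumption (the helper name is mine, not from any benchmark code):

```python
# Sketch of the scoring step, assuming one generated sample per problem.
# The reported values are consistent with this, e.g. 36/131 == 0.274809...
def pass_at_1(results: list[bool]) -> float:
    """results[i] is True if the single sample for problem i passed all tests."""
    return sum(results) / len(results)

# Example: 36 solved problems out of 131 reproduces the 235B non-thinking score.
print(pass_at_1([True] * 36 + [False] * 95))  # 0.2748091603053435
```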
I strongly suspect there is an issue with the weights on Hugging Face: when I query the official Qwen API directly, Qwen3-235B-A22B clearly performs well. Another possibility is that the latest version of the vLLM framework (which I used to deploy the model on 8 × A100 GPUs) does not support this model well yet, although I deployed it strictly with the parameters recommended by the Qwen team.
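For reference, this is roughly how I understand the intended setup; it is a sketch, not my exact scripts. The serve command, endpoint URL, and `max_tokens` value are assumptions, and the sampling parameters follow the Qwen3 model card recommendations (thinking: temperature=0.6, top_p=0.95; non-thinking: temperature=0.7, top_p=0.8). The `enable_thinking` flag is passed through vLLM's OpenAI-compatible server via `chat_template_kwargs`; the exact pass-through key may depend on the vLLM version.

```python
# Assumed deployment (8 x A100):
#   vllm serve Qwen/Qwen3-235B-A22B --tensor-parallel-size 8
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def generate(prompt: str, enable_thinking: bool) -> str:
    # Sampling parameters per the Qwen3 model card recommendations.
    temperature, top_p = (0.6, 0.95) if enable_thinking else (0.7, 0.8)
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
        max_tokens=8192,
        # Toggle thinking mode through the chat template.
        extra_body={"chat_template_kwargs": {"enable_thinking": enable_thinking}},
    )
    return resp.choices[0].message.content
```

If the issue is with the published weights rather than the serving stack, a setup like this should still reproduce the gap between the hosted API and the self-hosted model.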