The Qwen3-235B-A22B model is not as effective as the Qwen3-32B model.
I evaluated the Qwen3 series on HumanEval and found that Qwen3-32B outperforms the much larger 235B model, with the non-thinking results even beating the thinking results, so I suspected the HumanEval dataset might be contaminated.
I therefore re-ran the evaluation on LiveCodeBench. The results again show Qwen3-32B ahead, while Qwen3-235B-A22B performs far below expectations.
The LiveCodeBench results are as follows:
Evaluation period: 2025-02-01 ~ 2025-05-01 (131 problems)
| Model | enable_thinking | Pass@1 |
|---|---|---|
| qwen3-235B-A22B | False | 0.274809160305 |
| qwen3-30B-A3B | False | 0.320610687022 |
| qwen3-32B | False | 0.358778625954 |
| qwen3-235B-A22B | True | 0.297709923664 |
| qwen3-30B-A3B | True | 0.335877862595 |
| qwen3-32B | True | 0.335877862595 |
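For clarity on how these numbers can be read: they all look like exact multiples of 1/131, which suggests a single generated sample per problem, so Pass@1 reduces to solved problems divided by 131. A minimal sketch under that assumption (the helper name is mine, not from any benchmark code):

```python
# Sketch of the scoring step, assuming one generated sample per problem.
# The reported values are consistent with this, e.g. 36/131 == 0.274809...
def pass_at_1(results: list[bool]) -> float:
    """results[i] is True if the single sample for problem i passed all tests."""
    return sum(results) / len(results)

# Example: 36 solved problems out of 131 reproduces the 235B non-thinking score.
print(pass_at_1([True] * 36 + [False] * 95))  # 0.2748091603053435
```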
I strongly suspect there is an issue with the weights on Hugging Face: when I query the official Qwen API directly, Qwen3-235B-A22B clearly performs well. Another possibility is that the latest version of the vLLM framework (which I used to deploy the model on 8 × A100 GPUs) does not support this model well yet, although I deployed it strictly with the parameters recommended by the Qwen team.
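For reference, this is roughly how I understand the intended setup; it is a sketch, not my exact scripts. The serve command, endpoint URL, and `max_tokens` value are assumptions, and the sampling parameters follow the Qwen3 model card recommendations (thinking: temperature=0.6, top_p=0.95; non-thinking: temperature=0.7, top_p=0.8). The `enable_thinking` flag is passed through vLLM's OpenAI-compatible server via `chat_template_kwargs`; the exact pass-through key may depend on the vLLM version.

```python
# Assumed deployment (8 x A100):
#   vllm serve Qwen/Qwen3-235B-A22B --tensor-parallel-size 8
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def generate(prompt: str, enable_thinking: bool) -> str:
    # Sampling parameters per the Qwen3 model card recommendations.
    temperature, top_p = (0.6, 0.95) if enable_thinking else (0.7, 0.8)
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
        max_tokens=8192,
        # Toggle thinking mode through the chat template.
        extra_body={"chat_template_kwargs": {"enable_thinking": enable_thinking}},
    )
    return resp.choices[0].message.content
```

If the issue is with the published weights rather than the serving stack, a setup like this should still reproduce the gap between the hosted API and the self-hosted model.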