MATH-Perturb

Benchmarking LLMs' Math Reasoning Abilities
against Hard Perturbations

Kaixuan Huang1, Jiacheng Guo1, Zihao Li1, Xiang Ji1, Jiawei Ge1, Wenzhe Li1, Yingqing Guo1,
Tianle Cai1, Hui Yuan1, Runzhe Wang1, Yue Wu1, Ming Yin1, Shange Tang1,
Yangsibo Huang2, Chi Jin1, Xinyun Chen2, Chiyuan Zhang2, Mengdi Wang1
1Princeton University, 2Google
overview

Left: The overview of MATH-Perturb Benchmark. Right: An example of the original problem, its simple perturbation, its hard perturbation, and the corresponding model responses that overfit the short-cut solution. The simple perturbation to the problem is non-essential, so the modified problem can be solved using the same method as the original problem. The hard perturbation changes the problem fundamentally and it requires more difficult problem-solving skills. The shortcut solution can solve the original problem and its simple perturbation but fails on the hard perturbation.

Introduction

Large language models have demonstrated impressive performance on challenging mathematical reasoning tasks, which has triggered the discussion of whether the performance is achieved by true reasoning capability or memorization. To investigate this question, we construct MATH-P-Simple and MATH-P-Hard, each consisting of 279 perturbed math problems derived from level-5 (hardest) problems in the MATH dataset (Hendrycks et. al., 2021):

  • for MATH-P-Simple, we make simple perturbations, i.e., non-essential modifications to the problem, ensuring that the modified problem can be solved using the same method as the original problem.
  • for MATH-P-Hard, we make hard perturbations, i.e., small but fundamental modifications to the problem so that the modified problem cannot be solved using the same method as the original problem. Instead, it requires deeper math understanding and harder problem-solving skills.

performance We observe significant performance drops on MATH-P-Hard across various models. We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to modified contexts. This issue is amplified when using original problems for in-context learning. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models.

Leaderboard

Note: The results are averaged among 3 indepedent runs. For DeepSeek-R1 series, we use the suggested configuration (temperature=0.6, top_p=0.95) and set the max length to 65536 (64k) tokens.

Citation

@article{huang2025math,
        title={{MATH-Perturb}: Benchmarking {LLMs}' Math Reasoning Abilities against Hard Perturbations},
        author={Kaixuan Huang and Jiacheng Guo and Zihao Li and Xiang Ji and Jiawei Ge and Wenzhe Li and Yingqing Guo and Tianle Cai and Hui Yuan and Runzhe Wang and Yue Wu and Ming Yin and Shange Tang and Yangsibo Huang and Chi Jin and Xinyun Chen and Chiyuan Zhang and Mengdi Wang},
        journal={arXiv preprint arXiv:2502.06453},
        year={2025}
      }