MATH-Perturb

Benchmarking LLMs' Math Reasoning Abilities
against Hard Perturbations

Kaixuan Huang¹, Jiacheng Guo¹, Zihao Li¹, Xiang Ji¹, Jiawei Ge¹, Wenzhe Li¹, Yingqing Guo¹,
Tianle Cai¹, Hui Yuan¹, Runzhe Wang¹, Yue Wu¹, Ming Yin¹, Shange Tang¹,
Yangsibo Huang², Chi Jin¹, Xinyun Chen², Chiyuan Zhang², Mengdi Wang¹

¹Princeton University, ²Google
Accepted at ICML 2025


Left: An overview of the MATH-Perturb benchmark. Right: An example of an original problem, its simple perturbation, its hard perturbation, and the corresponding model responses that overfit to the shortcut solution. The simple perturbation is non-essential, so the modified problem can be solved using the same method as the original problem. The hard perturbation changes the problem fundamentally and requires more difficult problem-solving skills. The shortcut solution solves the original problem and its simple perturbation but fails on the hard perturbation.

Introduction

Large language models have demonstrated impressive performance on challenging mathematical reasoning tasks, which has prompted debate over whether this performance comes from true reasoning capability or from memorization. To investigate this question, we construct MATH-P-Simple and MATH-P-Hard, each consisting of 279 perturbed math problems derived from level-5 (hardest) problems in the MATH dataset (Hendrycks et al., 2021):

For MATH-P-Simple, we make simple perturbations, i.e., non-essential modifications to the problem, ensuring that the modified problem can be solved using the same method as the original problem.
For MATH-P-Hard, we make hard perturbations, i.e., small but fundamental modifications to the problem, so that the modified problem cannot be solved using the same method as the original problem. Instead, it requires deeper mathematical understanding and harder problem-solving skills.

We observe significant performance drops on MATH-P-Hard across various models. We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to modified contexts. This issue is amplified when using original problems for in-context learning. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models.
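To make the notion of a performance drop concrete, the sketch below compares accuracy on original problems with accuracy on their perturbed counterparts. It assumes per-problem correctness is available as booleans and reports the drop in percentage points; the function names and the toy outcomes are hypothetical placeholders, not the benchmark's evaluation code or results.

```python
def accuracy(results):
    """Return the fraction of problems answered correctly.

    `results` is a list of booleans, one per problem (an assumed format).
    """
    if not results:
        raise ValueError("results must be non-empty")
    return sum(results) / len(results)


def performance_drop(original_results, perturbed_results):
    """Accuracy on original problems minus accuracy on perturbed ones,
    expressed in percentage points."""
    return 100.0 * (accuracy(original_results) - accuracy(perturbed_results))


# Toy placeholder outcomes for illustration only; these are NOT benchmark results.
original = [True, True, True, False]
hard = [True, False, False, False]

drop = performance_drop(original, hard)
print(f"Drop on the hard split: {drop:.1f} percentage points")
```

A positive value means the model does worse on the perturbed split than on the original problems.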

Leaderboard

Note: For the DeepSeek-R1 series, we use the suggested configuration (temperature=0.6, top_p=0.95) and set the max length to 65536 (64k) tokens. For QwQ-32B, we adopt a max length of 32768 (32k) tokens with temperature=0.6, top_k=40, top_p=0.95. For Claude-3.7-Sonnet in extended thinking mode, we use a thinking budget of 56000 tokens and max tokens of 64000.
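The settings in the note can be collected into one lookup table, which may help when reproducing the runs. This is a minimal sketch: the values are copied from the note, but the dictionary layout, the key names, and the helper `length_in_k` are illustrative assumptions and do not correspond to any particular inference library's API.

```python
# Decoding settings quoted in the leaderboard note.
# The structure and key names here are assumptions made for illustration.
GENERATION_CONFIGS = {
    "DeepSeek-R1 series": {
        "temperature": 0.6,
        "top_p": 0.95,
        "max_length": 65536,
    },
    "QwQ-32B": {
        "temperature": 0.6,
        "top_k": 40,
        "top_p": 0.95,
        "max_length": 32768,
    },
    "Claude-3.7-Sonnet (extended thinking)": {
        "thinking_budget_tokens": 56000,
        "max_tokens": 64000,
    },
}


def length_in_k(num_tokens):
    """Convert a token count to units of 1024 tokens (hypothetical helper)."""
    return num_tokens // 1024


for name, config in GENERATION_CONFIGS.items():
    print(name, config)
```

With this layout, `length_in_k(GENERATION_CONFIGS["QwQ-32B"]["max_length"])` gives 32, matching the "32k" shorthand used in the note.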
