Large language models have demonstrated impressive performance on challenging mathematical reasoning tasks, prompting debate over whether this performance reflects genuine reasoning ability or memorization. To investigate this question, we construct MATH-P-Simple and MATH-P-Hard, each consisting of 279 perturbed math problems derived from level-5 (hardest) problems in the MATH dataset (Hendrycks et al., 2021):
- for MATH-P-Simple, we make simple perturbations, i.e., non-essential modifications that preserve the original solution method, so the modified problem can be solved in the same way as the original.
- for MATH-P-Hard, we make hard perturbations, i.e., small but fundamental modifications under which the original solution method no longer applies, so that solving the modified problem requires deeper mathematical understanding and stronger problem-solving skills. A hypothetical illustration of both perturbation types follows this list.
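As a purely hypothetical illustration (not an item from either benchmark): take the original problem "find the maximum value of f(x) = -x^2 + 4x + 1", solved by completing the square, f(x) = -(x-2)^2 + 5, giving a maximum of 5. A simple perturbation might change the coefficients to f(x) = -x^2 + 6x + 2, where the same method yields -(x-3)^2 + 11 and a maximum of 11. A hard perturbation might instead restrict the domain to x >= 5: the vertex now lies outside the domain, completing the square no longer finishes the problem, and one must argue from monotonicity that the maximum is f(5) = -4.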
We observe significant performance drops on MATH-P-Hard across various models. We also raise concerns about a novel form of memorization, in which models blindly apply learned problem-solving methods without assessing whether they still apply in the modified context. This issue is amplified when the original problems are supplied as in-context examples, a setting sketched in the code below. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models.
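To make the evaluation protocol concrete, here is a minimal sketch of how one might measure accuracy on the perturbed problems, with and without the original problem supplied as an in-context exemplar. Everything here is a hypothetical placeholder, not the authors' released harness: the `query_model` stub, the JSON field names, the file `math_p_hard.json`, and the naive string-match grading.

```python
import json

def query_model(prompt: str) -> str:
    # Placeholder: replace with a real LLM API call.
    return ""

def answer_matches(response: str, gold: str) -> bool:
    # Naive grading: check whether the gold final answer string
    # appears verbatim in the model's response.
    return gold.strip() != "" and gold.strip() in response

def evaluate(problems: list[dict], use_original_as_icl: bool = False) -> float:
    # Accuracy on the perturbed problems; optionally prepend the
    # original problem and its solution as an in-context exemplar,
    # the setting the abstract notes can amplify blind method reuse.
    correct = 0
    for p in problems:
        prompt = ""
        if use_original_as_icl:
            prompt += (f"Problem: {p['original_problem']}\n"
                       f"Solution: {p['original_solution']}\n\n")
        prompt += f"Problem: {p['perturbed_problem']}\nSolution:"
        if answer_matches(query_model(prompt), p["perturbed_answer"]):
            correct += 1
    return correct / len(problems)

if __name__ == "__main__":
    # Assumed file layout: a JSON list of records with the four
    # field names used above (all hypothetical).
    with open("math_p_hard.json") as f:
        data = json.load(f)
    print(f"MATH-P-Hard accuracy: {evaluate(data):.3f}")
    print(f"With original-problem ICL: {evaluate(data, use_original_as_icl=True):.3f}")
```

Comparing the two printed numbers would expose the effect described above: if accuracy drops further when the original problem and its solution are in context, the model is plausibly reusing the memorized method rather than re-deriving one suited to the perturbed problem.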