Basic Arithmetic Properties in the Space of Language Model Prompts
Large pre-trained Language Models (LLMs), which can effectively exploit enormous amounts of unlabeled textual data, have recently transformed the field of Natural Language Processing. Through prompting techniques enabled by their in-context learning capabilities, LLMs have been shown to perform on par with dedicated models trained for downstream tasks. One such task is numerical reasoning and, in particular, the ability to carry out basic arithmetic operations. The question we wish to answer is whether the basic properties of arithmetic operations, such as the commutative property, hold in the space of LLM prompts: does asking the LLM to compute 13 + 37 vs. 37 + 13 result, on average, in the same outcome? In contrast to most previous works, which report accuracy only, we take a closer look (MAE, Pearson's r) at the error distribution to better understand performance with respect to prompt perturbations and scaling laws.
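To make the commutativity probe concrete, the sketch below shows one way such an evaluation could be run: query a model with both operand orders, then compare the error distributions via MAE and Pearson's r. This is a minimal illustration, not the paper's actual protocol; `query_llm` is a hypothetical placeholder for whatever API returns the model's numeric answer, and the prompt template and operand range are assumptions.

```python
# Minimal sketch of a commutativity probe for LLM arithmetic.
# `query_llm` is a hypothetical stand-in: replace it with a real model call
# that parses the model's answer into a number.
import random
from scipy.stats import pearsonr


def query_llm(prompt: str) -> float:
    raise NotImplementedError("replace with a real model call")


def commutativity_probe(n_pairs: int = 100, lo: int = 0, hi: int = 99):
    errs_ab, errs_ba = [], []
    for _ in range(n_pairs):
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        truth = a + b
        # Same question in both operand orders (e.g., 13 + 37 vs. 37 + 13).
        errs_ab.append(query_llm(f"What is {a} + {b}?") - truth)
        errs_ba.append(query_llm(f"What is {b} + {a}?") - truth)
    mae_ab = sum(abs(e) for e in errs_ab) / n_pairs
    mae_ba = sum(abs(e) for e in errs_ba) / n_pairs
    # Correlation between the two error sequences: high r means the model
    # makes similar mistakes regardless of operand order.
    r, _ = pearsonr(errs_ab, errs_ba)
    return mae_ab, mae_ba, r
```

Under this setup, commutativity holding "on average" would correspond to near-identical MAE for the two operand orders, while Pearson's r additionally captures whether the per-example errors track each other.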