That's also an interesting one. I think I found it, though it's a little different from what I remembered: it's not just the scientific notation that helps, but rather the addition of positional tokens:
". In particular, the model fails to learn addition of five-digit numbers when
using subwords (e.g., “32”), and it struggles to learn with character-level representations (e.g., “3 2”). By introducing position tokens (e.g., “3 10e1 2”), the model learns to accurately add and subtract numbers up to 60 digits. We conclude
that modern pretrained language models can easily learn arithmetic from very
few examples, as long as we use the proper surface representation."
It might be http://nlg.csie.ntu.edu.tw/~cjchen/papers/eacl2023.pdf
They show an increase from 65% to 70% on their "comparing numbers" benchmark.
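For anyone curious what that surface representation looks like in practice, here's a minimal sketch of the position-token encoding, inferred from the "3 10e1 2" example in the quoted abstract (I'm assuming the units digit is left bare and negative numbers just get a leading minus token; the paper's exact tokenization may differ):

    def to_position_tokens(n: int) -> str:
        # Interleave each digit with a "10e<power>" position token,
        # leaving the units digit bare, e.g. 32 -> "3 10e1 2".
        digits = str(abs(n))
        parts = []
        for i, d in enumerate(digits):
            power = len(digits) - 1 - i
            parts.append(d)
            if power > 0:
                parts.append(f"10e{power}")
        return ("- " if n < 0 else "") + " ".join(parts)

    print(to_position_tokens(32))     # 3 10e1 2
    print(to_position_tokens(60417))  # 6 10e4 0 10e3 4 10e2 1 10e1 7

The point of spelling each place value out as its own token is that the model no longer has to infer a digit's magnitude from its position in an arbitrarily chunked subword string.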