
They did not test on the data that they trained on; that's not what he wrote.


They synthetically generated 290k examples and kept 10k of them for testing.

It's worth pointing out that this is technically not testing on the training set, but given how similar the examples in the dataset are, severe overfitting would be unavoidable. That also makes the headline very misleading.
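The problem with a random holdout from synthetic data can be sketched in a few lines. This is a hypothetical illustration, not the actual pipeline from the article: if the 290k examples are generated from a small pool of templates, a random 10k test split shares templates with the training set almost surely, so the "held-out" evaluation measures memorization, not generalization.

```python
import random

# Hypothetical setup: synthetic examples drawn from a small pool of
# templates, mimicking a generator with limited structural variety.
random.seed(0)
templates = [f"Invoice {{n}}, total {{t}}, layout variant {i}" for i in range(50)]

# 290k examples, as in the setup described above; only the filler varies.
examples = [(random.choice(templates), random.randint(0, 10**6))
            for _ in range(290_000)]

# Random split: 10k "held-out" test examples, 280k training examples.
random.shuffle(examples)
test, train = examples[:10_000], examples[10_000:]

# Measure leakage: test examples whose template also appears in training.
train_templates = {tpl for tpl, _ in train}
leaked = sum(tpl in train_templates for tpl, _ in test)
print(f"{leaked / len(test):.1%} of test examples share a template with training data")
```

With 280k training examples covering only 50 templates, effectively 100% of the test set is a near-duplicate of something seen in training, which is why the reported test numbers say little about performance on genuinely new documents.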

The weights may not have been published because using the model for document extraction on even the same format, but with slightly different content or lengths, would show how abysmally this finetune performs outside of the synthetic data.


Thanks, rereading it makes it clear that you are correct.



