Solved for this benchmark... and at what cost to the rest of the system?
These tasks are interesting because they're existence proofs of generalization failure. Like the haystack problem, direct solutions here are much less interesting than structural improvements that address the class of failure.
These tasks are interesting because they're existence proofs of generalization failure. Like the haystack problem, direct solutions here are much less interesting than structural improvements that address the class of failure.