Have any benchmarks been built on this paper’s definition? I follow the ARC Prize and Humanity’s Last Exam, but I don’t know how closely they map to this paper’s methods.
Edit: Probably not, since it was published less than a week ago :-) I’ll be watching for benchmarks.