Have any benchmarks been built on this paper’s definition? I follow the ARC Prize and Humanity’s Last Exam, but I don’t know how closely they map to this paper’s methods.

Edit: Probably not, since it was published less than a week ago :-) I’ll be watching for benchmarks.