Have any benchmarks been made that use this paper’s definition? I follow the ARC prize and Humanity’s Last Exam, but I don’t know how closely they would map to this paper’s methods.

Edit: Probably not, since it was published less than a week ago :-) I’ll be watching for benchmarks.


