
Note that Claude 2 scores 71.2% zero-shot on the Python coding benchmark HumanEval, which is better than GPT-4's 67.0%. Is there already real-world experience with its programming performance?


GPT-4's reproducible performance in the wild appears to be much higher than 67%. Testing from 3/15 (presumably on the 0314 model) puts it at 85.36% (https://twitter.com/amanrsanger/status/1635751764577361921), and the paper linked in my post (https://doi.org/10.48550/arXiv.2305.01210) reports a pass@1 of 88.4 for GPT-4 more recently (May? June?).
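For anyone comparing these numbers: pass@1 (and pass@k generally) is usually computed with the unbiased estimator from the original HumanEval paper, where n samples are generated per problem and c of them pass the tests. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    pass@k = 1 - C(n - c, k) / C(n, k)
    n: total samples generated per problem
    c: number of samples that pass the unit tests
    """
    if n - c < k:
        # Every size-k draw must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k=1 this reduces to c/n, so single-sample pass@1 scores are directly comparable across models as long as the prompting and test harness match, which is often where reported numbers diverge.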


I have found just using it in the web interface comparable to OpenAI's. But the context window makes a huge difference: I can dump a lot more files in (entire schema, sample records, etc.).



