
It's definitely not the case for me. I have models trained on the same 14 MB dataset (though I needed to tweak more for the 1.5B).

The 1.5B outperforms it here if trained long enough; in this case 1-2 months, since I was doing it all for free on Colab.

One of the big things was batching: it seems like nobody really tries to use larger batches with the biggest models, and without batching, given how little data I have, the model was getting stuck.
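By batching I mean gradient accumulation, roughly along these lines (the model name, batch sizes, and learning rate below are illustrative, not my exact settings; the loader is assumed to yield pre-tokenized batches with "input_ids"):

    import torch
    from transformers import GPT2LMHeadModel

    # Illustrative settings, not my actual ones.
    micro_batch_size = 2   # what fits on the Colab GPU at once
    accum_steps = 32       # effective batch = micro_batch_size * accum_steps

    model = GPT2LMHeadModel.from_pretrained("gpt2-xl").cuda()   # 1.5B
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    def train_epoch(loader):
        model.train()
        optimizer.zero_grad()
        for i, batch in enumerate(loader):
            input_ids = batch["input_ids"].cuda()
            # Causal LM loss: labels are the inputs, shifted internally.
            loss = model(input_ids, labels=input_ids).loss / accum_steps
            loss.backward()   # gradients accumulate across micro-batches
            if (i + 1) % accum_steps == 0:
                optimizer.step()       # one "step" = accum_steps micro-batches
                optimizer.zero_grad()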



You trained (finetuned) GPT-2 for 1-2 months on 14 MB of data?

I don't understand how this doesn't massively overfit. How much of those 1-2 months was the model actually training?


I train for maybe ~12 hours a day, though some days, especially around Christmas, I didn't. I also lost a lot of days trying out different things, or when the weights didn't save to Drive before the Colab session timed out.
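The Drive saving is just the usual mount-and-checkpoint pattern, roughly like this (the path and the save interval are placeholders):

    from google.colab import drive
    drive.mount("/content/drive")

    save_every = 200                                  # steps; illustrative
    ckpt_dir = "/content/drive/MyDrive/gpt2_ckpts"    # illustrative path

    def maybe_save(step, model, tokenizer):
        # Write a full checkpoint to Drive so a timeout only loses
        # the work done since the last save.
        if step % save_every == 0:
            out = f"{ckpt_dir}/step_{step}"
            model.save_pretrained(out)
            tokenizer.save_pretrained(out)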

Having said that, I was training the full model with an accumulated batch size for a while, so it was taking >10 minutes per step. I've also been using pretty low learning rates for most of the later stages.
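To give a sense of why a single step takes that long: with accumulation, one optimizer step is many forward/backward passes, so step time scales with the accumulation factor (numbers below are illustrative, not measurements):

    # Illustrative arithmetic, not my measured timings.
    seconds_per_micro_batch = 10    # one fwd+bwd pass of the 1.5B on a Colab GPU
    accum_steps = 64
    seconds_per_optimizer_step = seconds_per_micro_batch * accum_steps  # 640 s, > 10 min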

Overall the model is currently at ~11k steps, and the loss can actually go down further, but after playing with different checkpoints last week the best one didn't seem to be the newest, so I left it at that one.
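Comparing checkpoints just means scoring each one on held-out text instead of trusting the training loss (checkpoint names and paths are placeholders, and val_loader is assumed to be a DataLoader over the held-out set):

    import math
    import torch
    from transformers import GPT2LMHeadModel

    @torch.no_grad()
    def eval_loss(ckpt_path, val_loader):
        model = GPT2LMHeadModel.from_pretrained(ckpt_path).cuda().eval()
        total, n = 0.0, 0
        for batch in val_loader:
            ids = batch["input_ids"].cuda()
            total += model(ids, labels=ids).loss.item()
            n += 1
        return total / n

    # Placeholder checkpoint names; pick by validation loss/perplexity
    # rather than assuming the latest checkpoint is the best.
    for ckpt in ["step_9000", "step_10000", "step_11000"]:
        loss = eval_loss(f"/content/drive/MyDrive/gpt2_ckpts/{ckpt}", val_loader)
        print(ckpt, loss, math.exp(loss))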




