
It's definitely not the case for me. I have models trained on the same 14 MB dataset (though I needed to tweak more for the 1.5B).

The 1.5B outperforms it here if trained long enough; in this case 1-2 months, since I was doing it all for free on Colab.

One of the big things was batching: it seems like nobody really tries to use larger batches with the biggest models, and without batching, given how little data I have, the model was getting stuck.
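By batching I mean gradient accumulation, roughly along these lines (the model name, batch sizes, and learning rate below are illustrative, not my exact settings; the loader is assumed to yield pre-tokenized batches with "input_ids"):

    import torch
    from transformers import GPT2LMHeadModel

    # Illustrative settings, not my actual ones.
    micro_batch_size = 2   # what fits on the Colab GPU at once
    accum_steps = 32       # effective batch = micro_batch_size * accum_steps

    model = GPT2LMHeadModel.from_pretrained("gpt2-xl").cuda()   # 1.5B
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    def train_epoch(loader):
        model.train()
        optimizer.zero_grad()
        for i, batch in enumerate(loader):
            input_ids = batch["input_ids"].cuda()
            # Causal LM loss: labels are the inputs, shifted internally.
            loss = model(input_ids, labels=input_ids).loss / accum_steps
            loss.backward()   # gradients accumulate across micro-batches
            if (i + 1) % accum_steps == 0:
                optimizer.step()       # one "step" = accum_steps micro-batches
                optimizer.zero_grad()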



You trained (finetuned) GPT-2 for 1-2 months on 14 MB of data?

I don't understand how this doesn't massively overfit. How much of those 1-2 months was the model actually training?


I train for maybe ~12 hours a day, though some days, especially around Christmas, I didn't. I also lost a lot of days trying out different things, or when the weights didn't save to Drive before the Colab session timed out.
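The Drive saving is just the usual mount-and-checkpoint pattern, roughly like this (the path and the save interval are placeholders):

    from google.colab import drive
    drive.mount("/content/drive")

    save_every = 200                                  # steps; illustrative
    ckpt_dir = "/content/drive/MyDrive/gpt2_ckpts"    # illustrative path

    def maybe_save(step, model, tokenizer):
        # Write a full checkpoint to Drive so a timeout only loses
        # the work done since the last save.
        if step % save_every == 0:
            out = f"{ckpt_dir}/step_{step}"
            model.save_pretrained(out)
            tokenizer.save_pretrained(out)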

Having said that, I was training the full model with an accumulated batch size for a while, so it was taking >10 minutes per step. I've also been using pretty low learning rates for most of the later stages.
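To give a sense of why a single step takes that long: with accumulation, one optimizer step is many forward/backward passes, so step time scales with the accumulation factor (numbers below are illustrative, not measurements):

    # Illustrative arithmetic, not my measured timings.
    seconds_per_micro_batch = 10    # one fwd+bwd pass of the 1.5B on a Colab GPU
    accum_steps = 64
    seconds_per_optimizer_step = seconds_per_micro_batch * accum_steps  # 640 s, > 10 min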

Overall the model is currently at ~11k steps, and the loss can actually go down further, but after playing with different checkpoints last week the best one didn't seem to be the newest, so I left it at that one.
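Comparing checkpoints just means scoring each one on held-out text instead of trusting the training loss (checkpoint names and paths are placeholders, and val_loader is assumed to be a DataLoader over the held-out set):

    import math
    import torch
    from transformers import GPT2LMHeadModel

    @torch.no_grad()
    def eval_loss(ckpt_path, val_loader):
        model = GPT2LMHeadModel.from_pretrained(ckpt_path).cuda().eval()
        total, n = 0.0, 0
        for batch in val_loader:
            ids = batch["input_ids"].cuda()
            total += model(ids, labels=ids).loss.item()
            n += 1
        return total / n

    # Placeholder checkpoint names; pick by validation loss/perplexity
    # rather than assuming the latest checkpoint is the best.
    for ckpt in ["step_9000", "step_10000", "step_11000"]:
        loss = eval_loss(f"/content/drive/MyDrive/gpt2_ckpts/{ckpt}", val_loader)
        print(ckpt, loss, math.exp(loss))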




