These things are done using a risk framework. Small models are more obviously predictable: either the model works, or it's very clear its output is too unreliable to use.
Larger models carry a different risk because this is no longer the case: failure is less visible, and they can game the checks, so they can seem reliable or aligned when they're not.