Yeah, I felt the same way. Although perhaps at a larger scale the fine-tuning makes a bigger difference? The results here go against that hypothesis, but OpenAI does claim that GPT-3 only needs about 200 examples to fine-tune, so who knows. In fact, I wonder how well GPT-3 would do against this when fine-tuned on just 200 examples.
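
For anyone curious, that experiment would be pretty cheap to run. Here's a rough sketch of a 200-example fine-tune using the legacy OpenAI Python SDK (openai<1.0, the one that exposed the davinci fine-tune endpoint); the `task_pairs` data is a placeholder for whatever benchmark you'd pull from, and the separator / leading-space convention is from OpenAI's old fine-tuning guide, as best I remember it:

```python
# Rough sketch: fine-tune base davinci (GPT-3) on 200 examples,
# assuming the legacy OpenAI SDK (openai<1.0) and its
# prompt/completion JSONL format.
import json
import openai

openai.api_key = "sk-..."  # your API key

# Placeholder: 200 (input, output) pairs from the task in question.
task_pairs = [("What is 2+2?", "4")] * 200

with open("train.jsonl", "w") as f:
    for prompt, completion in task_pairs:
        # Legacy format: fixed separator at the end of the prompt,
        # leading space on the completion.
        f.write(json.dumps({
            "prompt": prompt + "\n\n###\n\n",
            "completion": " " + completion,
        }) + "\n")

# Upload the training file, then kick off the fine-tune job.
training_file = openai.File.create(
    file=open("train.jsonl", "rb"), purpose="fine-tune"
)
job = openai.FineTune.create(training_file=training_file.id, model="davinci")
print(job.id)
```

Whether 200 examples is actually enough probably depends a lot on how close the task is to what the base model already does well.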