It is still great work, though: a robust masked-prediction architecture for representation learning that works across modalities.
> We apply data2vec separately to speech, images and text and it outperformed the previous best single-purpose algorithms for computer vision and speech and it is competitive on NLP tasks.