It also provides a way to post data on the public web in an obfuscated form that a human can read but that automated scrapers are likely not looking for.
Great method if you had short human-readable information that you didn't want AI to train on ;)
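
For anyone curious what the remapping looks like in practice, here is a minimal sketch in Python. The choice of the "Mathematical Bold" block (starting at U+1D400) is just one example; other styled blocks work the same way but some have gaps:

    # Remap ASCII onto the Unicode "Mathematical Bold" block.
    # The result renders as lookalike glyphs but tokenizes very differently.
    def to_math_bold(text: str) -> str:
        out = []
        for ch in text:
            if "A" <= ch <= "Z":
                out.append(chr(ord(ch) - ord("A") + 0x1D400))  # bold capitals
            elif "a" <= ch <= "z":
                out.append(chr(ord(ch) - ord("a") + 0x1D41A))  # bold lowercase
            elif "0" <= ch <= "9":
                out.append(chr(ord(ch) - ord("0") + 0x1D7CE))  # bold digits
            else:
                out.append(ch)  # punctuation and spaces pass through unchanged
        return "".join(out)

    print(to_math_bold("Hello, world 123"))  # 𝐇𝐞𝐥𝐥𝐨, 𝐰𝐨𝐫𝐥𝐝 𝟏𝟐𝟑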
I wrote a tiny pipeline to check, and it seems styled Unicode has only a modest effect on an LLM's ability to understand text (a rough sketch of that kind of check is at the end of this comment). That doesn't mean it has no effect at training time, but it's not unreasonable to think that with a wider corpus a model would learn to represent it even better.
Notably, when I repeated the test with gpt-4o-mini, the model was almost completely unable to read such text. I wonder whether this correlates with a model's ability to decode Base64.
I removed most count = 1 samples to make the comment shorter.
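
The rough shape of such a check, for reference (not my exact pipeline; the model name, prompt, and scoring here are just placeholders): restyle a sentence, ask a model to transcribe it back to plain ASCII, and measure how many of the original words survive.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def restyle(text: str) -> str:
        # Remap a-z onto the Mathematical Bold block (U+1D41A onward).
        return "".join(chr(ord(c) - ord("a") + 0x1D41A) if "a" <= c <= "z" else c
                       for c in text)

    def readability_score(plain: str, model: str = "gpt-4o-mini") -> float:
        styled = restyle(plain)
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": "Transcribe the following text into plain ASCII, "
                                  "replying with only the transcription:\n" + styled}],
        )
        answer = resp.choices[0].message.content.strip().lower()
        words = plain.lower().split()
        # Fraction of the original words the model recovered.
        return sum(w in answer for w in words) / len(words)

    print(readability_score("the quick brown fox jumps over the lazy dog"))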
There was a paper on using adversarial typography to make a corpus "unlearnable" to an LLM [0]; it finds that certain tokens play an important role in recall and obfuscates them with Unicode and SVG lookalikes. If you're interested, I suggest taking a look.