It also provides a way to post data on the public web in an obfuscated form that a human can read but that automated search tools are likely not looking for.
Great method if you have short human-readable information that you don't want AI to train on ;)
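If you want to try it, here's a minimal sketch that maps ASCII letters onto the Mathematical Bold block (U+1D400 onward), one of the styled ranges commonly used for this:

    # Restyle ASCII letters as Unicode "Mathematical Bold" lookalikes.
    # The bold block starting at U+1D400 has no gaps, so a plain offset works.
    def stylize(text: str) -> str:
        out = []
        for ch in text:
            if "A" <= ch <= "Z":
                out.append(chr(0x1D400 + ord(ch) - ord("A")))
            elif "a" <= ch <= "z":
                out.append(chr(0x1D41A + ord(ch) - ord("a")))
            else:
                out.append(ch)  # digits and punctuation pass through
        return "".join(out)

    print(stylize("Hello, world"))  # 𝐇𝐞𝐥𝐥𝐨, 𝐰𝐨𝐫𝐥𝐝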
I wrote a tiny pipeline to check (sketched below), and it seems styled Unicode has a very modest effect on an LLM's ability to understand text. This doesn't mean it has no effect in training, but it's not unreasonable to think that with a wider corpus a model would learn to represent it better.
Notably, when the test is repeated with gpt-4o-mini, the model is almost completely unable to read such text. I wonder whether this correlates with a model's ability to decode Base64.
I removed most count = 1 samples to make the comment shorter.
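For concreteness, here's a minimal sketch of the kind of check I mean, assuming the official openai Python client; the prompt and the crude word-overlap score are placeholders, not the exact pipeline:

    # Ask a model to de-style a piece of styled-Unicode text, then score
    # how much of the original it recovered. Assumes OPENAI_API_KEY is set.
    from openai import OpenAI

    client = OpenAI()

    def readback_accuracy(model: str, styled: str, original: str) -> float:
        resp = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f"Rewrite the following text in plain ASCII:\n\n{styled}",
            }],
        )
        answer = (resp.choices[0].message.content or "").lower()
        words = original.lower().split()
        return sum(w in answer for w in words) / len(words)

    # "the quick brown fox" restyled as Mathematical Bold characters.
    print(readback_accuracy("gpt-4o-mini", "𝐭𝐡𝐞 𝐪𝐮𝐢𝐜𝐤 𝐛𝐫𝐨𝐰𝐧 𝐟𝐨𝐱", "the quick brown fox"))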
There was a paper on using adversarial typography to make a corpus "unlearnable" to an LLM [0]: it finds that certain tokens play an important role in recall and obfuscates those tokens with Unicode and SVG lookalikes. Worth a look if you're interested.
Unicode obfuscation tricks trigger modern content filters faster than you can blink. Using them is actually the best way to get a message blocked automatically.
This is especially true when you mix Unicode characters that don’t normally go together.
(Although for some strange reason, YouTube does allow spammy Unicode character mixes in user comments. I don't know why.)
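For what it's worth, the detection side is cheap. Here's an illustrative sketch of the kind of heuristic a filter might run; real filters are proprietary and surely more involved:

    # Flag text whose letters come from the Mathematical Alphanumeric
    # Symbols block. Illustrative only; a real filter would also check
    # script mixing, confusables, etc.
    import unicodedata

    def looks_obfuscated(text: str, threshold: float = 0.2) -> bool:
        letters = [ch for ch in text if ch.isalpha()]
        if not letters:
            return False
        styled = sum(
            unicodedata.name(ch, "").startswith("MATHEMATICAL")
            for ch in letters
        )
        return styled / len(letters) >= threshold

    print(looks_obfuscated("𝐇𝐞𝐥𝐥𝐨 there"))  # True
    print(looks_obfuscated("Hello there"))   # False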