The discussion here has tended to focus on either one or the other of two different facts:
You absolutely need to be aware of and manage language and locale settings diligently. This is a big topic, but generally, whenever possible, use UTF8. And you generally don't want to change your desktop OS or applications' locale or language settings other than that.
However, if you are processing data files and are aware of the implications, and can get away with treating data as binary instead of character streams -- for example, any of the command sequences that use sort/uniq for uniqueness or set union/intersection/difference -- then using C locale is way faster. As in, the difference between doing something in an hour on a big machine and rewriting a whole pipeline in Hadoop. A good fact to know.
It's obvious I (the author) didn't make the simultaneous truth of both these points clear. I'll rewrite.
You absolutely need to be aware of and manage language and locale settings diligently. This is a big topic, but generally, whenever possible, use UTF8. And you generally don't want to change your desktop OS or applications' locale or language settings other than that.
However, if you are processing data files and are aware of the implications, and can get away with treating data as binary instead of character streams -- for example, any of the command sequences that use sort/uniq for uniqueness or set union/intersection/difference -- then using C locale is way faster. As in, the difference between doing something in an hour on a big machine and rewriting a whole pipeline in Hadoop. A good fact to know.
It's obvious I (the author) didn't make the simultaneous truth of both these points clear. I'll rewrite.