
> [...'man\u0303ana'].length => 7 // WTF? Why doesn't this match the behavior of [...'\u{1F4A9}'].length ?

The key thing to remember is that iteration over Unicode strings only makes sense as iteration over code points, not UCS-2 characters, not bytes, not grapheme clusters. The JS String iterator was very deliberately made to iterate over code points. That `length` reports UCS-2 characters (UTF-16 code units) is a historical mistake. That padding operates on UCS-2 characters probably reflects the fact that the operation isn't well-defined beyond ASCII.
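To make the distinction concrete, here is a minimal sketch of the two counts and of padding cutting a character in half. The helper name `lengths` is made up for illustration; everything else is standard JS.

```javascript
// Hypothetical helper: report two different "lengths" of the same string.
function lengths(s) {
  return {
    codeUnits: s.length,        // UTF-16 code units, what `length` reports
    codePoints: [...s].length,  // the String iterator yields code points
  };
}

const poo = '\u{1F4A9}';        // one code point outside the BMP (a surrogate pair)
const manana = 'man\u0303ana';  // "n" followed by U+0303 COMBINING TILDE

console.log(lengths(poo));      // { codeUnits: 2, codePoints: 1 }
console.log(lengths(manana));   // { codeUnits: 7, codePoints: 7 }

// Padding counts code units too, so it can truncate the filler
// in the middle of a surrogate pair and leave a lone surrogate.
const padded = 'x'.padStart(2, poo);
console.log(padded.charCodeAt(0).toString(16)); // d83d
```

So the 7 in the quoted example is consistent with the emoji case: both are code point counts, and the combining tilde is its own code point.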



> The key thing to remember is that iteration over Unicode strings only makes sense as iteration over code points, not UCS-2 characters, not bytes, not grapheme clusters.

There are tons of situations where iterating over grapheme clusters is what you want to do.
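For example, counting what a user would perceive as characters. A rough sketch, assuming a runtime that ships `Intl.Segmenter` (recent Node and browsers do); the function name `graphemes` and the `'en'` locale are my own choices here, not something from the thread.

```javascript
// Hypothetical helper: split a string into grapheme clusters.
// Assumes Intl.Segmenter is available in the runtime.
function graphemes(s) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  // segment() returns an iterable of { segment, index, input } records.
  return Array.from(segmenter.segment(s), (part) => part.segment);
}

const word = 'man\u0303ana';            // "n" + U+0303 COMBINING TILDE

console.log([...word].length);          // 7 code points
console.log(graphemes(word).length);    // 6 grapheme clusters
console.log(graphemes(word)[2] === 'n\u0303'); // true: base letter and tilde stay together
```

Reversing a string, truncating it for display, or moving a cursor are the usual cases where this is the unit you want.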


And tons of situations where you want neither of the two (e.g. NFD vs. NFC). The Cairo graphics library has text-rendering utilities that its reference explicitly calls "toy text" functions, leaving serious rendering to Pango. That's fair. Languages should not call Unicode strings "Unicode strings" unless they are covered in detail by dedicated libraries with distinct names for ucp/ucs/etc. lengths, iterators, and so on. There is no such thing as string length or "char" anymore. A string is blank or non-blank; anything beyond that is too complex to be part of any stdlib. Even "blank" is not so obvious today.
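The NFD vs. NFC point is easy to see with `String.prototype.normalize`. A small sketch; the variable names are just for illustration.

```javascript
// The same visible word in two normalization forms.
const decomposed = 'man\u0303ana';  // NFD-style: "n" + U+0303 COMBINING TILDE
const composed = 'ma\u00F1ana';     // NFC-style: precomposed U+00F1

console.log(decomposed === composed);          // false: different code points
console.log([...decomposed].length);           // 7
console.log([...composed].length);             // 6

// Normalizing both sides to the same form makes them compare equal.
console.log(decomposed.normalize('NFC') === composed);   // true
console.log(composed.normalize('NFD') === decomposed);   // true
```

Neither code point iteration nor grapheme iteration tells you these two strings are "the same"; that takes a normalization step first.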



