
In languages like C, a “string” isn’t a proper data structure; it’s a `char` array, which itself is little more than an `int` array or a `byte` array.

But these languages don’t provide true “string” support. They just have a vaguely useful type alias that renames a byte array to a char array, and a bunch of byte array functions that have been renamed to sound like string functions. In reality, all the language really supports is byte arrays, with some syntactic sugar so you can pretend they’re strings.
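
A rough sketch of that failure mode, using Python 3’s bytes type to stand in for a C-style byte array (an illustrative analogue, not C itself):

    # Treat the raw UTF-8 bytes as if they were the string, the way a char
    # array does, and the byte-oriented operations happily cut a character
    # in half.
    s = "café"                    # 4 characters as a person would count them
    b = s.encode("utf-8")         # 5 bytes: the é is 0xC3 0xA9

    print(len(b))                 # 5 -- a byte count, not a character count
    print(b[:4])                  # b'caf\xc3' -- half of the é
    print(b[:4].decode("utf-8", errors="replace"))  # 'caf\ufffd'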

Newer languages, like Go and Python 3, that were created in a world of Unicode provide true string types, where the type primitives properly deal with the idea of variable-length characters and provide tools that make it easy to manipulate strings and characters as independent concepts. If you want to ignore Unicode, because your specific application doesn’t need to understand it, then you need to cast your strings into byte arrays, and all pretences of true string manipulation vanish at the same time.
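
For a concrete sketch of that distinction, in Python 3 (the values are just illustrative):

    s = "naïve"                   # one str value, 5 characters (code points)
    print(len(s))                 # 5
    print(s[2])                   # 'ï' -- indexing works on characters, not bytes

    b = s.encode("utf-8")         # the explicit cast down to a byte array
    print(len(b))                 # 6 -- the ï takes two bytes in UTF-8
    print(b[2])                   # 195 -- just a byte value; the string view is gone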

This is not to say that C can’t handle Unicode; it’s just that, as with strings generally, the language doesn’t provide true primitives and instead relies on libraries to provide that functionality, which is a perfectly valid approach. Baking more complex string primitives into your language is also a perfectly valid approach. It’s just a question of trade-offs and use cases, i.e. the problem at the heart of all good engineering.



Having your strings be conceptually made up of UTF-8 code units makes them no less strings than those made up of Unicode code points. As this article shows, working with code points is often not the right abstraction anyway, and you need to go all the way up to grapheme clusters to have anything close to what someone would intuitively call a character. Calling a code point a character is no more correct or useful than calling a code unit a char.
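
A small Python 3 illustration of that gap (the strings are only examples; counting grapheme clusters properly needs something outside the built-in type, e.g. an ICU binding or the third-party regex module’s \X pattern):

    # "é" written as e + COMBINING ACUTE ACCENT: one user-perceived character,
    # two code points, three UTF-8 code units.
    s = "e\u0301"
    print(s)                          # é
    print(len(s))                     # 2 -- code points
    print(len(s.encode("utf-8")))     # 3 -- UTF-8 code units

    # Reversing by code point tears the grapheme cluster apart: the combining
    # mark ends up with no base letter to attach to.
    print("abce\u0301"[::-1])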

All you gain by having Unicode code point strings is the illusion of Unicode support, until you test anything that uses combining characters or variant selectors. In essence, languages opting for such strings are making the same mistake that Windows/Java/etc. made when adopting UTF-16.
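
The parallel is easy to see from Python 3, which can report both counts for the same text (purely illustrative):

    # U+1F600 is one code point, but it needs a surrogate pair in UTF-16,
    # which is exactly the case that breaks the "one code unit = one
    # character" assumption UTF-16 APIs were built on.
    s = "\U0001F600"
    print(len(s))                           # 1 code point
    print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units

    # Same shape of problem one level up: one grapheme cluster can be
    # several code points, so "one code point = one character" breaks too.
    print(len("e\u0301"))                   # 2 code points, 1 perceived character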



