
Totally agree re: UTF-8 vs other Unicode encodings.

But are there still hold-outs who don't like Unicode? Last I heard some CJK users were unhappy about Han Unification: http://en.wikipedia.org/wiki/Han_unification



I am a Korean user (K in CJK), and no one, I repeat, no one, cares about Han unification here.

I heard that it is different in China and Japan though.


Probably because modern Korean text is Hangul, which is not really derived from the Han characters Chinese and Japanese have in common.

http://en.wikipedia.org/wiki/Hangul

http://en.wikipedia.org/wiki/Chinese_characters


Hanja is widely used in modern Korea.

http://en.wikipedia.org/wiki/Hanja


The main problem is that it means sort-by-unicode-codepoint puts things in a ridiculous order in Japanese/Korean. I kind of wish Unicode had the Latin alphabet in a silly order, so that western programmers would realise they need to use locale-aware sort when sorting strings for display.
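A quick sketch of the trap, in Python (the word lists are just illustrative): even within the Latin alphabet, codepoint order puts all uppercase letters (U+0041..U+005A) before all lowercase ones (U+0061..U+007A), so a naive sort already misorders mixed-case lists:

```python
# Codepoint sort: every uppercase letter sorts before every lowercase
# letter, so "Zebra" comes before "apple" regardless of the user's locale.
words = ["apple", "Zebra"]
print(sorted(words))  # ['Zebra', 'apple'] -- surprising for display

# A simple (but still not truly locale-aware) mitigation is a
# case-insensitive key; real collation needs locale.strxfrm or a
# library like PyICU, with rules chosen per locale.
print(sorted(words, key=str.casefold))  # ['apple', 'Zebra']
```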


This is false. Codepoint order sorts Korean almost correctly. For practical purposes, you can use sort-by-unicode-codepoint to sort Korean.
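For what it's worth, this checks out for modern Korean text: the precomposed Hangul syllables block (U+AC00..U+D7A3) is laid out in Korean alphabetical (ganada) order, so a plain codepoint sort of the illustrative words below comes out in dictionary order:

```python
# Precomposed Hangul syllables are arranged in ganada order, so
# codepoint sort agrees with Korean dictionary order for most
# modern text (archaic jamo are a separate story).
words = ["다리", "가방", "나무"]  # illustrative words
print(sorted(words))  # ['가방', '나무', '다리'] -- ganada order
```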


I spoke with several Japanese people who said that some valid characters are not representable in Unicode.

That means that it's not just a technical problem (expensive sort routines or inefficient encodings) -- it's a semantic problem.


The way I've heard it explained, there are some historical alternate versions of some characters (a Latin-alphabet equivalent might be the way we sometimes draw "a" with an extra curl across the top, and sometimes without) that have the exact same semantic meaning, and so they were 'unified' to a single code point. Unfortunately, some people spell their names exclusively with one variant or the other, and Han unification makes that impossible in Unicode.


Isn't the real problem that you can't guarantee correct rendering of ideograph text without specifying fonts? There are Japanese kanji that are drawn differently from the Chinese hanzi they're descended from, but they're the same from a Unicode perspective.

Imagine if Roman, Greek, Cyrillic, Hebrew (Aramaic), and Ethiopian (Ge'ez) were all assigned to the same group of code points and distinguishable only by font; they're all just variants of Phoenician, after all....


Do you think that sort-by-unicode-codepoint is good enough to use for technical contexts where most content is English or at least represented in the Latin alphabet? For example, do you think it's a valid choice to sort by codepoint for Java symbol names in a code refactoring tool?

I ask because I expect that sort-by-codepoint is an order of magnitude more efficient.


It's not a valid choice for anything that actually uses Unicode. E.g. if I have functions caféHide() and caféShow() I expect them to be next to each other. I think Java should perhaps have required symbol names to be ASCII, but it doesn't, and Java tools should deal with this.


Sorting by Unicode codepoint does not work well in most Western languages either. Among the languages written in Latin script, English is almost the only one where it works well enough to be usable.

For example, the sort order is broken for all the Nordic languages.
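A concrete Swedish case (words are illustrative): the Swedish alphabet ends ..., z, å, ä, ö, but the codepoints run ä (U+00E4), å (U+00E5), ö (U+00F6), so a plain codepoint sort swaps å and ä relative to dictionary order:

```python
# Swedish dictionary order: zebra, åka, ärlig, ödla (å, ä, ö after z).
# Codepoint order instead puts ä (U+00E4) before å (U+00E5).
words = ["ödla", "åka", "ärlig", "zebra"]
print(sorted(words))
# ['zebra', 'ärlig', 'åka', 'ödla'] -- ä before å: wrong for Swedish
```

Danish and Norwegian (æ, ø, å after z, with å last) break in a similar way, since å (U+00E5) has a lower codepoint than both æ (U+00E6) and ø (U+00F8).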



