
Totally agree re: UTF-8 vs other Unicode encodings.

But are there still hold-outs who don't like Unicode? Last I heard some CJK users were unhappy about Han Unification: http://en.wikipedia.org/wiki/Han_unification



I am a Korean user (K in CJK), and no one, I repeat, no one, cares about Han unification here.

I heard that it is different in China and Japan though.


Probably because modern Korean text is Hangul, which is not really derived from the Han characters Chinese and Japanese have in common.

http://en.wikipedia.org/wiki/Hangul

http://en.wikipedia.org/wiki/Chinese_characters


Hanja is widely used in modern Korea.

http://en.wikipedia.org/wiki/Hanja


The main problem is that it means sort-by-unicode-codepoint puts things in a ridiculous order in Japanese/Korean. I kind of wish Unicode had the Latin alphabet in a silly order, so that western programmers would realise they need to use locale-aware sort when sorting strings for display.
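A quick sketch of the trap, in Python (the word lists are just illustrative): even within the Latin alphabet, codepoint order puts all uppercase letters (U+0041..U+005A) before all lowercase ones (U+0061..U+007A), so a naive sort already misorders mixed-case lists:

```python
# Codepoint sort: every uppercase letter sorts before every lowercase
# letter, so "Zebra" comes before "apple" regardless of the user's locale.
words = ["apple", "Zebra"]
print(sorted(words))  # ['Zebra', 'apple'] -- surprising for display

# A simple (but still not truly locale-aware) mitigation is a
# case-insensitive key; real collation needs locale.strxfrm or a
# library like PyICU, with rules chosen per locale.
print(sorted(words, key=str.casefold))  # ['apple', 'Zebra']
```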


This is false. Codepoint order sorts Korean almost correctly. For practical purposes, you can use sort-by-unicode-codepoint to sort Korean.
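For what it's worth, this checks out for modern Korean text: the precomposed Hangul syllables block (U+AC00..U+D7A3) is laid out in Korean alphabetical (ganada) order, so a plain codepoint sort of the illustrative words below comes out in dictionary order:

```python
# Precomposed Hangul syllables are arranged in ganada order, so
# codepoint sort agrees with Korean dictionary order for most
# modern text (archaic jamo are a separate story).
words = ["다리", "가방", "나무"]  # illustrative words
print(sorted(words))  # ['가방', '나무', '다리'] -- ganada order
```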


I spoke with several Japanese people who said that some valid characters are not representable in Unicode.

That means that it's not just a technical problem (expensive sort routines or inefficient encodings) -- it's a semantic problem.


The way I've heard it explained, there are some historical alternate versions of some characters (a Latin-alphabet equivalent might be the way we sometimes draw "a" with an extra curl across the top, and sometimes without) that have the exact same semantic meaning, and so they were 'unified' to a single code point. Unfortunately, some people spell their names exclusively with one variant or the other, and Han unification makes that impossible in Unicode.


Isn't the real problem that you can't guarantee correct rendering of ideograph text without specifying fonts? There are Japanese kanji that are drawn differently from the Chinese hanzi they're descended from, but they're the same from a Unicode perspective.

Imagine if Roman, Greek, Cyrillic, Hebrew (Aramaic), and Ethiopian (Ge'ez) were all assigned to the same group of code points and distinguishable only by font; they're all just variants of Phoenician, after all....


Do you think that sort-by-unicode-codepoint is good enough to use for technical contexts where most content is English or at least represented in the Latin alphabet? For example, do you think it's a valid choice to sort by codepoint for Java symbol names in a code refactoring tool?

I ask because I expect that sort-by-codepoint is an order of magnitude more efficient.


It's not a valid choice for anything that actually uses Unicode. E.g. if I have functions caféHide() and caféShow() I expect them to be next to each other. I think Java should perhaps have required symbol names to be ASCII, but it doesn't, and Java tools should deal with this.


Sorting by Unicode codepoint does not work well in most Western languages either. Among the languages written in Latin script, English is almost the only one where it works well enough to be usable.

For example, the sort order is broken for all the Nordic languages.
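A concrete Swedish case (words are illustrative): the Swedish alphabet ends ..., z, å, ä, ö, but the codepoints run ä (U+00E4), å (U+00E5), ö (U+00F6), so a plain codepoint sort swaps å and ä relative to dictionary order:

```python
# Swedish dictionary order: zebra, åka, ärlig, ödla (å, ä, ö after z).
# Codepoint order instead puts ä (U+00E4) before å (U+00E5).
words = ["ödla", "åka", "ärlig", "zebra"]
print(sorted(words))
# ['zebra', 'ärlig', 'åka', 'ödla'] -- ä before å: wrong for Swedish
```

Danish and Norwegian (æ, ø, å after z, with å last) break in a similar way, since å (U+00E5) has a lower codepoint than both æ (U+00E6) and ø (U+00F8).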



