To double down on it: it's not even about a weird name per se, but about a lot of very normal EU names.
If early IT hadn't been dominated by a US "works for us, must work for everyone" approach, I think we never would have ended up with the limitations common in legacy systems. (There would still be limitations; pre-Unicode, the solution was custom code pages and similar, which all supported some subset of non-US-ASCII, but only a subset.)
Luckily, today Unicode is the standard (though for some cultural and historic aspects it's sometimes not enough).
I don't think this is strictly a technical problem, at least not when it happens in international contexts (it's inexcusable for your own country's authorities to not be able to record your actual culturally-specific name).
The reality is that there is a limited char set that is actually understandable at an international level, and it's not that different from ASCII. Even with paper systems, if you go to Rome to sign into a hotel and you give your name as 依诺, they will not be able to even write it down, never mind pronounce it. And even if you tell them it's Yī Nuò, they will likely ignore the accents since they won't know what those mean. Similarly, if you go to China and say your name is Sângeorz-Băi, they will not know what the diacritics mean and will not be easily able to write them down.
Throughout history, when multiple cultures interact, they have had to find a common subset of their languages to communicate in, and that includes names. The situation in writing is actually much, much better, even if limited to ASCII, than it is in actual spoken language. Maybe you can write my name down perfectly (Simionescu), but I would bet you won't use the proper pronunciation unless you happen to know Romanian - you will likely use different vowels and consonants.
What's interesting is that the transliteration of symbols is not even remotely uniform in ICAO 9303. There are multiple recommended transliterations of some characters, and it definitely goes only in one direction: national script -> MRZ transliteration. It is not possible to go the other direction.
It's not intended to round-trip; it's intended to be roughly human-readable without knowledge of the original script. It's pretty close to the system the Olympics used, with the Wikipedia example of Hämäläinen -> HAEMAELAEINEN being a well-known gold-medalist cross-country skier.
Newer versions of the transliteration encourage stripping diacritics, so that would be HAMALAINEN. Much more readable to native speakers, but obviously loses information.
I wouldn't recommend it, as the countries that use diacritics have official transformations, and most of the time those are not simply to strip them. It's kind of another case of people forcing stuff onto other cultures. And if you do business in some of those countries, in some industries you might even get into legal trouble if you apply that.
Take it up with the spec, then. That is the recommendation:
> Section 6 of the 9303 part 3 document specifies transliteration of letters outside the A–Z range. It recommends that diacritical marks on Latin letters A-Z are simply omitted (ç → C, ð → D, ê → E, ñ → N etc.), but it allows the following transliterations: [...]
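For illustration, here's a minimal Python sketch of that scheme. The mapping tables are an illustrative subset assumed from the spec text above, not the full 9303 tables:

    import unicodedata

    # Letters with no Unicode decomposition need explicit entries
    # (illustrative subset; the real spec table is longer).
    BASE = {"Ð": "D", "Þ": "TH", "Æ": "AE", "ß": "SS"}
    # Multi-letter transliterations the spec additionally allows.
    ALLOWED = {"Ä": "AE", "Ö": "OE", "Ü": "UE", "Å": "AA", "Ø": "OE"}

    def to_mrz(name: str, use_allowed: bool = True) -> str:
        out = []
        for ch in unicodedata.normalize("NFC", name).upper():
            if use_allowed and ch in ALLOWED:
                out.append(ALLOWED[ch])
            elif ch in BASE:
                out.append(BASE[ch])
            else:
                # Default recommendation: drop the diacritical marks
                # (c-cedilla -> C, e-circumflex -> E, n-tilde -> N, ...).
                nfd = unicodedata.normalize("NFD", ch)
                out.append("".join(c for c in nfd
                                   if not unicodedata.combining(c)))
        return "".join(out)

    print(to_mrz("Hämäläinen"))                     # HAEMAELAEINEN
    print(to_mrz("Hämäläinen", use_allowed=False))  # HAMALAINEN

Note that it only goes one direction, as mentioned above: nothing here can tell you whether a given AE was once Ä, Æ, or a literal "ae".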
You said
> you might even get into legal trouble if you apply that.
We're talking about passports, this seems not relevant. For passport-related use such as travel, you use the form of the name written on the passport, exactly as-is.
It seems odd to me to arbitrarily restrict the alphabet if the only requirement is that the data has to be readable by a machine through an OCR system. They could have easily used the Latin alphabet to encode arbitrary bit strings.
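Something like RFC 4648 base32 would do it: it packs arbitrary bytes into A-Z plus the digits 2-7, all of which would survive an OCR-friendly uppercase-only field. A rough sketch:

    import base64

    raw = "Hämäläinen".encode("utf-8")       # arbitrary bytes (here: UTF-8 text)
    encoded = base64.b32encode(raw)          # uses only A-Z, 2-7 and '=' padding
    decoded = base64.b32decode(encoded).decode("utf-8")
    assert decoded == "Hämäläinen"           # round-trips, unlike MRZ transliteration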
> alphabet if the only requirement is that the data has to be readable by a machine
It's the only stated requirement because:
1. officially, it's only meant for OCR;
2. in practice, it's clear that it won't be used by OCR alone, but also by humans interacting with the system, potentially by people passing this information along by voice over the phone, and by anyone who can't pronounce the original spelling. To be fair: if you, say, have never sung in your life and are somewhat tone-deaf, and are now expected to pronounce an Asian name, you'd probably have to spend days or more before you could do so (just as an extreme example).
If you mean ASCII, you'll also notice that it happens to correspond pretty well with the entirety of writing symbols which have broad global recognition, and this has been true for a long time. Sure, it's missing many many culturally specific things like accents and other diacritics, non-latin writing systems etc.
But none or at least very few of the symbols missing from ASCII are actually broadly understood by people from more than a handful of countries (which can still mean a billion plus people in the case of Chinese, Arabic, and Indian scripts, of course).
Probably Arabic is the biggest counterpoint to my claim, as there is quite a large array of countries across two continents that recognize it. However, even there, there are far more people in countries which use Arabic writing that also recognize Latin letters than the other way around.
The many diacritics used by various European languages are definitely NOT something that has any wide adoption or meaning. Perhaps only the umlaut sign and the accent are even used by more than one or two European languages.
So again, my claim is that any system of writing that is intended for global international communication will have to restrict all names to the A-Z characters in ASCII with spaces as separators (and perhaps 0-9 and a few other characters that would anyway get ignored). Nothing else will work if people around the world are supposed to recognize the name in some meaningful sense. And relying entirely on automated OCR is a no-go for many use cases.
And just like people who interact with those outside their cultures have to accept that their name will be pronounced in a myriad of ways, they have no reason not to accept that it will be written in different ways as well.
> it's missing many many culturally specific things like accents and other diacritics
fun fact: some of the symbols included in ASCII were intended to be used as (non-spacing) diacritical marks, specifically the tilde/caret/backquote characters...
[too lazy to dig up a proper source at the moment but the Wikipedia ASCII article covers some of this]
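The mechanism, as I understand it, was overstriking on paper terminals: print the letter, backspace, print the mark. Roughly:

    # Composing an e-grave on a teletype: letter, BACKSPACE (0x08), accent.
    composed = "e\b`"
    print(list(composed))   # ['e', '\x08', '`'] - three ASCII chars, one glyph on paper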
I completely agree it's not purely technical nor purely social, but it is a real problem. Personally, I lost out on an equity-options exit event due to delays caused partly by these exact issues, plus visas.
In Japan, all residents (citizens and people living with a visa) have to have a katakana spelling of their name. Web forms usually ask for this (in addition to your name spelled in kanji, which of course is impossible if your name is in Latin characters, so you just hope the form accepts Latin), and it's used in many other places as well, such as for bank accounts. Sometimes places will take your Latin-character name and transliterate it themselves, and then this causes problems when their transliteration doesn't match that of other places. (Katakana has far fewer distinct sounds available than most other languages, so the transliteration always loses information, and there are usually different ways to do it.) Even worse, many forms (like web forms) have rather short character limits for the name field, so a transliterated Western name often just won't fit within the ~10 characters they allocate.
> In all times in history, when multiple cultures interact, they have to find a common subset of their languages to communicate in […].
Oh… where do we start… We don't even have to go as far as finding an intersection of multiple languages. Consider English as an example. I have written up a fictitious but reasonably realistic dialogue between:
1. A layperson from a lower social class, who has not attained a high educational level and speaks vernacular English drawing predominantly on the Germanic vocabulary of the language.
2. A citizen of upper-class descent who speaks English almost exclusively with Latin/French/Greek-derived vocabulary.
This layperson complains about not receiving a welfare payment from the state.
Layperson (L): Oi mate, I ain't got me geld from the state yet. That's daft, ain't it? Every man's got a right to his share, right?
Educated citizen of upper-class descent (U): Pardon my incredulity, but are you referencing the monetary allocation designated by the government for individuals of a particular socio-economic standing?
L: Eh? Oh, you mean the dole? Yeah, that. They owe me, but there's no dosh in me pocket yet.
U: If I interpret your sentiment correctly, you are perturbed due to the delayed disbursement of your financial entitlement. Have you endeavoured to communicate with the pertinent authorities?
L: Talk to who now? Oh, you mean the blokes at the town hall? Aye, but they keep spieling some rubbish. Can't make head or tail of it.
U: My advice would be to liaise with the relevant office, elucidate your predicament, and seek resolution. It is paramount to ensure you have met all requisite criteria for the stipend.
L: Right, so you're saying I should have a natter with 'em and make sure everything's shipshape? Just want what's owed to me, y'know.
U: Precisely. Engage in a dialogue with them, ascertain the cause of the discrepancy, and ensure you have fulfilled the necessary prerequisites for the allocation. You deserve your due compensation.
L: Cheers for that. It's a bit of a muddle, all this, but I reckon I'll give it another whirl.
U: I wish you fortitude in your pursuits. If there is an inherent right to such financial assistance, it is imperative you receive it posthaste.
Even though the layperson does understand responses in the fictitious dialogue, that would not be the case in real life. Both speak the same language, yet the responses are generally incomprehensible to the layperson.
In some cases I’d even suspect the incomprehension might go both ways. If the "educated" citizen is truly educated (and not just upper class), they ought to not only understand the layperson, but choose lay words in return. Sticking to their upper-class dialect would be passive-aggressive oppression born out of class contempt…
> In some cases I’d even suspect the incomprehension might go both ways.
Which is also true.
It was a thought experiment to highlight the fact that a lack of comprehension can also arise within the boundaries of a single language. In linguistics, the term for this specific phenomenon is «the social register», and there are plenty of active and thriving languages that employ social registers in daily speech. Korean, for instance, is renowned for having a highly complex system of social registers (effectively, parallel vocabularies) embedded in the spoken language. There are other languages as well.
> If the "educated" citizen is truly educated (and not just upper class), they ought to not only understand the lay person, but chose lay words in return.
And that is also true. Social registers have largely disappeared from mainland European languages, yet an English speaker's accent and choice of words can still reveal substantial details about their socio-economic background.
I get that it was a USAcentric thing, and that we should always be active w.r.t. calling out ethnocentric behavior.
But it was also an "8-bit" thing and an "extremely limited computing resources" thing. EBCDIC was designed in 1963/1964.
I mean, when you've got 8 bits to represent a character, and there are more than 256 possible characters... what do you do?
A truly robust solution like Unicode would not have been feasible with the resources of the day, and even a "simple" 16-bit scheme would barely be able to contain all 50,000 Chinese characters.
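And indeed, today's CJK repertoire doesn't fit in 16 bits; a quick check of that sort (assuming Python 3):

    # A 16-bit code unit tops out at 0xFFFF; CJK Unified Ideographs
    # Extension B starts right above that, which is why UTF-16 needs
    # surrogate pairs for those characters.
    ch = "\U00020000"                          # first character of Extension B
    print(hex(ord(ch)))                        # 0x20000 > 0xffff
    print(len(ch.encode("utf-16-le")) // 2)    # 2 code units: a surrogate pair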
The blame here lies with the Dutch bank who willingly chose an EBCDIC solution in 1995, although I'm sure they were dealing with various constraints and pressures as well.
I can tell you that a German bank I signed up with in 2023, which can't be older than ~5 years, asked for my FULL name as written on my ID. In my case that's "First Second Last", but I go by "First Last". In their infinite wisdom they decided to ask for the full thing (which I understand, it's bank stuff) but never thought about what they call me in their stupid emails.
No one's ever called me "First Second" - not even my parents. And I can't even be mad, but I'm still disappointed. (Fortunately my name is ASCII, and since we can do more than 8+3 I've never had problems.)
This is completely normal for official documents in Germany and it makes sense for us.
Technically we do not have first, second or any other numbered names. Our given names form a set in the mathematical sense and any one is equally valuable. This comes from the tradition of given names being given by godmothers and godfathers and we wouldn't want to get into the issue to ever have to value one of them over another. At least this has been the case in some parts of Germany and has influenced the official regulations for names.
Of course the names have to be put into an order on your ID and to keep things simple banks, schools, authorities, etc. ask you to use that order on their documents.
Traditionally, official documents just used the surname with "Herr" or "Frau" but nowadays they often use just the given name in first position on your ID.
I've never heard of a "First Second" case, with one exception:
Given names can be connected with a dash. In this case the order is fixed and the whole unit is treated like a single name. While in principle arbitrary names can be combined, there are certain very common combinations, like "Hans-Peter", "Karl-Heinz" or "Franz-Xaver". If you happen to be named "Hans Peter" (without the dash), it's likely that people will assume the dash and call you "Hans Peter" or "Hans-Peter" all the time.
There is a very mild version of that in the US -- lower likelihood of blind assumption, but still present: when a set of two given names starts with "Mary" or ends with "Ann/Anne". Examples include "Mary Jane", "Mary Kate", "Jo Ann", and of course "Mary Ann". Some have simply merged into single names like "Maryanne" and "Joanne" more recently. There are probably others.
> This is completely normal for official documents in Germany and it makes sense for us.
Of course, and I mildly apologize for my case of Whataboutism because I actually described the reverse. They're taking the rules too literally and are using the thing they need for official documents everywhere (their marketing/status emails).
I'm just kinda puzzled why they'd think it's a good user experience, especially for people who are not just unused to reading their government-ID name but actually uncomfortable with it (e.g., a pending name change).
My first name starts with a W (let's say it's "Walter"). In India multiple times I have spelled it out over the phone and then received a letter addressed to "Uualter".
Hehe, and my name contains an ß which is often confused with a B.
I started pointing at the machine-readable area instead of just showing my passport…
I've seen that forever and not just on United. I have thought it's something about the underlying SABRE system that many airlines use. Maybe someone here knows more.
> I've seen that forever and not just on United. I have thought it's something about the underlying SABRE system that many airlines use. Maybe someone here knows more.
I don't remember the precise details, but some airline website's password had restrictions at one point that made it super-obvious that they were internally converting alphanumeric passwords to digits based on the US telephone key mapping.
I remember thinking at the time it might ultimately have been due to SABRE (because I believe that's literally one of the oldest computer systems still in use), and screen-scraping some telephone menu system depressingly seems like something someone would do for expediency.
I wouldn't be surprised if a system like that also mangles names.
> Usernames and passwords containing letters need to be translated to numbers to enter them in a Fidelity phone system (like FAST, or if you call a representative). Use your telephone keypad to convert the letters to numbers. There is no case sensitivity. Substitute an asterisk (*) for all special characters.
https://www.fidelity.com/customer-service/need-help-logging-...
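That documented rule is simple enough to sketch (hypothetical helper, not Fidelity's actual code):

    # Telephone-keypad mapping per the quoted rule: letters -> digits,
    # case-insensitive, everything else -> '*'.
    KEYPAD = {"abc": "2", "def": "3", "ghi": "4", "jkl": "5",
              "mno": "6", "pqrs": "7", "tuv": "8", "wxyz": "9"}
    LETTER_TO_DIGIT = {ch: d for letters, d in KEYPAD.items() for ch in letters}

    def to_keypad(secret: str) -> str:
        out = []
        for ch in secret.lower():
            if ch.isdigit():
                out.append(ch)
            elif ch in LETTER_TO_DIGIT:
                out.append(LETTER_TO_DIGIT[ch])
            else:
                out.append("*")    # all special characters collapse to '*'
        return "".join(out)

    print(to_keypad("Hunter2!"))   # 4868372*

Note how many-to-one this is: 'a', 'B' and 'c' all become '2', which drastically shrinks the effective password space.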
I can tell you from personal experience that if you have four names it will turn "First Second Third Last" into "Firstsecondthird Last" (I usually fly Delta).
I asked a check-in agent to fix it, but they said the system would start rejecting my ID if they changed it at all.
On British airways (and I believe other ticket systems that use Amadeus), I often get LastnameTitleFirstnameSecondname all as one word (in caps). It certainly looks funny on the boarding pass, but I've never had any issue getting through security.
Something kinda similar: I applied for a PH passport and they ADDED a third name to my name. Instead of my actual "first second last", on my official PH documents I'm "first-second third last". The third name isn't anywhere in my name, my US birth certificate, or any other identifying documents, and it's not at all what my parents named me. I only use my US passport now, because the PH passport caused a bit of confusion when my US departing airline ticket's first name didn't match it - and if the ticket had matched the PH passport, then on arrival into the US it would not have matched my US passport.
The IT in question started with Hollerith cards, processed by electromechanical equipment. These were originally numeric only — digit n represented by a hole in row n which would stop or start a counter wheel. (Punched cards were processed row by row, not column by column.) The alphabetic extension added a second hole near the top edge, handled using much more complicated and expensive equipment. EBCDIC was originally a straightforward mapping of these holes into an 8 bit space, and its arrangement makes sense seen that way.
ASCII on the other hand derives much more from communications equipment (telegraphy) than IT gear.
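You can still see the card layout in the code points today: the zone punch became the high nibble and the digit punch the low one, which is why the EBCDIC alphabet is three separate runs (A-I, J-R, S-Z) instead of one contiguous A-Z like ASCII. A quick look using Python's cp500 (EBCDIC) codec:

    # Zone 12 -> 0xC?, zone 11 -> 0xD?, zone 0 -> 0xE?; digit punch = low nibble.
    for ch in "AIJRSZ":
        print(ch, hex(ch.encode("cp500")[0]))
    # A 0xc1, I 0xc9, J 0xd1, R 0xd9, S 0xe2, Z 0xe9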
I think you mean “built for us, and meets our needs”. It’s not the US’s problem that other countries don’t necessarily take innovation risks, but instead buy our old stuff.
The difference is that most countries don't expect the world to bow to them (culturally, technologically, etc). While I used Chinese products, I never had to learn how to spell my name with Chinese characters.
Even right now, in modern-day Japan, I have to canonicalize my name in katakana (the syllabary used for foreign/loan words), and all the systems strictly expect a single-word first name and a single-word family name. If you have a middle name, it effectively gets thrown out. Multi-word first and/or last names need to be smooshed together or cut down.
I have encountered even worse issues with digital forms that only accept kanji (Chinese characters) or hiragana (the syllabary used for native Japanese words), the latter of which usually does not support certain sounds that katakana supports. Ashley Tisdale, for example, is normally rendered as アシュレイ・ティスデイル (ashurei tisudeiru) - ティ is actually te with a small -i modifier, which does not usually exist in hiragana. Forcibly converted to hiragana, it turns into あしゅれい・てぃすでいる - but ぃ is not accepted by the form, even if it exists in UTF-8. Your options are converting the ティ into either ち (chi) or て (te), neither of which is ideal, and this may cause mismatches with other systems that properly support the katakana version.
The problem extends further into physical paper forms, where often they provide a very limited amount of boxes for characters, because native Japanese and Chinese names can easily fit within 8 characters. Combine this with the digital systems above and you're bound to have several versions of your name floating around on official documents all mismatching each other.
Some systems that need to print onto physical cards (e.g. getting a 1/3/6 month route pass on your SUICA or PASMO contactless smart cards) are even worse and turn dakuten (diacritics for hiragana/katakana) into their own character. As an example, the character ほ (ho) can be turned into ぼ (bo) using a dakuten, or ぽ (po) using a handakuten. The system will instead render those as two separate characters: ほ゛ and ほ゜ respectively, which cuts down on the number of available characters for the already limited textbox space you're dealing with.
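Amusingly, the ほ゛/ほ゜ behavior those card printers show is essentially Unicode's own canonical decomposition, as a quick check suggests:

    import unicodedata

    # NFD splits the precomposed kana into base kana + combining sound mark,
    # much like the two-character rendering on the printed cards.
    for kana in "ぼぽ":
        parts = unicodedata.normalize("NFD", kana)
        print(kana, [hex(ord(c)) for c in parts])
    # ぼ ['0x307b', '0x3099']  (ほ + combining dakuten)
    # ぽ ['0x307b', '0x309a']  (ほ + combining handakuten)

(The printed cards presumably use the spacing marks U+309B/U+309C rather than the combining ones, but the decomposition idea is the same.)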
The world is full of presumptions about names even today.
> The problem extends further into physical paper forms, where often they provide a very limited amount of boxes for characters, because native Japanese and Chinese names can easily fit within 8 characters.
This happens in Europe quite often, even though many people have longer names.
Any idea if this is why, in Japanese-dubbed anime, the voice actors seriously mangle some English words/names? E.g., they often add a vowel sound to the ends of English words that should end with a percussive syllable.
I.e., do you think it comes from those words/names being written in katakana or hiragana in the dialog scripts, and those systems just can't express the correct pronunciation of such English words/names?
Actually, it's probably a simpler reason than that. Japanese is largely a CV-syllable language (syllables consisting of a consonant plus a vowel); consonant clusters do not exist, and the only final consonant permitted is 'n'. English, by contrast, is a much more phonotactically complex language - consonants can pretty freely appear both before and after vowels in a syllable, and English also has several consonant clusters. Imagine trying to pronounce the word "strengths" if your native language lacks consonant clusters - it's like an English person trying to pronounce the Czech phrase "Strč prst skrz krk". On top of that, Japan is not great at English proficiency (it's definitely weaker than any other rich country, see https://www.ef.com/wwen/epi/).
It's not really that the written language makes the names hard for them to pronounce, it's that the spoken language doesn't make it easy, and there's probably not enough care to try to pronounce them. Where the written language does make it hard, it's usually when people try to localize Japanese media into foreign languages, and the intended references in names are lost because of the mangling process of transcription into katakana.
As an English speaker who has traveled to Japan without learning much of the Japanese language, I agree generally, but I also noticed that there are some cases where a vowel is written but not pronounced. For example, "gozaimasu" is mostly pronounced without the final "u" (a counterpoint to final consonants other than "n" being forbidden) and "gozaimashita" is mostly pronounced without the second "i" (a counterpoint to consonant clusters such as "sht" being forbidden). It gives me the impression that these rules exist more in written Japanese than in spoken Japanese, at which point it becomes less clear why adding a vowel to the end of foreign/imported words is so common. Maybe it's just my English perception that the sounds /s/ and /sh/ consist of pronouncing only a consonant, when in reality the fact that those sounds have duration (not just a moment) means they act more like a vowel even when totally unvoiced!
As I think on this further, even these voiceless /s/ and /sh/ sounds involve putting the lips into either an /u/ or an /i/ shape based on the following vowel even if that is also voiceless, creating that which is not a syllable in English, but perhaps is for this purpose in Japanese. The C-V cadence and final vowel (given lack of final -n) rules are satisfied...
Second: in Japanese dubs, these words are usually not actual English words but Japanese words that originated as borrowings from English, so voice actors don't actually mangle them - the same way English-speaking people don't mangle the word "coffee" as they usually pronounce it, despite it being different from how Italians pronounce "caffè".
> Any idea if this is why, in Japanese-dubbed anime, the voice actors seriously mangle some English words/names? E.g., they often add a vowel sound to the ends of English words that should end with a percussive syllable.
I don't know anything about anime, and little about Japanese, but I think Japanese (and Chinese) have a fairly strict consonant-vowel form for all their syllables. That makes foreign words that have runs of consonants, or that do not end in a vowel, hard to pronounce, so speakers of those languages have a tendency to insert extra vowels to make pronunciation easier for themselves.
It's kind of like how English speakers will usually change the Pinyin "X" (as in Xi Jinping) into an English S or SH sound when they try to speak it, because the actual sound doesn't exist in English.
I think it's more that Japanese speakers just don't have those types of sounds in their phonetic repertoire. Some may be able to pronounce them, but most will not (and may not even notice the difference).
Every person has a certain limited set of consonants, vowels, diphthongs, triphthongs, tones, and even syllables that they are able to recognize and reproduce. You can train yourself to recognize more, but you will probably never be able to pronounce or even distinguish the totality of those used across all languages, even just the living languages on Earth.
Even if you did, there is an added complication: some languages actually use multiple sounds interchangeably, and explicitly distinguishing them may actually confuse you. For example, most European languages treat various consonants as the same "R" sound, even though they are vastly different (the French R is a trill at the back of the throat, the Italian R is a trill near the palate, and the English R is articulated next to the palate without any trill). If you come from a language where these are distinct sounds, you may have trouble understanding that two people who use different R sounds are pronouncing the same word.
There is also the R/L problem: sounds that, to me, a native English speaker, are fairly distinct. However, these are the same sound in Japanese. Because of this, I think it is very hard for Japanese speakers to figure out which one to use, and they get switched all the time.
If modern computers had been invented in China and had had a decade or two headstart on the rest of the world then you may well have had to do just that.
This was an accident of history, not some deliberate plan to get the world to bow to the English speakers. And English was already well established as a major language in trade (due to it being superficially simple to learn), next to German, French and Spanish. China was pretty isolated for a long time culturally as well as geographically and the complexity of its script is another barrier to it being accepted as a common language by the rest of the world.
One of the more interesting things along this line in recent history is that with Brexit the EU no longer has an England/Wales/Scotland and a chunk of Ireland in it, but another chunk of Ireland remains. This led the French to immediately propose that French become the official language of the EU parliament but the rest of the countries wouldn't have it, and rightly so.
> This led the French to immediately propose that French become the official language of the EU parliament but the rest of the countries wouldn't have it, and rightly so.
Didn't happen, they just said they'll use French during their council presidency (not the parliament, it's not even mentioned in your article), that's all, there are no rules against that. They would've done it regardless of Brexit.
"Nothing to do with French" seems a bit strong. It's related. From Britannica:
> lingua franca, (Italian: “Frankish language”) language used as a means of communication between populations speaking vernaculars that are not mutually intelligible. The term was first used during the Middle Ages to describe a French- and Italian-based jargon, or pidgin, that was developed by Crusaders and traders in the eastern Mediterranean and characterized by the invariant forms of its nouns, verbs, and adjectives. These changes have been interpreted as simplifications of the Romance languages.
Heh, TIL, thanks. Obliquely, I was in Venice some years ago; sitting on the steps of a church I set to rolling a cigarette. A couple of small boys stopped and stared at this activity, one pointed and said "Il fabricato fumer!", I knew exactly what he was saying (although I have no Italian). So Venetian it is.
I think that French diplomat just saw their shot and took it. I doubt they actually forgot that there are still two English-speaking countries in the EU.
I, however, don't think most of the people who started using one of the named languages instead of their mother tongue ever really selected English using that specific criterion.
You're saying this like it was some deliberately hostile, colonial move to impose ASCII on the world. But I don't think it was quite like that, more that in the beginnings of computing people designed and built things for themselves. And it just happened to be that a lot of that early work happened in the anglosphere.
I honestly think it has more to do with culture. I've never been to the US so this might be completely wrong, but my observations from talking to people and just observing:
- if you move to the US and have a name made up of non-ASCII chars, you are more likely to either drop them/substitute them with ASCII chars, use the Anglicized version of your name if it exists, or adopt an English name. And then it's kinda easy to legally change your name. Or screw it, it's kinda easy to just show up and tell them you're Johnny Awesome, and then you're Johnny Awesome.
- if you move to Germany, you can't legally change your name at all without good reason. Every document ever, no matter how informal (especially at school), will probably have your full name - maybe, hopefully, just "First Last" and not all 7 of them. Everyone of authority will refuse to call you Johnny Awesome if your name is actually Johnathan Jean-Pierre Awesome-Livingston, and so on. Oh, and they will also fail to not butcher your name if it's not so easy a 4-year-old can learn it.
We can't be the only ones leaning more towards #2. And no, I'm not making this up; my go-to example is that I've seen cases where even officially calling "William Gates" "Bill Gates" has met resistance. Your name is your name. And I'm still not sure how people in the spotlight manage to be called Dick - I'm not joking.
Try living in an Asian country - you will probably have to choose a name in the local script which, at best, vaguely sounds like your given name. It's expected that if you use someone else's playground, you adapt to their rules - that goes for moving to a foreign country and for using technology primarily developed in one.
> expect the world to bow to them (culturally, technologically, etc).
I mean, I don't expect you to bow to me.
But at the same time, the software I produce at work is usually entirely consumed by Americans who speak English (ok, well, there's one Canadian customer that I'm aware of). Because that's who pays for it, and none of those customers is particularly looking to pay for translation.
And the software I produce during my off hours is generally meant for me and my friends to consume. I'll put that on github/gitlab/source hut and you can use it if you want, but I definitely don't have the budget for translation either.
China has its own problems. There are obscure family names out there consisting of characters that aren't officially recognized, so computers can't process their actual family names. So those people instead pick the closest alternative officially recognized character instead, purely for the purpose of official documents and appeasing computer systems.
I think in premodern times the Chinese character set was not as centrally regulated as it is now, so there should be quite a few instances of independent/local character invention.
Many Chinese characters consist of combinations of other characters. Most common is a combination of two, where one component suggests the meaning while the other hints at the pronunciation.
This shows that new characters can be created not by inventing new strokes but by combining existing characters to convey a new meaning, much like we occasionally create new words in English by combining existing ones (even though that process is not productive in English, unlike in e.g. German, where it is quite normal). The difference is that these new characters have only one syllable.
With digitalization, the creation of new characters essentially ends. The creation of the simplified Chinese character system also pushes against creating new, more complicated characters.
It is going to be interesting to see how that will affect language development. New "words" can still be created by using a sequence of characters, but that means each character keeps its syllable sound, whereas a new compound character would have a single syllable. So if a new meaning emerges for a syllable, a new character can't be created for it. Will this prevent new single-syllable words? Or will it lead to multiple characters being pronounced as a single syllable?
Do Chinese characters always have the same pronunciation? In Japanese at least, kanji (which are derived from Chinese characters) are often read in entirely different ways in different contexts. For example, 二人 is read as "futari" (two people), but 二 alone is read as "ni" and 人 alone is read as "hito".
Mostly yes. In Mandarin, tone can be a bit different depending on context but overall pronunciation doesn't differ that much.
But a major caveat is that pronunciation can be wildly different when spoken with other dialects. Mandarin and Cantonese reading of the same text, even with same meaning, sound entirely different.
That's a good question. I know there are many characters that share one pronunciation, but I have not come across the reverse. There are different pronunciations in different dialects/languages, of course, and maybe some of those get adopted by other dialects (that would make sense for food names, for example), but I didn't study Chinese, so I really don't know.
You haven't been in China long enough. I have had a few situations where the system was unable to write my name in Latin characters. I even had to get a notarized transliteration of my name into Chinese so that the resulting Chinese version could be used on some official documents.
But it's especially stereotypical for mainly the US, China and, I think, Japan.
Those are all countries where, due to various reasons (size, culture/nationalism), a lot of the people making technical decisions (1) only speak the country's language and (2) have little interaction with very different cultures.
(In the US it's complicated. There is a lot of mixing of other cultures, but in a very US-specific way, with a lot of unaware cultural appropriation - and I don't mean this in the "bad/evil" way the term is often used today, but in the culturally normal way. And the US is so large that there is little reason to travel to a country which is very different, and even when people do, it's often in a very touristy form. This leads to situations where, e.g., US citizens claim they are Spanish because of some ancestors and claim to practice Spanish culture, but are completely clueless about actual Spain even after having traveled there once or twice. Contrast this with the EU, where, e.g., spending a study semester in another country with a completely different language and culture isn't rare, and non-touristy holiday trips to other countries are common too - in some cases it's just a few hours by car. So it's very easy to end up with people who just don't know better.)
In the early 90's me and a couple of other French guys had taken to writing French unaccented, so that we wouldn't suffer the daily pain of character set problems - we really bashed our own heads to fit into lower ASCII !
Same here in Poland. I got used to skipping the diacritics, and quickly learned to mentally decode text that was rendered with the wrong code page.
And maybe this is a form of Stockholm syndrome, but to this day, I don't really mind sticking to lower ASCII - it just makes things easier (or at least until recently it did), and I don't really care about that '³' in my surname. Sorry, I meant 'ł'.
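That '³' is a concrete code-page artifact, if anyone wonders: ł is byte 0xB3 in ISO 8859-2 (Latin-2, used for Polish), and the same byte is '³' in ISO 8859-1/Windows-1252. In Python terms:

    text = "ł".encode("iso-8859-2")    # b'\xb3' under the Polish code page
    print(text.decode("iso-8859-1"))   # '³' - what a Latin-1 system displays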
Possibly related to that:
- I always stick to US English language when using software, even if it offers Polish, because I don't trust translations. They're usually done by people who don't have enough context and knowledge to do it right. I've been burned too many times by this. Plus, localized error messages hinder searching for solutions.
- I wouldn't mind if everyone switched to English[0] as first language and called it a day; there would be tremendous economic and social benefits from that to everyone, far outweighing the loss of a little bit of cultural variety/noise.
- I'm strongly in favor of meeting machines half-way. LLMs aside, it's trivial for people to learn a small controlled vocabulary here and there (like "OR" and "AND" and quotes in search queries; see the sketch below), which makes interfaces vastly more predictable, reliable and comprehensible.
--
[0] - Or French, or Chinese, or Swahili, none of which I know - to stave off the usual replies of the "you only want it because you already know English" kind.
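To illustrate the controlled-vocabulary point with a toy sketch (hypothetical tokenize helper, nothing from a real search engine):

    import re

    # The whole query "language": quoted phrases, AND, OR, bare words.
    def tokenize(query: str):
        return re.findall(r'"[^"]*"|\bAND\b|\bOR\b|\S+', query)

    print(tokenize('unicode AND "code page" OR ebcdic'))
    # ['unicode', 'AND', '"code page"', 'OR', 'ebcdic']

That's the entire vocabulary a user has to learn, and in exchange the interface becomes deterministic.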
> I always stick to US English language when using software
Of course, the computer's native language is English - anything else would be silly. This may be a generational meme: young people have French-language environments - even most of the computing professionals... But I keep my habit of mixed locales, with metric measurements, English-language UI, and mixed ISO 8601 and French dates. God bless UTF-8, though!
I really, really hate how Amazon tries to force US customary units (inches, etc.) on me in automatic translation just because I set the language to English.
I guess Polish people had it worse than Romanians and Hungarians. All our accents are simple to strip away and they still kinda sound like the base letter.
But I agree with you: I don't trust translations and I think the benefits from humans using a single language would be amazing.
Same for me. I am Italian but I always used computers with English localization because translation was often bad, especially IBM translations.
So I got used to using US keyboards too, and to never using accented letters even though they are frequent in Italian; e.g., instead of writing "Mario è alto" (Mario is tall) I would write "Mario e' alto". It helped a lot that I worked for almost two decades for companies having only non-Italian clients, where all communications, internal and external, were in English.
Now that I am working for an Italian company with Italian clients, I am slowly getting used to Italian keyboards and accented letters.
The funny thing to me is how these unofficial transcription rules differ from country to country. It seems most people with an ö or ü from a Turkish name are happy to drop it to o and u (cf. Mesut Özil), but Germans are absolutely not. (Not saying this is a rule, just what I've observed.)
German has an official and fully information-preserving transcription to basic Latin (which is really what all this talk about "anglo letters" is about: just the basic Latin letters in common use in the early modern era, with no diacritics at all), which can be used in official documents, too. Other languages, like Latinized Turkish, obviously copied the diacritics but seem to have left out the transcription rules, most probably because they borrowed the letters long after this history was relevant.
For German these rules are official. They're used in the machine-readable part of ID cards and passports. If you start at a German company and they set up an e-mail address for you, they'll use these rules too. The origin of the letters ä, ö and ü in German is ae, oe and ue; people just put the e on top of the vowel, and those slowly transformed into what we use today. You can still see this in some names, like Goethe.
I know that, but it's not what I meant, but probably didn't make it clear enough.
Germans know these official rules, and maybe linguists do, but if you present the typical English speaker with Möller vs. Moeller, it's confusing. Look at the media (who could, maybe, do some research?), who write Jurgen and not Jürgen. That's my point: the official rules don't help if everyone ignores them, for whatever reason.
In certain situations, like crossword puzzles, it was usual not to use umlauts but to write oe, ue and ae instead. In Swiss German, people don't use ß.
So transcription was always a thing people needed to know.
> Or French, or Chinese, or Swahili, neither of which I know - to stave off the usual replies of the "you only want it because you already know English" kind.
Yes and no. French sure, but in my experience of trying to learn Japanese, the difficulty is insane, much higher compared to the "western" languages.
In German there is an official convention for the non-US-ASCII letters; AFAIK it predates computers and is rooted in Germanic dialects developing differently, and later in things like non-German-specific typewriters, printing presses, etc.
ä => ae
ö => oe
ü => ue
ß => ss
But one gotcha is that it's a fuzzy one-way trip: some words, especially city names, can have ae, oe or ue in their correct native spelling. Worse, ss is a normal building block in German, pronounced differently from ß, so writing ß as ss is quite confusing for anyone who doesn't happen to know that the correct spelling is with ß. To top it off, some cases of ß spelling have officially changed to ss spelling over time, due to people pronouncing them more like ss and getting it wrong all the time. And to some degree, ß is semi-officially abandoned by now.
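As a rough sketch of the forward direction (the reverse is the fuzzy part):

    # Official German fallback spelling; one-way only, since e.g. "Goethe"
    # already contains a native "oe" that must not be turned back into "ö".
    DE_MAP = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
              "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

    def de_fallback(text: str) -> str:
        return "".join(DE_MAP.get(ch, ch) for ch in text)

    print(de_fallback("Müßiggang"))   # Muessiggang
    print(de_fallback("Goethe"))      # Goethe - unchanged, and ambiguous in reverse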
The letters ß and ss are pronounced exactly the same, a hard 's' sound. Their effect on the preceding sounds is slightly different, though -- 'ss' makes the vowel short, 'ß' keeps it long. In all, a rather minute difference.
> if a lot of early IT wouldn't have been dominated by a US "works for us must work for everyone" approach I think we never would have ended up with such limitations common in legacy systems
Most of the early work was in English and the people buying these systems all understood English. Nobody back then had a problem with a lingua franca for aspects of tech because there were still people around who had to learn German to study science.