
You're thinking from the point of view of a display program that needs to split strings into glyphs. That's a fine application, but very rare. And yes, it's inherently encoding dependent and tends to like wide characters instead of multibyte ones.

But introducing a BOM for the sake of that application is a disaster, because it hurts everything else. You can (literally) feed UTF-8 to parsers written 30 years ago and apply all your existing intuition about string handling in C without worry. Unless, that is, you deliberately break it by including a binary, non-encoding garbage furball at the front of your "file" (and good luck figuring out what a "file" should mean in an OS metaphor designed around streams).
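To make that concrete, here is a rough sketch in Python rather than C. The `split_fields` function is a hypothetical stand-in for an old byte-oriented parser, not anyone's real code. It only ever looks for the ASCII comma byte, and it still splits UTF-8 text correctly, because every byte of a multibyte UTF-8 sequence is 0x80 or above. Prepend a BOM and the first field silently picks up three extra bytes.

```python
# Hypothetical byte-oriented splitter, standing in for an ASCII-era parser.
# It never decodes anything; it only looks for the ASCII comma byte.
def split_fields(line: bytes) -> list[bytes]:
    return line.rstrip(b"\r\n").split(b",")


UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8

plain = "name,città,東京\n".encode("utf-8")
with_bom = UTF8_BOM + plain

# Bytes of multibyte UTF-8 sequences are all >= 0x80, so none of them can be
# mistaken for the comma (0x2C) and the fields come out intact.
print([field.decode("utf-8") for field in split_fields(plain)])

# With a BOM in front, the first field is no longer the bytes b"name".
print(split_fields(with_bom)[0])
```

A lookup of the first field against a table keyed on `b"name"` would succeed for the first input and fail for the second, without any error pointing at the cause.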



If a file really is pure ASCII, leave it that way. I am not suggesting otherwise. If a 30-year-old program only deals with ASCII, then make sure your input looks like ASCII.

But if your input could contain complex UTF-8 (e.g. it's multi-language or whatever), you're not doing anyone any favors by hiding this fact. The BOM is a quick way to know exactly what the file is, and it shows you that your program won't work with that input. So you translate the input or you fix the program.
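For what it's worth, the check itself is about three bytes long. A minimal sketch in Python, where the helper names `starts_with_utf8_bom` and `decode_utf8` are made up for illustration, while `codecs.BOM_UTF8` and the `utf-8-sig` codec are standard library:

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    # codecs.BOM_UTF8 is the three bytes EF BB BF.
    return data.startswith(codecs.BOM_UTF8)


def decode_utf8(data: bytes) -> str:
    # "utf-8-sig" drops one leading BOM if present and otherwise
    # decodes exactly like plain "utf-8".
    return data.decode("utf-8-sig")


marked = codecs.BOM_UTF8 + "東京".encode("utf-8")
unmarked = "東京".encode("utf-8")

print(starts_with_utf8_bom(marked), starts_with_utf8_bom(unmarked))
print(decode_utf8(marked) == decode_utf8(unmarked))
```

Note that a missing BOM tells you nothing either way: the unmarked input above is just as much UTF-8 as the marked one.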

At some point in the future the majority of programs will handle even complex UTF-8 properly, and then the BOM will be pointless because virtually all inputs will be UTF-8.



