So, in terms of acing interviews, increasingly one of the best answers to the question "Write some code that reverses a string" is that in a world of unicode, "reversing a string" is no longer possible or meaningful.
You'll probably be told "oh, assume US ASCII" or something, but in the meantime, if you can back that up when they dig into it, you'll look really smart.
I'd go further and argue that in general reversing a string isn't possible or meaningful.
It's just not a thing people do, so it's just... not very interesting to argue about what the 'correct' way to do it is.
Similarly, any argument over whether a string has n characters or n+1 characters in it is almost entirely meaningless and uninteresting for real world string processing problems. Allow me to let you into a secret:
there's never really such a thing as a 'character limit'
There might be a 'printable character width' limit; or there might be a 'number of bytes of storage' limit. Which means interesting questions about a string include things like 'how wide is it when displayed in this font?' or 'how many bytes does it take to store or transmit it?'... But there's rarely any point where, for a general string, it is really interesting to know 'how many characters does the string contain?'
Processing direct user text input is the only situation where you really need a rich notion of 'character', because you need to have a clear sense of what will happen if the user moves a cursor using a left or right arrow, and for exactly what will be deleted when a user hits backspace, or copied/cut and pasted when they operate on a selection. The ij ligature might be a single glyph, but is it a single character? When does it matter? Probably not at all unless you're trying to decide whether to let a user put a cursor in the middle of it or not.
And next to that, I just feel to argue that there is such a thing as a 'correct' way to reverse "Rijndæl" according to a strict reading of Unicode glyph composability rules seems like a supremely silly thing to try to do.
I'd much rather, when asked to reverse a string, more developers simply said 'that doesn't make sense, you can't arbitrarily chunk up a string and reassemble it in a different order and expect any good to come of it'.
Boy, that's implicitly a good question... when's the last time I "reversed" a string, on purpose, for something useful?
It took me a bit, but I think I have an answer. It's about 15 years ago. I didn't actually do the original design, but I perpetuated it and didn't remove it. We reversed domain name strings (which, given that they are a subset of ASCII, actually is a well-defined operation) so that the DB we're using, which supported efficient prefix lookups but not suffix lookups, could be used to efficiently query for all subdomains of a given domain, by reversing the domain and using that as the prefix.
I mean this as strong support for your point, not a contradictory "gotcha". I'm a big believer in not doing lots of work to save effort or make correct something you do less than once a decade, e.g., http://www.jerf.org/iri/post/2954 . And it's not even a gotcha anyhow, because we aren't reversing a general string; we were reversing a string very tightly constrained to a subset of ASCII where the operation was fully well-defined. I can't think of when I ever reversed a general string.
Right - any case where you are reversing a string as part of some other operation you will have some goal in mind that is not simply 'produce the reverse of any arbitrary string'. Even if your goal is doing something like printing the crossword puzzle answers backwards at the bottom of the page, you have a tightly constrained set of possible characters so you can literally just throw an error if someone asks you to reverse a string containing a flag.
I actually should admit, for all my protesting above that you never need to do this, I did once actually implement something that "required", as part of the process, reversing a string. It should be apparent once I share what it was why I put scare-quotes around "required" though.
We wanted to test and demonstrate the localization and unicode-readiness capabilities of our software, and to verify that every UI string was actually coming from the resource file for the selected locale, and handled in a unicode-safe way.
So I implemented a program that took in the en-GB resource file, and outputted an en-AU one that contained all the original strings, just flipped upside down. This being, of course, the canonical way to localize a product for Australia.
And to turn a string upside down, you need to reverse the order of the characters, before mapping them to their unicode upside-down equivalent.
Unfortunately, the Unicode consortium do not make available a comprehensive database of which glyphs are 180º reversals of other glyphs, so my solution ended up not having comprehensive coverage of all unicode codepoints, but since my source data was en-US text that wasn't that important; what was more important was that some of the resource strings used a 'safe subset' of HTML so I needed to not turn <strong> into <ƃuoɹʇs>.
More than anything, it was probably that experience that gave me a true appreciation for what nonsense it is to try to break a string into characters and manipulate them.
(Also, while I do love the ingenuity of string reversal for suffix-based indexing, reversing a domain name for efficient prefix-based lookup can of course also be done by breaking the name up into subcomponents (thus not requiring you to care about character composition at all between dots), reversing the sequence of those parts and reassembling the string from the components in reverse order - which has the added benefit of preserving human readability of the domain name, and a natural sort order...)
"reversing the sequence of those parts and reassembling the string from the components in reverse order"
Given that this was Perl and that's a small chunk of code, it's probably what I would have done in the same circumstance, but given that it already existed it wasn't worth shipping a migration out to the field with a new version. Generally humans didn't consult this table anyhow.
But it was good for a couple of good "wtf is that" faces from other developers the first time they look at the DB, if nothing else. They get it pretty quickly; the preponderance of "moc." and "ude." gets to be a dead giveaway pretty quickly, especially combined with some popular names ("moc.elgoog" almost sounds like a real domain Google might register someday). But still fun if you catch their face at the right moment.
Reversing a string is still meaningful. Take a step back outside the implementation and imagine handing a Unicode string to a human. They could without any knowledge look at the characters they see and produce the correct string reversal.
There is a solution to this which is to compute the list of grapheme clusters, and reverse that.
> imagine handing a Unicode string to a human. They could without any knowledge look at the characters they see and produce the correct string reversal.
I really highly doubt it.
How do you reverse this?: مرحبًا ، هذه سلسلة.
Can you do it without any knowledge about whether what looks like one character is actually a special case joiner between two adjacent codepoints that only happens in one direction? Can you do it without knowing that this string appears wrongly in the HN textbbox due to an apparent RTL issue?
It's just not well-defined to reverse a string, and the reason we say it's not meaningful is that no User Story ever starts "as a visitor to this website I want to be able to see this string in opposite order, no not just that all the bytes are reversed, but you know what I mean."
You can even demonstrate a similar concept with English and Latin characters. There is no single thing called a "grapheme" linguistically. There are actually two different types of graphemes. The character sequence "sh" in English is a single referential grapheme but two analogical graphemes. Depending on what the specification means, "short" could be reversed as either "trosh" or "trohs". That's without getting into transliteration. The word for Cherokee in the Cherokee language is "Tsalagi" but the "ts" is a Latin transliteration of a single Cherokee character. Should we count that as one grapheme or two?
Of course, if an interviewer is really asking you how to do this, they're probably either 1) working in bioinformatics, in which case there are exactly four ASCII characters they really care about and the problem is well-defined, or 2) it's implementing something like rev | cut -d '-' -f1 | rev to get rid of the last field and it doesn't matter how you implement "rev" just so long as it works exactly the same in reverse and you can always recover the original string.
The fact that how to reverse a piece of text is locale dependent doesn't mean it's impossible. Basically and transformation on text will be locale dependent. Hell, length is locale dependent.
>what looks like one character is actually a special case joiner between two adjacent codepoints
Are you referring to a grouping not covered by the definition of grapheme clusters (which I am only passingly familiar with)? If so, then I don't think it's any more non-meaningful to reverse it than to reverse an English string. The result is gibberish to humans either way - it sounds more like you're saying that there is no universally "meaningful to humans" way to reverse some text in potentially any language, which is true regardless of what encoding or written language you're using. I was thinking of it more from the programmer side - i.e. that Unicode provides ways to reverse strings that are more "meaningful" (as opposed to arbitrary) than e.g. just reversing code points.
I mean no but only because I don’t understand the characters. Someone who reads Arabic (I assume based on the shape) would have no trouble. You’re nitpicking cases where for some readers visual characters might be hard to distinguish but it doesn’t change the fact that there exists a correct answer for every piece of text that will be obvious to readers of that text which is the definition of a grapheme cluster.
> the fact that there exists a correct answer for every piece of text that will be obvious to readers of that text which is the definition of a grapheme cluster.
No, I insist there is not a single "correct answer," even if a reader has perfect knowledge of the language(s) involved. Now remember, this is already moving the goalposts, since it was claimed that a human needed "no knowledge" to get to this allegedly "correct answer."
You already admit that people who don't speak Arabic will have trouble finding the "grapheme clusters," but even two people who speak Arabic may do your clustering or not, depending on some implicit feeling of "the right way to do it" vs taking the question literally and pasting the smallest highlight-able selection of the string in reverse at a time.
Anyway, take a string like this: "here is some Arabic text: <RLM> <Arabic codepoints> <LRM> And back to English"
Whether you discard the ordering mark[0], keep them, or inverse them is an implementation decision that already produces three completely different strings. Unless we want to write a rulebook for the right way to reverse a string, it remains an impossibility to declare anything the correct answer, and because there is no reason to reverse such a string outside of contrived interview questions and ivory tower debates, it is also meaningless.
You added the requirement that it be a single correct answer. I just asserted that there existed a correct answer. You're being woefully pedantic -- a human who can read the text presented to them but no knowledge of unicode was my intended meaning. Grapheme clusters are language dependent and chosen for readers of languages that use the characters involved. There's no implicit feeling, this is what the standards body has decided is the "right way to do it." If you want to use different grapheme clusters because you think the Unicode people are wrong then fine, use those. You can still reverse the string.
Like what are you even arguing? You declared that something was impossible and then ended with that it's not only possible but it's so possible that there are many reasonable correct answers. Pick one and call it a day.
It is impossible to "correctly reverse a string" because "reverse a string" is not well defined. We explored many different potential definitions of it, to show that there is no meaningful singular answer.
> You added the requirement that it be a single correct answer.
Your original post says "they could produce the correct string reversal"?
UAX #29 is insufficient: at the very least, you must depend on collation too.
In Norwegian, “æ” is a letter, so I believe (as a non-speaker) that they would reverse “blåbærene” to “eneræbålb”; but in English, it’s a ligature representing the diphthong “ae”, and if asked to reverse “æsthetic” I would certainly write “citehtsea” and consider “citehtsæ” to be wrong. (And I enjoy writing the ligature; I fairly consistently write and type æsthetic rather than aesthetic, though I only write encyclopædia instead of encyclopaedia when I’m in a particular sort of mood.)
In Dutch, the digraph “ij” is sometimes considered a ligature and sometimes a letter; as a non-speaker, I don’t know whether natives would say that it should be treated as an atom in reversing or not.
And not all languages will have the concept of reversing even letters, let alone other things. Face it: in English we have the concept of reversing things, but it just doesn’t work the same way in other languages. Sure, UAX #29 defines something that happens to be a good heuristic for reversing, but it doesn’t define reversing, and in the grand scheme of things reversing grapheme-wise is still Wrong. “Reversing a string” is just not a globally meaningful concept.
Another person here has cited Cherokee transliteration, where one extended grapheme cluster turns into multiple English letters. You can apply this to translation in general, but also even keep it inside English and ask: what are we reversing? Letters? Phonemes? Syllables? Words? There are plenty of possibilities which are used in different contexts (and it’s mostly in puzzles, frankly, not general day-to-day life).
The concept of grapheme clusters is acknowledged as approximate. Collations are acknowledged as approximate. Reversing would be even more approximate.
Interesting addendum on the matter of reversing “æsthetic”: I asked my parents what they reckoned, and they both went the other way, reckoning that if you wrote æ in the initial word it should stay æ; my dad said he wouldn’t write it that way in the first place, but that if you’ve written it that way you were treating æ as a letter more than just a way of drawing a certain pair of letters. In declaring otherwise, I was using the linguistic approach, which acknowledges æ as a ligature of the ae diphthong from Latin, being purely stylistic and not semantic. And so we see still more how these things are approximations and subjective.
Keep it first? Like that’s not a gotcha. Your input is a string and the output is that string visually reversed. What it looks like in memory is irrelevant.
The use case: Make an animation, where the text appears starting from the end - for example if you stylize it as vertical text falling from the top.
I think the example shows quite clearly the problem really comes down to dividing the string to logical parts (atoms, grapheme clusters, however you want to call it).
UTF-8 reverse string has been a thing for a long time in most/all programming languages. It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible.
"It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible."
It depends on your point of view. From a strict point of view, it does exactly mean it is no longer possible. By contrast, we all 100% knew what reversing an ASCII string meant, with no ambiguity.
It also depends on the version of Unicode you are using, and oh by the way, unicode strings do not come annotated with the version they are in. Since it's supposed to be backwards compatible hopefully the latest works, but I'd be unsurprised if someone can name something whose correct reversal depends on the version of Unicode. And, if not now, then in some later not-yet-existing pair of Unicode standards.
I always thought it was interesting that ASCII is transparently just a bunch of control codes for a typewriter (where "strike an 'a'" is a mechanical instruction no different from "reset the carriage position"), but when we wanted to represent symbolic data we copied it and included all of the nonsensical mechanical instructions.
> It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible.
I don't understand why in maths finding one single counter-example is enough to disprove a theorem yet in programming people seem to be happy with 99.x % of success rate. To me, "It may not work perfectly in 100% of the cases" exactly means "no longer possible" as "possible" used to imply that it would work consistently, 100% of the time.
> "reversing a string" is no longer possible or meaningful.
If you really wanted to, you could write a string reversal algorithm that treated two-character emojis as an indivisible element of the string and preserved its order (just as you'd need to preserve the order of the bytes in a single multi-byte UTF-8 character). You'd just need to carefully specify what you mean by the terms "string", "character" and "reverse" in a way that includes ordered, multi-character sequences like flag emojis.
I would argue that it is possible and meaningful. AFAIK extended grapheme clusters are well defined by the standard, and are very well suited to the default meaning of when somebody says "character", so, given no other information, it's reasonable to reverse a string based on them. I guess the issue is "reverse a string" lacks details, but I think that's different from "not meaningful".
You'll probably be told "oh, assume US ASCII" or something, but in the meantime, if you can back that up when they dig into it, you'll look really smart.