I have the strange impression that I spent way too much time fixing charset-related problems in Sylpheed-Claws.
First, the badly encoded mail headers with raw 8-bit characters in them, where all I can do to fix strings that are undisplayable as UTF-8 (GTK2's internal charset) is:

  • guess the original encoding (in practice, see if the string converts cleanly from the user’s locale charset to UTF-8),
  • or parse the whole string for undisplayable chars and replace them with an underscore (yay for speed).

Then, the conversions needed for local files and directories. This one is ugly, and the best thing to do is: never use extended chars in your folder names. :-/

After that there are the XML files, which happily claimed to be encoded in UTF-8, whereas they were actually written in the user’s locale charset, and then read back as UTF-8 (with a fallback to the user’s charset to patch up the obvious errors this caused). Fixing the XML files so they are written as declared, obviously, broke people’s addressbooks and so on.

And of course the irrelevancy of old libc functions, like ispunct(), which happily returns true when handed the individual bytes of a UTF-8 encoded extended char. This particular bug made me realise that ispunct does not do what it claims to do:

    checks for any printable character which is not a space or an alphanumeric character.

And there I was, thinking ispunct() would test whether the character was actually a punctuation char.

But why, why, didn’t our “ancestors” think about such things and begin by creating sensible standards showing at least a bit of forward thinking?

(And one day I’ll rant about the whole \r, \n, \r\n mess, too.)