Monday, March 28, 2011

Unix vs. Windows rendering of characters

I have a text file that displays differently when opened in FreeBSD vs. Windows.

On FreeBSD: An·lisis e InvestigaciÛn

On Windows: Análisis e Investigación

The Windows representation is obviously right. Any ideas on how to get that result in FreeBSD?

From stackoverflow
  • How is the file encoded? I would try re-encoding the file as UTF-16.

  • This is not pure ASCII. It's UTF-8. Try a FreeBSD editor with UTF-8 support, or change your locale.

  • The problem is that it's not ASCII, but UTF-8. You have to use another editor that detects the encoding correctly, or convert the file to something your editor on FreeBSD understands.

    Jacek Ławrynowicz : "it's probably ISO-8859-1". 'á' is displayed as '·', so it must be a multibyte encoding (UTF-8 or UTF-16).
    Ant P. : It's definitely UTF-8. An easy way to tell is that those funny-looking accented A's will show up when encoding characters just outside the first 128 in Unicode.
    Georg : Right, sorry. Didn't see that.
  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

  • From the way the characters are being displayed, I would say that file is UTF-8 encoded Unicode. Windows is recognising this, and displaying the 'á' and 'ó' characters correctly, while FreeBSD is assuming it's ISO-8859-1, which results in each of these characters being displayed as 2 separate characters (because UTF-8 encodes them in 2 bytes). You'll have to tell FreeBSD that it is a UTF-8 file, somehow.

  • So after doing a bit more digging: if I 1) open the CSV file in Excel on a Mac and export it as a CSV file, and 2) then open it in TextMate, copy the text, and save it again, it works.

    The result of: file file.csv is

    UTF-8 Unicode English text, with very long lines

    The original is:

    Non-ISO extended-ASCII English text, with very long lines

    This workaround isn't really suitable as this process is supposed to be automated, thanks for the help so far.

  • It doesn't matter which operating system you're using when you open the file. What matters is the application you use to open it. On Windows you're probably using Notepad, which automatically identifies the encoding as UTF-8.

    The app you're using on FreeBSD obviously isn't doing that. Maybe it just can't read UTF-8 and you need to use a different app. Or maybe you just have to tell it which encoding to use. Automatic detection of character encodings is far from universal (and much farther from perfect).
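
The encoding guesses in the answers above can be checked mechanically: encode the correct text in one candidate encoding, decode the bytes as another, and see which combinations reproduce the garbled rendering from the question. The following is only a rough Python sketch. The function names `misread` and `find_matches` and the list of candidate encodings are assumptions made for illustration, not something taken from the thread.

```python
def misread(text, actual, assumed):
    """Encode text as `actual`, then decode the same bytes as `assumed`."""
    return text.encode(actual).decode(assumed, errors="replace")


def find_matches(original, observed, encodings):
    """Return (actual, assumed) pairs that turn `original` into `observed`."""
    matches = []
    for actual in encodings:
        for assumed in encodings:
            try:
                if misread(original, actual, assumed) == observed:
                    matches.append((actual, assumed))
            except UnicodeEncodeError:
                # `original` cannot be represented in `actual` at all
                continue
    return matches


# Candidate list is an assumption; extend it with whatever is plausible.
CANDIDATES = ["utf-8", "utf-16", "latin-1", "cp1252", "mac_roman"]

if __name__ == "__main__":
    right = "Análisis e Investigación"   # as shown on Windows
    wrong = "An·lisis e InvestigaciÛn"   # as shown on FreeBSD
    for actual, assumed in find_matches(right, wrong, CANDIDATES):
        print(f"stored as {actual}, displayed as {assumed}")
```

For reference, `misread("á", "utf-8", "latin-1")` yields the two-character sequence `Ã¡`, which is the "2 separate characters" effect described above. Any pair the script reports is merely consistent with the two strings in the question; it does not prove how the file was actually produced.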

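Since the process is supposed to be automated, the Excel and TextMate round trip could in principle be replaced by a script that re-encodes the file as UTF-8. Below is a minimal Python sketch. It assumes that any input which is not valid UTF-8 is Windows-1252; that is a guess, not something the thread establishes, so the `fallback` argument should be changed to match the real source. The name `convert_to_utf8` is hypothetical.

```python
from pathlib import Path


def convert_to_utf8(src, dst, fallback="cp1252"):
    """Rewrite `src` as UTF-8 at `dst`; return the encoding used to read it."""
    raw = Path(src).read_bytes()
    try:
        # Strict decoding: raises on any byte sequence that is not valid UTF-8.
        text = raw.decode("utf-8")
        used = "utf-8"
    except UnicodeDecodeError:
        # Assumed legacy encoding; undecodable bytes become U+FFFD.
        text = raw.decode(fallback, errors="replace")
        used = fallback
    # Write bytes directly so the original line endings are left untouched.
    Path(dst).write_bytes(text.encode("utf-8"))
    return used
```

A call such as `convert_to_utf8("file.csv", "file.utf8.csv")` leaves the input untouched and writes a new file, after which `file` can be run again to confirm the result.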