Debugging Unicode Problems
This page describes what to do in a very specific situation. Namely, you've
got some character data in one place (typically a database) which has to go
through various steps and then ends up being shown to the user (often on a web
page). Unfortunately, some characters aren't being displayed correctly. Due to
the many steps involved, the problem can occur in various places. This page aims
to help you find out what's wrong simply and reliably.
Step 1: Understand the basics of Unicode
If you feel comfortable with Unicode, character encodings, etc., feel free to
skip this step. Basically, you need to know a little bit about what characters
are and what conversions are likely to be applied to them before going much
further. See my article on the subject (and the articles it references) for
more information.
Step 2: Try to identify the possible conversions involved
If you can work out every place where things might be going wrong, it's much
easier to isolate the one that actually is. Also bear in mind not just how you're
retrieving the data, but how the data got there in the first place. (Some
problems I've seen have been due to an old application writing to and reading
from the database in an incorrect way, but the bugs cancelling each other out.
No problems occur when it's just this broken application which accesses the
database, but things go wrong when anything else does.) Steps involved may well
include fetching the data from the database, reading it from a file, sending it
across a web connection, or displaying it on the screen.
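To make the "bugs cancelling each other out" scenario concrete, here is a minimal sketch of one way it can happen. The choice of encodings (UTF-8 and ISO-8859-1) and the names EncodingMismatchDemo, BrokenWrite and BrokenRead are assumptions made purely for illustration; a real application may be broken in a different way.

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    // ISO-8859-1 maps each byte 0x00-0xff to the character with the same value.
    static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Hypothetical broken writer: encodes the text as UTF-8, then treats each
    // resulting byte as if it were a separate ISO-8859-1 character.
    public static string BrokenWrite(string text)
    {
        return Latin1.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Hypothetical broken reader: makes the opposite mistake, which happens
    // to undo what BrokenWrite did.
    public static string BrokenRead(string stored)
    {
        return Encoding.UTF8.GetString(Latin1.GetBytes(stored));
    }

    // Returns the UTF-16 code units of the string as hex, e.g. "00c3 00a9".
    public static string Dump(string value)
    {
        StringBuilder builder = new StringBuilder();
        foreach (char c in value)
        {
            builder.AppendFormat("{0:x4} ", (int)c);
        }
        return builder.ToString().TrimEnd();
    }
}
```

With these assumptions, BrokenWrite("\u00e9") stores the two characters 00c3 00a9 instead of the single character 00e9. BrokenRead turns them back into 00e9, so the broken application never notices, but anything else reading the stored value sees two wrong characters.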
Step 3: Verify the data at each step
The first lesson here is not to trust anything which tries to log the
character data as a sequence of glyphs. Instead, you should log the character
data as a sequence of Unicode values (integers). For instance, if I had a string
containing the word "hello", I would display it as "0068 0065 006c 006c 006f".
(Using hex makes it easier to check values against the Unicode code charts
later.) To achieve this, step through each character in the string and display
the character however you would display an integer. For instance, here is a
method to dump all the characters in a string to the console:
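// Writes each char of the string as four hex digits. A C# char is a
// UTF-16 code unit, so characters outside the Basic Multilingual Plane
// appear as two values (a surrogate pair).
static void DumpString (string value)
{
    foreach (char c in value)
    {
        Console.Write ("{0:x4} ", (int)c);
    }
    Console.WriteLine();
}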
Depending on your exact environment, your method of logging will vary, but
using something like the above should give you what you need.
The reason for doing this is that it gets rid of problems with fonts, other
encoding issues, etc. If you can't log even plain ASCII hex digits properly,
you're in a world of trouble anyway - but you may well not be able to log
Unicode in a reliable way, and as you already know you've got some
problems on the Unicode front, it's worth being safe.
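Where there is no console to write to (in a web application, say), a variant that returns the dump as a string can be handed to whatever logging mechanism is in use. The following is only a sketch: the class name UnicodeDump and its ToHex methods are illustrative, and the byte-array overload is an extra for data which has not yet been decoded into a string.

```csharp
using System;
using System.Text;

public static class UnicodeDump
{
    // Returns the UTF-16 code units of the string as space-separated
    // four-digit hex values.
    public static string ToHex(string value)
    {
        StringBuilder builder = new StringBuilder(value.Length * 5);
        foreach (char c in value)
        {
            if (builder.Length > 0) builder.Append(' ');
            builder.Append(((int)c).ToString("x4"));
        }
        return builder.ToString();
    }

    // Returns raw bytes as space-separated two-digit hex values.
    public static string ToHex(byte[] data)
    {
        StringBuilder builder = new StringBuilder(data.Length * 3);
        foreach (byte b in data)
        {
            if (builder.Length > 0) builder.Append(' ');
            builder.Append(b.ToString("x2"));
        }
        return builder.ToString();
    }
}
```

Logging the bytes before decoding and the characters afterwards makes it easier to tell whether the data arrived wrong or was decoded wrongly.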
Now you need to make sure there's a test case to use. Find some (preferably
small) example of where your application is failing, make sure you know exactly
what the result should be, and then log the actual result at each of your
possible problem points. (Some may be out of your control, but usually if you
log as soon as you receive some data and just before you send some data, you'll
find the problem.)
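One way to pin down exactly what the result should be is to write the expected string with \u escapes, so that the encoding of the source file itself cannot interfere. The sketch below records a labelled dump at each checkpoint; the sample text, the checkpoint labels and the CheckpointLog class are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class CheckpointLog
{
    // Expected test value written with an escape: "cafe" with an acute
    // accent on the final letter (U+00E9).
    public const string Expected = "caf\u00e9";

    static readonly List<string> entries = new List<string>();

    // The recorded entries, in the order they were logged.
    public static IReadOnlyList<string> Entries
    {
        get { return entries; }
    }

    // Records a labelled hex dump of the string at one point in the flow.
    public static void Record(string checkpoint, string value)
    {
        List<string> units = new List<string>();
        foreach (char c in value)
        {
            units.Add(((int)c).ToString("x4"));
        }
        entries.Add(checkpoint + ": " + string.Join(" ", units));
    }
}
```

Calling Record with a label such as "after database read" and again with "before sending response" gives one line per checkpoint, and the first line which differs from the dump of Expected shows where to start looking.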
Having logged a problematic string, you should verify whether or not it's
what it should be. This is where the
Unicode code charts page comes in. You can either pick which block you
believe the correct character is in, or you can search for your character
alphabetically. Check that each character in the string has its proper Unicode
value. As soon as you find a point in your application flow where the character
data is corrupted, you should investigate that area of the code, find out why
it's being corrupted and fix it. When you've got it right throughout the
application flow, the application should be working properly.
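Checking each value by eye gets tedious for longer strings, so a small helper can point at the first value which differs from the expected one. This is a sketch under the assumption that you have the expected string available in code; UnicodeCheck, FirstMismatch and Describe are illustrative names.

```csharp
using System;

public static class UnicodeCheck
{
    // Returns the index of the first UTF-16 code unit at which the strings
    // differ, or -1 if they are identical. If one string is a prefix of the
    // other, the length of the shorter string is returned.
    public static int FirstMismatch(string expected, string actual)
    {
        int common = Math.Min(expected.Length, actual.Length);
        for (int i = 0; i < common; i++)
        {
            if (expected[i] != actual[i]) return i;
        }
        return expected.Length == actual.Length ? -1 : common;
    }

    // Describes the first difference using hex values, or returns "match".
    public static string Describe(string expected, string actual)
    {
        int index = FirstMismatch(expected, actual);
        if (index < 0) return "match";
        string e = index < expected.Length ? ((int)expected[index]).ToString("x4") : "end";
        string a = index < actual.Length ? ((int)actual[index]).ToString("x4") : "end";
        return string.Format("index {0}: expected {1}, actual {2}", index, e, a);
    }
}
```

For example, Describe("caf\u00e9", "caf\u00c3\u00a9") reports "index 3: expected 00e9, actual 00c3", which can then be looked up in the code charts.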