Scope of this page
This is a big topic. Don't expect this page to do more than scratch
the surface - indeed, if you believe you're already fairly experienced and
knowledgeable about character encodings and the like, this page may well not
have anything new or useful for you. However, there are still many people who
don't understand the difference between binary and text, or know what a
character encoding is, etc. It is for these people that this page has been
written. It mentions a few advanced topics, but only to make the reader aware of
their existence, rather than to give much guidance on them.
Resources
The links below are probably all at least as useful as this page, and
probably more so - but there's more to read in them, too. I referred to all of
them (and more) when writing this page. There's a lot of good information, and
while there may be some inaccuracies on this page (if you spot any, please mail
me at skeet@pobox.com) these resources
should be correct.
- The Unicode Web Site Main Page
  The definitive resource about Unicode, this is somewhat intimidating but
  will have all the answers you need about Unicode itself - somewhere! Some of
  the links below are just helpful pages from the site.
- The Unicode Glossary
  At-a-glance definitions of many of the terms used when discussing
  character encoding (etc) issues.
- The Unicode FAQ
  Answers to hundreds of common questions, divided into sections.
- Unix/Linux UTF-8/Unicode FAQ
  Don't be put off by the title if you don't like Unix/Linux - most of the
  information here is very relevant to .NET issues.
- The Unicode Character Encoding Model
  Gives more information about precise meanings of "character encoding
  scheme" etc.
- The Absolute Minimum Every Software Developer Absolutely, Positively Must
  Know About Unicode and Character Sets (No Excuses!)
  A page somewhat similar to this one, but without the .NET emphasis.
- On the goodness of Unicode
  Another introductory page which is worth a read.
Binary and text - a big distinction
Most modern computer languages (and some older ones) make a big
distinction between "binary" content and "character" (or "text") content.
The difference is largely the same as the instinctive one, but for the
purposes of clarity, I'll define it here as:
- Binary content is a sequence of octets (bytes in common parlance)
with no intrinsic meaning attached. Even though there may be external
means of understanding a piece of binary content to be, say, a picture,
or an executable file, the content itself is just a sequence of bytes.
(Note for pedantic readers: from now on, I won't use the word "octet".
I'll use "byte" instead, even though strictly speaking a byte needn't be
an octet. There have been architectures with 9-bit bytes, for instance.
I don't believe that's a particularly relevant or useful distinction to
make in this day and age, and readers are likely to be more comfortable
with the word "byte".)
- Character content is a sequence of characters.
The Unicode Glossary defines a character as:
- The smallest component of written language that has semantic value;
refers to the abstract meaning and/or shape, rather than a specific
shape (see also glyph), though in code tables some form of visual
representation is essential for the reader's understanding.
- Synonym for abstract character. (See Definition D3 in Section 3.3,
Characters and Coded Representations.)
- The basic unit of encoding for the Unicode character encoding.
- The English name for the ideographic written elements of Chinese
origin. (See ideograph (2).)
That may or may not be a terribly useful definition to you, but for the
most part you can again use your instinctive understanding - a character is
something like "the capital letter A", "the digit 1" etc. There are other
characters which are less obvious, such as: combining characters such as "an
acute accent", control characters such as "newline", and formatting
characters (invisible, but affect surrounding characters). The important
thing is that these are fundamentally "text" in some form or other. They
have some meaning attached to them.
Now, unfortunately in the past, this distinction has been very blurred -
C programmers are often used to thinking of "byte" and "char" as being
interchangeable, to the extent that they will talk about reading a certain
number of characters, even when the content is entirely binary. In modern
environments such as .NET and Java, where the distinction is clear and
present in the IO libraries, this can lead to people attempting to copy
binary files by reading and writing characters, resulting in corrupt output.
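To make that concrete, here is a small sketch of what goes wrong when binary data is pushed through a text API. The class and method names are my own, chosen purely for illustration, and the four bytes are simply the start of a PNG file signature. The round trip decodes the bytes as UTF-8 and encodes them again, which is roughly what a character-based file copy does.

```csharp
using System;
using System.Text;

public static class BinaryAsTextDemo
{
    // Hypothetical helper: decodes bytes as UTF-8 text and encodes the
    // result again, the way a character-based copy would.
    public static byte[] RoundTripThroughText(byte[] original)
    {
        string asText = Encoding.UTF8.GetString(original);
        return Encoding.UTF8.GetBytes(asText);
    }

    public static void Main()
    {
        // 0x89 on its own is not a valid UTF-8 sequence.
        byte[] original = { 0x89, 0x50, 0x4e, 0x47 };
        byte[] copied = RoundTripThroughText(original);

        Console.WriteLine(BitConverter.ToString(original));
        Console.WriteLine(BitConverter.ToString(copied));
    }
}
```

The two lines printed differ, because the invalid byte cannot survive being treated as a character. Copying the same data with a plain Stream, reading and writing bytes, avoids the problem entirely.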
Where does Unicode come in?
The Unicode Consortium is a body trying to standardise the handling of
character data, including its transformation to and from binary form
(otherwise known as encoding and decoding). There is also a set of ISO
standards (10646 in various versions) which do similar things; Unicode and
ISO 10646 can largely be regarded as "the same thing" in that they are
compatible in almost all respects. (In theory ISO 10646 defines a larger
potential set of characters, but this is never likely to become an issue.)
Most modern computer languages and environments, such as .NET and Java, use
Unicode for character representations. Unicode defines, amongst other
things, an abstract character repertoire (the set of characters it
covers), a coded character set (a mapping from each character in the
repertoire to a non-negative integer), some character encoding forms
(mappings from the non-negative integers in the coded character set to
sequences of "code units" (eg bytes)), and some character encoding
schemes (mappings from sequences of code units into serialized byte
sequences). The difference between a character encoding form and a character
encoding scheme is slightly subtle, but takes account of things like
endianness. (For instance, the single UCS-2 code unit 0xc2a9 may be
serialized as the bytes 0xc2 0xa9 or 0xa9 0xc2, and it's the character
encoding scheme that decides that.)
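The endianness point is easy to see in code. The sketch below (names are illustrative only) takes a string containing the single 16-bit code unit 0xc2a9 and serializes it with the two built-in 16-bit encodings that .NET provides, which are introduced properly later on this page.

```csharp
using System;
using System.Text;

public static class EndiannessDemo
{
    // Hypothetical helper: serializes one code unit with the given encoding.
    public static byte[] Serialize(Encoding encoding)
    {
        string text = "\uc2a9"; // one 16-bit code unit with the value 0xc2a9
        return encoding.GetBytes(text);
    }

    public static void Main()
    {
        // Little-endian: least significant byte first.
        Console.WriteLine(BitConverter.ToString(Serialize(Encoding.Unicode)));
        // Big-endian: most significant byte first.
        Console.WriteLine(BitConverter.ToString(Serialize(Encoding.BigEndianUnicode)));
    }
}
```

The code unit is the same in both cases; only the order of the two bytes on the wire changes.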
The Unicode abstract character repertoire can, in theory, hold up to
1114112 characters, although many are reserved to be invalid and the rest
aren't all likely to ever be assigned. Each character is coded as an integer
between 0 and 1114111 (0x10ffff). For instance, capital A is coded as 65.
Until a few years ago, it was hoped that only characters in the range 0 to
2^16-1 would be required, which would have meant that each character would
only have required 2 bytes to be represented. Unfortunately, more characters
were needed, so surrogate pairs were introduced. They confuse things
significantly (at least, they confuse me significantly) and most of
the rest of this page will ignore their existence - I'll cover them briefly
in the "Difficult bits" section.
What does .NET provide?
If all of this sounds rather confusing, don't worry. It's worth being
aware of the distinctions above, but they don't often actually come to the
fore. Most of the time you just want to convert some bytes into some
characters, and vice versa. This is where the System.Text.Encoding
class comes in, along with the System.Char structure (aka char in C#) and the
System.String class (aka string in C#).
The char is the most basic character type. Each char
is a single UTF-16 code unit, which (surrogates aside) you can think of as a
single Unicode character. It takes 2 bytes in memory, and can take a
value of 0-65535. Note that not all of these values are actually valid
Unicode characters.
A string is just a sequence of chars,
fundamentally. It's immutable, which means that once you've created a string
instance (however you've done it) you can't change it - the various methods
in the string class which suggest that they're changing the string in fact
just return a new string which is the original character sequence with the
changes applied.
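A short sketch of both points, with names of my own choosing: a char is really just a 16-bit number, and string methods hand back new strings rather than modifying the one they are called on.

```csharp
using System;

public static class CharAndStringDemo
{
    // Hypothetical helper: returns the numeric value stored in a char.
    public static int NumericValue(char c)
    {
        return (int) c;
    }

    public static void Main()
    {
        Console.WriteLine(NumericValue('A')); // capital A is coded as 65
        Console.WriteLine(sizeof(char));      // size of a char in bytes

        string original = "unicode";
        string upper = original.ToUpperInvariant();

        Console.WriteLine(original); // the original instance is untouched
        Console.WriteLine(upper);    // the "changed" text is a new string
    }
}
```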
The System.Text.Encoding class provides facilities for
converting arrays of bytes to arrays of characters, or strings, and vice
versa. The class itself is abstract; various implementations are provided by
.NET and can easily be instantiated, and users can write their own derived
classes if they wish. (This is quite a rare requirement, however - most of
the time you'll be fine with the built-in implementations.) An encoding can
also provide separate encoders and decoders, which maintain state between
calls. This is necessary for multi-byte character encoding schemes, where
you may not be able to decode all the bytes you have so far received from a
stream. For instance, if a UTF-8 decoder receives 0x41 0xc2, it can return
the first character (a capital A) but must wait for the third byte to
determine what the second character is.
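The stateful decoder case described above can be sketched as follows. The helper name and the way the bytes are split into chunks are assumptions for the sake of the example; the point is only that the same Decoder instance remembers the incomplete sequence between calls.

```csharp
using System;
using System.Text;

public static class StatefulDecoderDemo
{
    // Hypothetical helper: decodes one chunk, keeping any incomplete
    // trailing sequence inside the decoder for the next call.
    public static string DecodeChunk(Decoder decoder, byte[] chunk)
    {
        // One extra slot, in case this chunk completes a character
        // that was started by an earlier chunk.
        char[] chars = new char[chunk.Length + 1];
        int count = decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
        return new string(chars, 0, count);
    }

    public static void Main()
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();

        // The first chunk ends halfway through a two-byte sequence.
        string first = DecodeChunk(decoder, new byte[] { 0x41, 0xc2 });
        // The second chunk supplies the missing byte.
        string second = DecodeChunk(decoder, new byte[] { 0xa9 });

        Console.WriteLine(first.Length);           // 1: just the capital A
        Console.WriteLine((int) second[0] == 0xa9); // True: the copyright sign
    }
}
```

Calling Encoding.UTF8.GetString on each chunk separately would not work here, as each call starts afresh with no memory of the previous bytes.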
Built-in encoding schemes
.NET provides various encoding schemes "out of the box". What follows
below is a description (as far as I can find) of the various different
encoding schemes, and how they can be retrieved.
ASCII
ASCII is one of the most commonly known and frequently misunderstood
character encodings. Contrary to popular belief, it is only 7 bit - there
are no ASCII characters above 127. If anyone says that they wish to encode
(for example) "ASCII 154" they may well not know exactly which encoding they
actually mean. If pressed, they're likely to say it's "extended ASCII".
There is no encoding scheme called "extended ASCII". There are many 8-bit
encodings which are supersets of ASCII, and usually it is one of these which
is meant - commonly whatever Windows Code Page is the default for their
computer. Every ASCII character has the same value in the ASCII encoding as
in the Unicode coded character set - in other words, ASCII x is the
same character as Unicode x for all characters within ASCII. The .NET
ASCIIEncoding class (an instance of which can be easily
retrieved using the Encoding.ASCII property) was slightly odd in early
versions of .NET, in my view, as it appeared to encode by merely stripping
away all bits above the bottom 7. This meant that, for instance, Unicode
character 0xb5 ("micro sign") after encoding and decoding would become
Unicode 0x35 ("digit five"), rather than some character showing that it was
the result of encoding a character not contained within ASCII. Later
versions replace such characters with a question mark instead.
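If you want to see what your own version of .NET does with a character outside ASCII, a round trip like the sketch below is enough. The class and method names are just for illustration, and I deliberately make no claim in the code about which replacement character comes out, since that depends on the framework version. What is certain is that the micro sign does not come back.

```csharp
using System;
using System.Text;

public static class AsciiRoundTripDemo
{
    // Hypothetical helper: encodes to ASCII bytes and decodes them again.
    public static string RoundTrip(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(bytes);
    }

    public static void Main()
    {
        string micro = "\u00b5"; // micro sign, not part of ASCII
        string result = RoundTrip(micro);

        // Print the code of whatever character came back.
        Console.WriteLine("0x{0:x}", (int) result[0]);
        Console.WriteLine(result == micro); // False: the data has been lost

        // Plain ASCII text is unaffected.
        Console.WriteLine(RoundTrip("hello") == "hello");
    }
}
```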
UTF-8
UTF-8 is a good general-purpose way of representing Unicode characters.
Each character is encoded as a sequence of 1-4 bytes. (All the characters <
65536 are encoded in 1-3 bytes; I haven't checked whether .NET encodes
surrogates as two sequences of 1-3 bytes, or as one sequence of 4 bytes). It
can represent all characters, it is "ASCII-compatible" in that any sequence
of characters in the ASCII set is encoded in UTF-8 to exactly the same
sequence of bytes as it would be in ASCII. In addition, the first byte is
sufficient to say how many additional bytes (if any) are required for the
whole character to be decoded. UTF-8 itself needs no byte-ordering mark
(BOM) although it could be used as a way of giving evidence that the file is
indeed in UTF-8 format. The UTF-8 encoded BOM is always 0xef 0xbb 0xbf.
Obtaining a UTF-8 encoding in .NET is simple - use the Encoding.UTF8
property. In fact, a lot of the time you don't even need to do that - many
classes (such as StreamWriter) use UTF-8 by default when no
encoding is specified. (Don't be misled by Encoding.Default -
that's something else entirely!) I suggest always specifying the encoding,
however, just for the sake of readability.
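The variable length of UTF-8 is easy to check for yourself. This sketch (the helper name is mine) asks Encoding.UTF8 how many bytes it needs for four sample characters, one from each length band, and confirms that ASCII text produces identical bytes under both encodings. The last sample is a character above 65535, held in the string as a surrogate pair.

```csharp
using System;
using System.Text;

public static class Utf8LengthDemo
{
    // Hypothetical helper: number of bytes UTF-8 needs for the text.
    public static int Utf8Length(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        Console.WriteLine(Utf8Length("A"));          // U+0041, within ASCII
        Console.WriteLine(Utf8Length("\u00e9"));     // U+00E9, e with acute accent
        Console.WriteLine(Utf8Length("\u20ac"));     // U+20AC, euro sign
        Console.WriteLine(Utf8Length("\U0001D11E")); // U+1D11E, musical G clef

        // ASCII-compatibility: the same bytes from either encoding.
        string ascii = BitConverter.ToString(Encoding.ASCII.GetBytes("text"));
        string utf8 = BitConverter.ToString(Encoding.UTF8.GetBytes("text"));
        Console.WriteLine(ascii == utf8);
    }
}
```

On the versions of .NET I know of, the four counts come out as 1, 2, 3 and 4, so a valid surrogate pair is written as a single four-byte sequence.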
UTF-16 and UCS-2
UTF-16 is effectively how characters are maintained internally in .NET.
Each character is encoded as a sequence of 2 bytes, other than surrogates
which take 4 bytes. The opportunity of using surrogates is the only
difference between UTF-16 and UCS-2 (also known as just "Unicode"), the
latter of which can only represent characters 0-0xffff. UTF-16 can be
big-endian, little-endian, or machine-dependent with optional BOM (0xff 0xfe
for little-endianness, and 0xfe 0xff for big-endianness). In .NET itself, I
believe the surrogate issues are effectively forgotten, and each value in
the surrogate pair is treated as an individual character, making UCS-2 and
UTF-16 "the same" in a fuzzy sort of way. (The exact differences between
UCS-2 and UTF-16 rely on deeper understanding of surrogates than I have, I'm
afraid - if you need to know details of the differences, chances are you'll
know more than I do anyway.) A big-endian encoding may be retrieved using
Encoding.BigEndianUnicode, and a little-endian encoding may be retrieved
using Encoding.Unicode. Both are instances of System.Text.UnicodeEncoding,
which can also be constructed directly with appropriate parameters for
whether or not to emit the BOM and which endianness to use when encoding. I
believe (although I haven't tested) that when decoding binary content, a BOM
in the content overrides the endianness of the encoder, so the programmer
doesn't need to do any extra work to decode appropriately if they either
know the endianness or the content contains a BOM.
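The BOM bytes mentioned above can be inspected through the GetPreamble method which every Encoding has. The sketch below (helper name is illustrative) prints the preamble for the built-in encodings, and for a UnicodeEncoding constructed directly with the BOM switched off.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Hypothetical helper: the encoding's BOM as a hex string.
    public static string Preamble(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    public static void Main()
    {
        Console.WriteLine(Preamble(Encoding.Unicode));          // FF-FE
        Console.WriteLine(Preamble(Encoding.BigEndianUnicode)); // FE-FF
        Console.WriteLine(Preamble(Encoding.UTF8));             // EF-BB-BF

        // Constructor arguments: big-endian, and do not emit a BOM.
        Encoding noBom = new UnicodeEncoding(true, false);
        Console.WriteLine(Preamble(noBom).Length);              // 0
    }
}
```

Note that GetBytes never includes the preamble; it is up to whatever is writing the data (a StreamWriter, for instance) to emit it at the start.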
UTF-7
UTF-7 is rarely used, in my experience, but encodes Unicode (possibly
only the first 65535 characters) entirely into ASCII characters (not
bytes!). This can be useful for mail where the mail gateway may only support
ASCII characters, or some subset of ASCII (in, for example, the EBCDIC
encoding). This description sounds fairly woolly for a reason: I haven't
looked into it in any detail, and don't intend to. If you need to use it,
you'll probably understand it reasonably well anyway, and if you don't
absolutely have to use it, I'd suggest steering clear. An encoding instance
in .NET can be retrieved using Encoding.UTF7.
Windows/ANSI Code Pages
Windows Code Pages are usually either single or double byte character
sets, encoding up to 256 or 65536 characters respectively. Each is numbered,
and an encoding for a known code page number can be retrieved using
Encoding.GetEncoding(int). Code pages are mostly useful for legacy
data which is often stored in the "default code page". An encoding for the
default code page can be retrieved using Encoding.Default. Again, I try to
avoid using code pages where possible. More information is available in MSDN.
ISO-8859-1 (Latin-1)
Like ASCII, every character in Latin-1 has the same code there as in
Unicode. I haven't been able to establish for certain whether or not
Latin-1 has a "hole" of undefined characters from 128 to 159, or whether it
contains the same control characters there that Unicode does. Latin-1 is
also code page 28591, so obtaining an encoding for it is simple:
Encoding.GetEncoding(28591).
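As a quick experiment, the sketch below fetches the Latin-1 encoding by its code page number and checks how .NET's implementation maps byte values to Unicode code points. The class and method names are made up for the example, and the result only tells you what this particular implementation does with the 128 to 159 range, not what the standard itself says.

```csharp
using System;
using System.Text;

public static class Latin1Demo
{
    // Hypothetical helper: true if every byte value 0-255 decodes to
    // the Unicode character with the same number.
    public static bool BytesMatchCodePoints()
    {
        Encoding latin1 = Encoding.GetEncoding(28591);
        for (int value = 0; value < 256; value++)
        {
            string decoded = latin1.GetString(new byte[] { (byte) value });
            if (decoded.Length != 1 || decoded[0] != value)
            {
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding(28591);

        // e with acute accent is U+00E9, and one byte in Latin-1.
        byte[] bytes = latin1.GetBytes("\u00e9");
        Console.WriteLine(BitConverter.ToString(bytes));

        Console.WriteLine(BytesMatchCodePoints());
    }
}
```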
Streams, readers and writers
Streams are by their nature binary - they read and write bytes,
fundamentally. Anything which takes a string is going to do some kind of
conversion to bytes, which may or may not be what you want. The equivalents
of streams for reading and writing text are System.IO.TextReader
and System.IO.TextWriter respectively. If you have a stream
already, you can use System.IO.StreamReader (which derives from
TextReader) and System.IO.StreamWriter (which
derives from TextWriter), constructing them with
the stream and the encoding you wish to use. If you don't specify the
encoding, UTF-8 is assumed. Here is some example code to convert a file from
UTF-8 to UCS-2:

using System;
using System.IO;
using System.Text;

public class FileConverter
{
    const int BufferSize = 8096;

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine
                ("Usage: FileConverter <inputfile> <outputfile>");
            return;
        }

        // Open a TextReader for the appropriate file
        using (TextReader input = new StreamReader
               (new FileStream (args[0], FileMode.Open),
                Encoding.UTF8))
        {
            // Open a TextWriter for the appropriate file
            using (TextWriter output = new StreamWriter
                   (new FileStream (args[1], FileMode.Create),
                    Encoding.Unicode))
            {
                // Create the buffer
                char[] buffer = new char[BufferSize];
                int len;

                // Repeatedly copy data until we've finished
                while ( (len = input.Read (buffer, 0, BufferSize)) > 0)
                {
                    output.Write (buffer, 0, len);
                }
            }
        }
    }
}
Note that this demonstrates using the constructors for StreamReader
and StreamWriter which take streams. There are also constructors
which take filenames as parameters, so that you don't have to manually open
a FileStream in your code. Other parameters, such as the buffer
size and whether or not to detect a BOM if present, are available - see the
documentation for more details.
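For comparison, here is a sketch of the same conversion using the filename-based constructors. The class and method names are my own, and the buffer size is an arbitrary choice. The second argument to the StreamWriter constructor says whether to append to an existing file; passing false overwrites it.

```csharp
using System;
using System.IO;
using System.Text;

public static class SimpleFileConverter
{
    // Hypothetical helper: copies a UTF-8 text file to a little-endian
    // UTF-16 file, letting the reader and writer open the files themselves.
    public static void Convert(string inputPath, string outputPath)
    {
        using (TextReader input = new StreamReader(inputPath, Encoding.UTF8))
        using (TextWriter output = new StreamWriter(outputPath, false, Encoding.Unicode))
        {
            char[] buffer = new char[8192];
            int len;

            // Copy characters, not bytes: the encodings do the conversion.
            while ((len = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                output.Write(buffer, 0, len);
            }
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine("Usage: SimpleFileConverter <inputfile> <outputfile>");
            return;
        }
        Convert(args[0], args[1]);
    }
}
```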
Difficult bits
Okay, so those are the basics of Unicode. There are then lots of extra
bits, some of which have already been hinted at, and which people ought to
be aware of, even if they deem them too unlikely to be relevant for
their application to be worth sorting out. I don't offer any general
techniques or guiding principles here - I'm just trying to raise some
awareness. This is by no means an exhaustive list, either - these are just
some of the nasty bits. It's important to recognise that a lot of the
difficulty here is in no way the fault of the Unicode Consortium - just as
with dates and times and any number of other internationalisation problems,
humanity has got itself into a fundamentally tricky situation over the
course of its history.
Surrogate pairs
Now that Unicode has more than 65536 characters, it can't be represented
in two bytes. This means that a .NET char value can't store all
possible values. The solution UTF-16 uses is that of surrogate pairs: pairs
of 16-bit values where each value is between 0xd800 and 0xdfff. In other
words, two "sort of" characters make one "real" character. (UCS-4 and UTF-32
get round this problem entirely by having wider values to start with - when
everything's four bytes, you can get all possible characters in.) This is
basically a headache - it means that a string of 10 chars can actually
represent anywhere between 5 and 10 "real" Unicode characters. Fortunately,
most applications which don't involve scientific/mathematical notation and
Han characters are unlikely to need to worry too much about them. Whether or
not that applies to you is a different matter - and exactly which bits of
your code are sensitive to surrogates will also vary between applications.
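A small sketch may help show the mismatch between chars and "real" characters. The CountCodePoints helper is hypothetical, written just for this example; the char methods it calls are part of the framework.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: counts Unicode characters, treating each
    // valid surrogate pair as one character.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsHighSurrogate(text[i])
                && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                i++; // skip the second half of the pair
            }
            count++;
        }
        return count;
    }

    public static void Main()
    {
        // U+1D11E (musical G clef) is above 0xffff.
        string clef = "\U0001D11E";

        Console.WriteLine(clef.Length);                       // 2 chars
        Console.WriteLine(CountCodePoints(clef));             // 1 character
        Console.WriteLine(char.ConvertToUtf32(clef, 0) == 0x1D11E);

        Console.WriteLine(CountCodePoints("A" + clef + "B")); // 3 characters
    }
}
```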
Combining characters
Not all characters should result in a single character being drawn on the
screen. An accented character can be represented as the unaccented character
followed by a combining accent character. Some GUI systems will support
combining characters, some won't - and the impact on your application will
depend on what assumptions you're making.
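One assumption that is easy to get wrong is that the Length of a string is the number of characters a user would see. As a rough sketch (the helper name is my own), the framework's StringInfo class can count "text elements" instead, which groups a base character with the combining characters that follow it.

```csharp
using System;
using System.Globalization;

public static class CombiningDemo
{
    // Hypothetical helper: number of text elements, i.e. base
    // characters together with their combining characters.
    public static int VisibleLength(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }

    public static void Main()
    {
        // A plain "e" followed by U+0301, the combining acute accent.
        string accented = "e\u0301";

        Console.WriteLine(accented.Length);         // 2 chars
        Console.WriteLine(VisibleLength(accented)); // 1 text element
    }
}
```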
Normalization
Partly due to things like combining characters, there can be several ways
of representing what is in some senses a single character. Character
sequences can be normalised to use combining characters wherever possible,
or to avoid using combining characters wherever possible. Should your
application treat two different sequences representing the same actual
character as different or the same? Do any components you need rely on
sequences being normalised in one particular way?
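To illustrate, the sketch below builds the same accented letter in two ways and compares them before and after normalising. The helper name is an assumption of mine; string.Normalize and the NormalizationForm enumeration are part of the framework (from .NET 2.0 onwards, as far as I know). Form C composes characters where it can, and form D decomposes them.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: compares two strings char by char after
    // putting both into the composed form (form C).
    public static bool EqualWhenComposed(string first, string second)
    {
        string a = first.Normalize(NormalizationForm.FormC);
        string b = second.Normalize(NormalizationForm.FormC);
        return string.Equals(a, b, StringComparison.Ordinal);
    }

    public static void Main()
    {
        string precomposed = "\u00e9"; // a single "e with acute" character
        string combining = "e\u0301";  // "e" plus a combining acute accent

        // Different char sequences...
        Console.WriteLine(string.Equals(precomposed, combining, StringComparison.Ordinal));
        // ...but the same text once both are normalised.
        Console.WriteLine(EqualWhenComposed(precomposed, combining));

        // Form D goes the other way, splitting the accent off again.
        Console.WriteLine(precomposed.Normalize(NormalizationForm.FormD).Length);
    }
}
```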
About
Jon Skeet