Notes on Character Set Encoding Issues
Marc R. Hoffmann, June 2004
WORK IN PROGRESS DRAFT
[Introduction]

[ MIME, PDF, XML, unicode, glyphs, Single Byte, Multi Byte, Fixed Length, Variable Length (self containing or not), pattern matching, indexing, search, sorting, http://www.iana.org/assignments/character-sets, http://www.i18ngurus.com/, http://www.cs.tut.fi/~jkorpela/chars.html, http://home.att.net/~jameskass/, http://en.wikipedia.org/wiki/Unicode, http://www.microsoft.com/globaldev/reference/wincp.mspx, http://czyborra.com/utf/, http://zsigri.tripod.com/fontboard/cjk/charsets.html, http://lfw.org/text/jp-www.html, http://www.faqs.org/rfcs/rfc1468.html, http://www.faqs.org/rfcs/rfc2237.html, http://homepages.cwi.nl/~dik/english/codes/stand.html Arial Unicode MS ]

Definitions

Character and Character Set

According to the Unicode specification, a character is the abstract representation of the smallest component of written language that has semantic value, such as letters, number symbols, white space and punctuation. In addition, definitions may be made for technical, graphical, musical or currency symbols as well as for purely technical control commands. A set of such definitions is called a character set or sometimes a repertoire. The definitions are typically provided as descriptive text like "LATIN CAPITAL LETTER M", "DEVANAGARI LETTER VOCALIC R" or "MUSICAL SYMBOL G CLEF". The definitions do not refer to a particular graphical representation, which may vary for different uses (see the section about glyphs below). [todo "smallest component"]

For electronic processing it makes sense to enumerate character definitions with non-negative integers. Such a number assigned to a character definition is called its code point, code position or just code. The set of integer values used to represent a character set is called the code space. [todo reference to UCS, Unicode etc.]

Character Encoding

Modern computer systems store data as octet (or byte) sequences. A particular scheme to map code positions to octet streams is called a character encoding.
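The terms defined so far can be illustrated with a minimal Python sketch. The sample characters are chosen for illustration only; the sketch relies on Python's `str` type being a sequence of Unicode code points and on the standard `unicodedata` module for the descriptive character names.

```python
import unicodedata

# A character is identified by its descriptive name and its code point.
ch = "M"
name = unicodedata.name(ch)   # descriptive definition of the character
code_point = ord(ch)          # integer code position in the Unicode code space

# A character encoding maps code points to octet sequences.
# The same code point may map to different octets under different encodings.
e_acute = "\u00e9"            # LATIN SMALL LETTER E WITH ACUTE
latin1_octets = e_acute.encode("iso-8859-1")
utf8_octets = e_acute.encode("utf-8")

print(name, hex(code_point))
print(latin1_octets.hex(), utf8_octets.hex())
```

The first line of output shows the character definition together with its code point; the second shows that one code point yields one octet in ISO-8859-1 but two octets in UTF-8.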
Many different standards for character encodings exist and are in use. This paper is about those standards, how to use them correctly and what problems may occur if the wrong character encoding is applied.

Glyphs and Fonts

A glyph is a concrete graphical representation of a particular character. Different representations may be used for the same character, e.g. for "LATIN CAPITAL LETTER A".
A particular set of glyphs for rendering a character set (or a subset of it) is called a font. Note that there is not necessarily a 1:1 relationship between characters and glyphs. Depending on the character set definition and the writing system, multiple characters may form a single glyph. Conversely, a single character may require multiple glyphs for its graphical representation. Moreover, different glyphs may be used to represent the same character depending on its context. [todo provide examples]

While graphical character representations are not in the scope of this document, illustrative glyphs sometimes appear in the provided examples.

Encodings
Encoding Problems

Encoding problems occur when different encoding schemes are used for converting characters to octets and vice versa. Typically this results in broken character representations at some point during the processing of textual information. Depending on the strictness of the decoder implementation, decoding a particular octet stream may also fail altogether. In detail, the following symptoms may accompany encoding problems. Knowing them can help you to discover the cause of the problem.

Single broken Characters
Nearly all encodings have the same code positions for the so-called basic Latin characters within the 7-bit Unicode range (U+0000 to U+007F).
Assume you have created a text document with a simple editor under MS Windows. The default encoding for English locales is the Microsoft Windows code page 1252 (Latin I), which is then used to encode your text.
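Uploading this file to, e.g., a web server that declares its pages to be encoded in ISO-8859-1 will result in a broken representation, as the positions 0x80 to 0x9F, which code page 1252 assigns to printable characters, are not assigned to printable characters in ISO-8859-1.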
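The effect can be reproduced with a short Python sketch. The sample text is an assumption chosen for illustration, not the original example; `cp1252` and `iso-8859-1` are the names under which Python's codec registry provides the two encodings.

```python
# Hypothetical sample text containing the euro sign and typographic quotes.
text = "Price: \u20ac5 \u201cspecial\u201d"

# Windows code page 1252 places these characters in the 0x80-0x9F range.
octets = text.encode("cp1252")

# Decoding the same octets as ISO-8859-1 never fails in Python, but the
# octets 0x80-0x9F come out as C1 control characters, not the originals.
misread = octets.decode("iso-8859-1")

# Basic Latin characters (below 0x80) survive unchanged in both encodings.
broken = [(a, b) for a, b in zip(text, misread) if a != b]
print(octets)
print(broken)
```

Only the three characters outside the basic Latin range end up in `broken`; the rest of the text is unaffected, which is why this kind of problem shows up as single broken characters.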
Unexpected Extra Characters

This situation occurs if text that was encoded with a multi-byte encoding is decoded as if it were single-byte encoded. This affects, for example, non-ASCII characters encoded in UTF-8.
Note that the letter é is encoded with two octets (0xC3 0xA9). A single-byte decoder shows each of them as a separate character.
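A minimal Python sketch of this symptom follows, assuming a hypothetical sample word and ISO-8859-1 as the wrongly applied single-byte encoding.

```python
text = "caf\u00e9"                      # "café", 4 characters

octets = text.encode("utf-8")           # é becomes the two octets 0xC3 0xA9
misread = octets.decode("iso-8859-1")   # each octet is read as one character

print(len(text), len(octets), len(misread))
print(misread)
```

The misread text is one character longer than the original, because the two octets of é are displayed as the two characters Ã and ©.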
The other way around, it may also happen that multiple characters get combined into a single character if a multi-byte encoding is applied to single-byte encoded text. This situation is rather theoretical, however; more likely it will lead to invalid octet streams as described in the next section.

Broken Decoding

Single-byte encodings like ISO-8859-1 can produce any octet sequence. There are no constraints on the octet values that may follow other values.
On the other hand, some multi-byte encodings follow rules that constrain which octet values can follow each other. In turn, this means that some octet streams represent invalid input data for such encodings. A good example of such an encoding is UTF-8, which cannot decode the octet stream produced above.
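The difference can be sketched in Python as follows. The sample string is hypothetical, and the `errors` argument of `bytes.decode` is used here merely to stand in for decoder implementations of differing strictness.

```python
# Octets of "café" as produced by a single-byte encoding (ISO-8859-1).
octets = "caf\u00e9".encode("iso-8859-1")   # b"caf\xe9"

# ISO-8859-1 accepts every possible octet value.
all_octets = bytes(range(256))
assert len(all_octets.decode("iso-8859-1")) == 256

# In UTF-8, 0xE9 announces a three-octet sequence, but no continuation
# octets follow, so a strict decoder rejects the input.
try:
    octets.decode("utf-8")                  # errors="strict" is the default
    strict_result = "decoded"
except UnicodeDecodeError as exc:
    strict_result = f"failed at octet {exc.start}"

# Lenient modes either drop the offending octet or substitute U+FFFD.
ignored = octets.decode("utf-8", errors="ignore")
replaced = octets.decode("utf-8", errors="replace")

print(strict_result, repr(ignored), repr(replaced))
```

Which behaviour is appropriate depends on the application: silently dropping octets hides the problem, while a strict decoder makes the wrong encoding assumption visible immediately.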
Depending on the strictness of the decoder implementation, such sequences may be ignored. In other implementations decoding may fail altogether. For example, the XML specification requires XML parsers to report a fatal error in such situations:

"It is a fatal error if an XML entity is determined [...] to be in a certain encoding but contains byte sequences that are not legal in that encoding."

Completely scrambled text

[to be done: provide practical example, that shows how this may break protocols]

Other topics

[Provide browser test-script that can be parameterized for: MIME-type, Contenttype, contentencoding, content. Provide graphical glyph viewer, Automatic detection for html forms that include a hidden input test string with character references]
last modified: 2005/12/12 21:45