Correct use of accented and special characters on web pages
or how to avoid mojibake
Introduction and explanations
All printable characters can be divided into two groups, ASCII and extended. ASCII characters are those that can be entered using the US (QWERTY) keyboard. They consist of 26 lowercase and 26 uppercase unaccented Latin letters (a–z, A–Z), 10 digits (0–9), a small selection of punctuation and brackets, and 16 special characters (~ @ # $ % ^ & * _ + = < > / \ |). These characters will display correctly in all circumstances. Extended characters are all other characters. There are so many of them (100,000+, and the number grows with most new releases of the Unicode standard) that it would be impossible to list them all on this page. They include accented Latin letters, Cyrillic, Greek, Arabic, Hebrew and Korean letters, letters used in the languages of the Indian subcontinent, traditional Chinese, simplified Chinese and Japanese ideograms, and numerous punctuation marks and special characters. When using extended characters, simple precautions must be taken so that they display correctly irrespective of the reader’s language settings, default encoding and operating system.
For historical reasons, extended characters are organised in numerous small sets called code pages. A code page is an array with 256 positions, of which the first 128 are identical in all code pages used on the internet. (One notable exception is the family of EBCDIC code pages used on IBM mainframes; there are also some historical code pages that are now obsolete.) These first 128 characters are the ASCII characters. This leaves up to 128 positions for other characters. Because there are many more than 128 national and special characters, they were grouped into separate code pages. There are dozens of them: 15 ISO 8859 code pages (numbered 1 to 16; there is no ISO 8859-12), 9 Windows-125x code pages (Windows-1250 to Windows-1258), which contrary to popular belief and Microsoft propaganda are not the same as the ISO code pages, a few code pages used for Far Eastern languages, numerous Mac and DOS code pages (now mostly extinct and unlikely to be encountered on the internet) and finally Unicode, represented by a number of transformations, the most popular of which is UTF-8. In many circumstances the reader’s software must be told which code page was used; otherwise correct display is accidental and depends on the reader’s settings happening to match those of the system supplying the content (the web page author’s system or the server). This information may be transmitted in a few ways. An alternative is to encode the extended characters in a special form. Please remember that English is a very international language, and you should never assume that everybody has English settings just because he or she wants to read a web page that is in English!
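The code page idea described above can be illustrated with a minimal Python sketch. The three code pages and the helper name `decode_everywhere` are chosen here purely for illustration; the codec names are the ones Python uses for these code pages.

```python
# One byte value decoded under three code pages (an illustrative selection).
CODE_PAGES = ["iso-8859-1", "cp1251", "iso-8859-7"]


def decode_everywhere(raw: bytes) -> dict[str, str]:
    """Decode the same bytes under every code page listed in CODE_PAGES."""
    return {name: raw.decode(name) for name in CODE_PAGES}


if __name__ == "__main__":
    # Position 65 is in the shared ASCII half: the same letter everywhere.
    print(decode_everywhere(b"A"))
    # Position 228 is in the upper half: a different letter in each code page.
    print(decode_everywhere(b"\xe4"))
```

The byte itself never changes; only the table used to interpret it does, which is why the reader’s software has to know which table the author used.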
Sometimes the author may not even be aware that he/she used an extended character, as most word processors have an autocorrection function that, for example, replaces a hyphen or a series of up to three hyphens (ASCII characters) with an en dash or em dash (extended characters), straight typewriter quotes ("ASCII characters") with typographically correct curly quotes (“extended characters”), (c)/(r)/(tm) (which use only ASCII characters) with ©/®/™ (extended), and so on. If the author uses software that is not aware of encodings and does not take certain (simple) precautions, the page may (and usually will) display incorrectly on computers that use a different operating system, different regional settings or a different web browser.
Such a mix-up of characters (for example, displaying Russian text as if it were a string of Western European characters) results in gibberish and is colloquially called “mojibake”. While it is usually possible to change certain settings (such as manually selecting another encoding) to display such a web page correctly, less advanced users will not recognise the problem and may be unable to access any information on the page.
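A minimal Python sketch of how mojibake arises, and why selecting the right encoding manually can undo it. The function names `garble` and `repair` are hypothetical, and the Russian sample word is only an illustration.

```python
def garble(text: str, actual: str, assumed: str) -> str:
    """Store text in the `actual` encoding, then read it back as `assumed`."""
    return text.encode(actual).decode(assumed)


def repair(garbled: str, actual: str, assumed: str) -> str:
    """Reverse garble(): recover the bytes, then decode them correctly."""
    return garbled.encode(assumed).decode(actual)


if __name__ == "__main__":
    # Russian text saved as Windows-1251 but displayed as ISO-8859-1.
    broken = garble("Привет", "cp1251", "iso-8859-1")
    print(broken)                                   # Ïðèâåò
    print(repair(broken, "cp1251", "iso-8859-1"))   # Привет
```

The repair works here only because ISO-8859-1 maps every byte to some character, so no information is lost; with other wrongly assumed encodings some bytes may be undefined and the text cannot always be recovered.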
Declaration of encoding
The information about the code page used may be conveyed in three ways, listed in order of priority; when none of them is present, a default encoding applies:
- Transmitted by the server, in the charset parameter of the HTTP Content-Type header. This is the most correct way. Unfortunately, authors of web pages frequently have no access to the server configuration files.
- Encoding declaration in the file itself. Both HTML and XHTML files can include information about the encoding used. If an XHTML file is sent with the MIME type text/html, it should include the declaration for HTML rather than the one for XHTML. It is also possible to declare the character set used in a CSS stylesheet. It is strongly recommended to use this type of declaration even when the server supplies an encoding declaration, so that pages saved by visitors to their hard drives will still display correctly. The encoding of external JS and CSS files will also be inferred from the BOM (byte order mark), if one is present in the file.
- Encoding declaration in the link code. This declaration informs the browser about the encoding of the linked file. Declarations in links to other documents (<a href="file" charset="utf-8">) are ignored by most browsers (and there are plans to drop them from the next version of the HTML and XHTML standards); however, it seems that character encoding declarations in other types of links (<link rel="stylesheet" href="style.css" charset="utf-8">, <script src="script.js" charset="utf-8">) are well supported.
- Default encoding:
- The default for XHTML files is UTF-8 or UTF-16 (browsers automatically detect which transformation is used from the BOM; its absence means that UTF-8 is used. UTF-16 is rarely, if ever, used). This works only if the XHTML document is sent as application/xhtml+xml.
- With HTML the situation is more difficult. An old RFC declares ISO-8859-1 as the default, but all browsers provide a setting to select the default encoding, and users normally select one appropriate for their language. Because English is a de facto international language and is understood by a large number of speakers of other languages (including languages that do not belong to the Western European family), you should never assume that those reading your web page will be using Western European settings. W3C also warns against assuming that readers will have their browsers set to any particular encoding, and its HTML validator assumes UTF-8 for documents lacking encoding information. Hence only HTML pages that use ASCII characters exclusively may omit the encoding declaration, and only CSS stylesheets that use ASCII characters exclusively, or that use the same encoding as the HTML file they are linked to, may omit it. This also applies to XHTML pages that are sent as text/html.
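The order of priority listed above can be sketched in Python as follows. This is only an illustration of the ordering, not a description of what any particular browser does; the function names and parameters are hypothetical, and the fallback is deliberately left empty because, as noted above, no default can be safely assumed for HTML.

```python
import re
from typing import Optional

# Matches the charset parameter of a Content-Type value,
# e.g. "text/html; charset=utf-8".
_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9._:-]+)"?', re.IGNORECASE)


def charset_from_content_type(header: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    if not header:
        return None
    match = _CHARSET.search(header)
    return match.group(1).lower() if match else None


def effective_encoding(
    http_content_type: Optional[str] = None,  # sent by the server
    in_file: Optional[str] = None,            # declared in the file itself
    in_link: Optional[str] = None,            # declared in the linking element
    fallback: Optional[str] = None,           # reader's default, unknown to the author
) -> Optional[str]:
    """Return the first available declaration, in the order of priority."""
    candidates = [charset_from_content_type(http_content_type), in_file, in_link, fallback]
    for candidate in candidates:
        if candidate:
            return candidate.lower()
    return None


if __name__ == "__main__":
    # The server's declaration wins over the one in the file.
    print(effective_encoding("text/html; charset=ISO-8859-2", in_file="utf-8"))
    # No charset from the server: the declaration in the file is used.
    print(effective_encoding("text/html", in_file="UTF-8"))
```

A result of `None` corresponds to the situation the text warns about: the page will be displayed in whatever encoding the reader’s browser happens to default to.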
For XHTML, the encoding declaration looks like this: <?xml version="1.0" encoding="utf-8"?> (another encoding can be used in place of utf-8). This line must be the first one in the file.
For HTML (including XHTML 1.0 sent as text/html), the encoding declaration looks like this: <meta http-equiv="content-type" content="text/html; charset=utf-8"> (another encoding can be used in place of utf-8). This line must be put in the head section (between <head> and </head>), preferably as the first line of that section.
For CSS: @charset "utf-8"; (another encoding can be used in place of utf-8). This line must be the first one in the file.
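As a rough illustration of the three declarations above (plus the BOM), the following Python sketch looks for a declared encoding in the first bytes of a file. The function name `declared_encoding` and the 1024-byte window are assumptions made for this example, and the regular expressions are deliberately simple; real browsers follow more elaborate rules.

```python
import codecs
import re
from typing import Optional

# Byte order marks and the Unicode transformation they indicate.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

_TOKEN = rb'([A-Za-z0-9._:-]+)'
_PATTERNS = [
    # <?xml version="1.0" encoding="utf-8"?>  (must start the file)
    re.compile(rb'^<\?xml[^>]*\bencoding\s*=\s*["\']' + _TOKEN, re.IGNORECASE),
    # @charset "utf-8";  (must start the file)
    re.compile(rb'^@charset\s+"' + _TOKEN + rb'"\s*;', re.IGNORECASE),
    # <meta http-equiv="content-type" content="text/html; charset=utf-8">
    re.compile(rb'<meta[^>]+charset\s*=\s*["\']?' + _TOKEN, re.IGNORECASE),
]


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding declared in the file itself, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    head = data[:1024]  # declarations are expected near the top of the file
    for pattern in _PATTERNS:
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None


if __name__ == "__main__":
    print(declared_encoding(b'<?xml version="1.0" encoding="utf-8"?><html/>'))
    print(declared_encoding(b'@charset "iso-8859-2";\nbody { margin: 0 }'))
    print(declared_encoding(b"<html><head><title>x</title></head></html>"))
```

The declarations can be found before the encoding is known because they consist of ASCII characters only, which occupy the same positions in all the code pages discussed here.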
See also: W3C I18N Tutorial: Character sets & encodings in XHTML, HTML and CSS.
Encoding extended characters
Extended characters may be encoded in a special way. When all extended characters are encoded in this way, there is no need to declare the encoding: the encoded characters will be displayed correctly irrespective of any encoding declaration, whether sent by the server or declared in the document.
- In HTML 4 and XHTML documents numeric character references (NCRs) may be used. An NCR starts with an ampersand and a hash, followed by the decimal number of the character’s position in Unicode, and is terminated by a semicolon. Thus a pound sign, which is located at position 163, may be encoded as &#163;, the euro sign (at position 8364) as &#8364;, and a with umlaut (position 228) as &#228;. A variation of this system uses an additional x and the hexadecimal rather than decimal number of the character’s position in Unicode: the pound sign will be encoded as &#xA3;, the euro sign as &#x20AC;, and a with umlaut as &#xE4;.
- For some characters so-called named entities also exist. They start with an ampersand, which is followed by an abbreviation of the character’s descriptive name, and are terminated by a semicolon. The pound sign may be encoded as &pound;, the euro sign as &euro;, and a with umlaut as &auml;. These entities may be used in all HTML 4 documents and in XHTML documents whose declared DTD defines them.
- In CSS style sheets an escape sequence starts with a backslash, which is followed by the character’s Unicode position written as a hexadecimal number of exactly 6 digits. If necessary, leading zeros must be added. It is also possible to omit the leading zeros and terminate the sequence with a white space character (space, tab, line break, etc.) or with any character that is neither a digit nor a letter a to f. The pound sign will be escaped as \0000A3 or \A3 , the euro sign as \0020AC or \20AC , and a with umlaut as \0000E4 or \E4 (note the trailing space after each short form). (In CSS1 escape sequences were limited to four hexadecimal digits.)
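The three notations described above can be generated mechanically from a character’s Unicode position. Below is a minimal Python sketch; the helper names are hypothetical, while `html.entities.codepoint2name` and the `xmlcharrefreplace` error handler are part of the Python standard library.

```python
from html.entities import codepoint2name  # HTML 4 entity names by code point
from typing import Optional


def ncr_decimal(ch: str) -> str:
    """Decimal numeric character reference, e.g. &#163; for the pound sign."""
    return f"&#{ord(ch)};"


def ncr_hex(ch: str) -> str:
    """Hexadecimal numeric character reference, e.g. &#xA3;."""
    return f"&#x{ord(ch):X};"


def named_entity(ch: str) -> Optional[str]:
    """Named entity such as &pound;, or None when HTML 4 defines none."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else None


def css_escape(ch: str) -> str:
    """CSS escape in the 6-digit form, which needs no trailing space."""
    return f"\\{ord(ch):06X}"


def to_ascii_html(text: str) -> str:
    """Replace every extended character in text with a decimal NCR."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    for ch in "£€ä":
        print(ch, ncr_decimal(ch), ncr_hex(ch), named_entity(ch), css_escape(ch))
    print(to_ascii_html("£5 or €6"))   # &#163;5 or &#8364;6
```

A file processed with something like `to_ascii_html` contains ASCII characters only, so it displays the same regardless of which encoding is declared or assumed.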
Using extended characters without declaring character encoding is the most frequent reason for mojibake. Other reasons for mojibake include:
- Mismatch between encoding declaration and encoding actually used in the document. It has been noted on several web pages that UTF-8 is declared but the text is actually encoded in ISO-8859-1 or Windows-1252. It also happens in case of webpages in other languages (for example declared encoding is ISO-8859-2 while actually used encoding is Windows-1250).
- A particularly frequent error is declaring ISO-8859-1 while actually using Windows-1252. Contrary to beliefs that appear to be very common among Westerners, these two encodings are not identical: Windows-1252 uses the range 0x80–0x9F (128–159 in decimal) for printable characters, while in ISO-8859-1 these positions are used for control characters. This error was introduced by Microsoft a long time ago and seems to persist like an untreated VD. While MS Internet Explorer and some other browsers for Windows seem to be immune to this error and display these characters correctly (provided the web page is sent as text/html), this is by no means a standard, and non-Windows browsers in particular are likely to display squares or question marks instead of the intended characters. A similar problem results from confusing ISO-8859-1 or Windows-1252 with US-ASCII, but this is less common on the web. Other ISO-8859 encodings differ even more from the respective Windows-125x encodings, so, for example, declaring ISO-8859-2 while actually using Windows-1250 will make large parts of the text unreadable. For encodings other than ISO-8859-1 and Windows-1252, even Windows browsers, including Internet Explorer, fail to display any characters from the Windows-125x range 128–159 when an ISO encoding is declared.
- Some software for authoring web pages, even expensive commercial software, does not support encodings. Using this sort of software requires special care. Web pages created with it will be encoded in the system encoding of the computer they were created on. If it is run under Western Windows, the page will be encoded in Windows-1252 (not ISO-8859-1!); under an older Western Macintosh, in MacRoman; under Western Mac OS X or Linux, most probably in ISO-8859-1. Even if the editor displays the page correctly, don’t assume that it will be displayed correctly on everybody’s computer. This is especially true if you go through the authoring program’s menus and cannot find any setting for encoding.
- Mojibake will also result when extended characters are correctly encoded and declared, but the server sends an incorrect declaration. For example, a typical installation of Apache is configured to send out an ISO-8859-1 declaration for all HTML documents, which may or may not be correct. Correcting this will usually require the administrator’s intervention. See also Setting the HTTP charset parameter.
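The difference between ISO-8859-1 and Windows-1252 in the range 0x80–0x9F can be checked with a few lines of Python. This is a small sketch; the sample bytes and the helper name `c1_controls` are made up for the illustration.

```python
# Bytes as an editor using Windows-1252 would save the text: “quoted” – €
RAW = b"\x93quoted\x94 \x96 \x80"

as_cp1252 = RAW.decode("cp1252")      # the encoding actually used
as_latin1 = RAW.decode("iso-8859-1")  # the encoding wrongly declared


def c1_controls(text: str) -> list[str]:
    """List the characters of text that fall in the control range 0x80-0x9F."""
    return [f"U+{ord(c):04X}" for c in text if 0x80 <= ord(c) <= 0x9F]


if __name__ == "__main__":
    print(as_cp1252)               # “quoted” – €
    print(c1_controls(as_cp1252))  # []
    print(c1_controls(as_latin1))  # ['U+0093', 'U+0094', 'U+0096', 'U+0080']
```

Read as ISO-8859-1, the quotes, the dash and the euro sign all turn into invisible control characters, which is what a strict browser then shows as squares or question marks.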
Examples of NCRs, entities and escaped sequences
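|character||NCR (dec)||NCR (hex)||named entity||incorrect (HTML)||JS: correct||JS: incorrect||CSS|
|–||&#8211;||&#x2013;||&ndash;||&#150; &#x96; &endash;||\u2013||\x96 \u0096||\002013|
|—||&#8212;||&#x2014;||&mdash;||&#151; &#x97; &emdash;||\u2014||\x97 \u0097||\002014|
|“||&#8220;||&#x201C;||&ldquo;||&#147; &#x93;||\u201C||\x93 \u0093||\00201C|
|”||&#8221;||&#x201D;||&rdquo;||&#148; &#x94;||\u201D||\x94 \u0094||\00201D|
|‘||&#8216;||&#x2018;||&lsquo;||&#145; &#x91;||\u2018||\x91 \u0091||\002018|
|’||&#8217;||&#x2019;||&rsquo;||&#146; &#x92;||\u2019||\x92 \u0092||\002019|
|…||&#8230;||&#x2026;||&hellip;||&#133; &#x85;||\u2026||\x85 \u0085||\002026|
|‰||&#8240;||&#x2030;||&permil;||&#137; &#x89;||\u2030||\x89 \u0089||\002030|
|•||&#8226;||&#x2022;||&bull;||&#149; &#x95;||\u2022||\x95 \u0095||\002022|
|‹||&#8249;||&#x2039;||&lsaquo;||&#139; &#x8B;||\u2039||\x8B \u008B||\002039|
|›||&#8250;||&#x203A;||&rsaquo;||&#155; &#x9B;||\u203A||\x9B \u009B||\00203A|
|†||&#8224;||&#x2020;||&dagger;||&#134; &#x86;||\u2020||\x86 \u0086||\002020|
|‡||&#8225;||&#x2021;||&Dagger;||&#135; &#x87;||\u2021||\x87 \u0087||\002021|
|™||&#8482;||&#x2122;||&trade;||&#153; &#x99;||\u2122||\x99 \u0099||\002122|
|€||&#8364;||&#x20AC;||&euro;||&#128; &#x80;||\u20AC||\x80 \u0080||\0020AC|
|"||&#34;||&#x22;||&quot;||—||\" \u0022 \x22||—||\" \000022|
|'||&#39;||&#x27;||(&apos;) ¹)||(&apos;) ¹)||\' \u0027 \x27||—||\' \000027|
|&||&#38;||&#x26;||&amp;||—||— ²)||—||— ²)|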
¹) The named entity “&apos;” exists only in XHTML; it is not supported by Internet Explorer (which does not support XHTML). It must not be used in HTML 4 documents.
²) The ampersand does not require escaping in JS or CSS.
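The incorrect numeric references in the table are simply Windows-1252 byte values (128–159) used in place of Unicode positions. Assuming that is how they got into a page, they can be rewritten mechanically. The Python sketch below shows one way to do it; the function name `fix_cp1252_ncrs` is hypothetical.

```python
import re

# Matches decimal (&#150;) and hexadecimal (&#x96;) numeric character references.
_NCR = re.compile(r"&#([xX][0-9A-Fa-f]+|[0-9]+);")


def fix_cp1252_ncrs(markup: str) -> str:
    """Rewrite NCRs in the range 128-159 as the intended Unicode NCRs."""

    def replace(match: re.Match) -> str:
        body = match.group(1)
        number = int(body[1:], 16) if body[0] in "xX" else int(body)
        if not 0x80 <= number <= 0x9F:
            return match.group(0)          # already a proper Unicode position
        try:
            ch = bytes([number]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)          # position unused in Windows-1252
        return f"&#{ord(ch)};"

    return _NCR.sub(replace, markup)


if __name__ == "__main__":
    print(fix_cp1252_ncrs("1990&#150;1995"))        # 1990&#8211;1995
    print(fix_cp1252_ncrs("&#x93;quoted&#x94;"))    # &#8220;quoted&#8221;
```

References outside the range 128–159, and the few positions that Windows-1252 leaves unused, are passed through unchanged.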
Software with encoding support
The list is by no means exhaustive and includes only open source and free software. Unless indicated otherwise, all these programs will save extended characters as NCRs or named entities if these characters cannot be encoded in selected encoding. This includes characters present in Windows-1252 and absent from ISO-8859-1, when ISO-8859-1 is selected as the encoding for the file being saved.
- Nvu and Kompozer – multiplatform: for Windows, Linux and MacOS X, Open Source. Originally by Mozilla Foundation. Supports numerous encodings (ISO, Windows and UTF-8), and NCRs and named entities. Nvu is available in a number of language versions; for Kompozer a number of language packs are available.
- Amaya – multiplatform, powerful editor (HTML, XHTML, CSS, MathML, partial support for SVG), Open Source. By W3C. Supports numerous encodings (ISO, Windows and UTF-8 but will save only in the same encoding, ISO-8859-1, UTF-8 or US-ASCII), and NCRs and named entities. Interface in English only.
- Open Office – multiplatform: for Windows, Linux and MacOS X, LGPL. A very good office suite, OpenOffice Writer can export to HTML. Supports numerous encodings (ISO, Windows and UTF-8), and NCRs and named entities. Numerous language versions available.
See also: Alan Wood’s Unicode and multilingual support in HTML, fonts, Web browsers and other applications, which includes a much more comprehensive list of software.