Correct use of accented and special characters on web pages
or how to avoid mojibake

Introduction and explanations

All printable characters can be divided into two groups: ASCII and extended. ASCII characters are those that can be entered using the US (QWERTY) keyboard. They consist of 26 lowercase and 26 uppercase unaccented Latin letters (a–z, A–Z), 10 digits (0–9), a small selection of punctuation marks and brackets, and 16 special characters (~ @ # $ % ^ & * _ + = < > / \ |). These characters will display correctly in all circumstances. Extended characters are all other characters. There are far too many of them to list on this page (more than 100,000, and the number grows with most new releases of the Unicode standard). They include accented Latin letters; Cyrillic, Greek, Arabic, Hebrew and Korean letters; letters used in the languages of the Indian subcontinent; traditional Chinese, simplified Chinese and Japanese ideograms; and numerous punctuation marks and special characters. When using extended characters, simple precautions must be taken so that they display correctly irrespective of the reader’s language settings, default encoding and operating system.

For historical reasons, extended characters are organised in numerous small sets called code pages. A code page is an array with 256 positions, of which the first 128 are identical in all code pages used on the internet (one notable exception is the family of EBCDIC code pages used on IBM mainframes; there are also some historical code pages that are now obsolete). These first 128 characters are the ASCII characters. This leaves up to 128 positions for other characters. Because there are many more than 128 national and special characters, these characters were grouped into separate code pages. There are dozens of code pages: 15 ISO 8859 code pages (numbered 1 to 16; there is no ISO 8859-12), 9 Windows-125x code pages (Windows-1250 to Windows-1258), which contrary to popular belief and Microsoft propaganda are not the same as the ISO code pages, a few code pages used for Far Eastern languages, numerous Mac and DOS code pages (now mostly extinct and unlikely to be encountered on the internet) and finally Unicode, represented by a number of transformation formats, the most popular of which is UTF-8. In many circumstances the reader’s software must receive information about the code page used; otherwise correct display is accidental and depends on certain settings being identical to those of the system supplying the content (the web page author’s system or the server). This information may be transmitted in a few ways. Another method is to encode the extended characters in a special form. Please remember that English is a very international language, and you should never assume that all readers have English settings just because they want to read a web page that is in English!

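
The code page idea can be illustrated with a short Python sketch. It is only an illustration: the helper name decode_byte and the choice of three code pages are assumptions made for this example, and the codec names are those used by Python's standard library.

```python
# Decode the same byte under several code pages (codec names as known to Python).
CODE_PAGES = ["iso-8859-1", "windows-1251", "iso-8859-7"]

def decode_byte(value):
    """Return the character that byte `value` maps to in each code page."""
    return {cp: bytes([value]).decode(cp) for cp in CODE_PAGES}

# The first 128 positions (ASCII) decode identically in all three code pages.
ascii_bytes = bytes(range(128))
ascii_views = {ascii_bytes.decode(cp) for cp in CODE_PAGES}
print(len(ascii_views))    # 1

# Above position 127 the same byte means different things.
print(decode_byte(0xE9))   # {'iso-8859-1': 'é', 'windows-1251': 'й', 'iso-8859-7': 'ι'}

# Windows-1252 is not ISO 8859-1: byte 0x80 is the euro sign in the former
# and a control character in the latter.
print(repr(bytes([0x80]).decode("windows-1252")))   # '€'
print(repr(bytes([0x80]).decode("iso-8859-1")))     # '\x80'
```

Without knowing which code page was used, software reading byte 0xE9 has no way to choose between these three letters.
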
Sometimes authors may not even be aware that they have used an extended character, as most word processors have an autocorrection function that, for example, replaces a hyphen or a series of up to three hyphens (ASCII characters) with an en dash or em dash (extended characters), straight typewriter quotes ("ASCII characters") with typographically correct curly quotes (“extended characters”), (c)/(r)/(tm) (which use only ASCII characters) with ©/®/™ (extended characters), and so on. If the author uses software that is not aware of encodings and does not take certain (simple) precautions, the page may (and usually will) display incorrectly on computers that use a different operating system, different regional settings or a different web browser.
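
One way to spot extended characters that slipped in unnoticed is to scan the text for anything outside the ASCII range. The following Python sketch does this; the function name find_extended and the sample string are made up for the illustration.

```python
import unicodedata

def find_extended(text):
    """Return (position, character, code point, Unicode name) for every non-ASCII character."""
    return [
        (i, ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

# A plain hyphen stays ASCII; the em dash, curly quotes and copyright sign
# (written here as escapes) are the kind of characters autocorrection inserts.
sample = "Plain - text, then autocorrected \u2014 text with \u201cquotes\u201d \u00a9"
for hit in find_extended(sample):
    print(hit)
```

For the sample above, the scan reports the em dash, both curly quotes and the copyright sign, and ignores the plain hyphen.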

Such a mix-up of characters (for example, displaying a Russian text as if it were a string of Western European characters) results in gibberish and is colloquially called “mojibake”. While it is usually possible to change certain settings (such as manually selecting another encoding) to display the web page correctly, less advanced users will not be able to recognise the problem and may be prevented from accessing any information on the page.
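
Mojibake is easy to reproduce. The Python sketch below encodes text in one encoding and decodes the resulting bytes as another; the function name misread and the sample words are assumptions chosen for the illustration.

```python
def misread(text, actual, assumed):
    """Encode `text` in the `actual` encoding, then decode the bytes as `assumed`."""
    return text.encode(actual).decode(assumed, errors="replace")

# Russian text saved as Windows-1251 but displayed as ISO 8859-1.
garbled = misread("Привет", "windows-1251", "iso-8859-1")
print(garbled)                                          # Ïðèâåò

# UTF-8 text displayed as Windows-1252.
print(misread("café", "utf-8", "windows-1252"))         # cafÃ©

# Selecting the right encoding by hand recovers the original text.
print(misread(garbled, "iso-8859-1", "windows-1251"))   # Привет
```

The last call corresponds to a reader manually switching the browser to the correct encoding.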

Declaration of encoding

The information about the code page used may be transmitted in three ways, listed in order of priority:

  1. Transmitted by the server. This is the most correct way. Unfortunately, authors of web pages frequently have no access to the server configuration files.
  2. Encoding declaration in the file itself. Both HTML and XHTML files can include information about the encoding used. If an XHTML file is sent with the MIME type text/html, it should include the declaration for HTML rather than for XHTML. It is also possible to declare the character set used in a CSS stylesheet. It is strongly recommended to use this type of declaration even when the server supplies an encoding declaration, so that pages saved by visitors to their hard drives will still display correctly. The encoding of external JS and CSS files will also be inferred from the BOM (byte order mark), if present in the file.
  3. Encoding declaration in the link code. This declaration informs the browser about the encoding of the file that the link points to. Declarations in links to other files (<a href="file" charset="utf-8">) are ignored by most browsers (and there are plans to drop them from the next version of the HTML and XHTML standards); however, it seems that character encoding declarations in other types of links (<link rel="stylesheet" href="style.css" charset="utf-8">, <script src="script.js" charset="utf-8">) are well supported.
  4. Default encoding:
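
As an illustration of the priority order in the list above, here is a minimal Python sketch. It is not a description of how any particular browser works. The function names and parameters are hypothetical, and the fallback value is deliberately left to the caller; the charset is read from the server's Content-Type header with the standard library's email.message module.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

def pick_encoding(server_header, in_file, in_link, fallback):
    """Return the first available declaration: server, then file, then link, then fallback."""
    server = charset_from_content_type(server_header) if server_header else None
    for candidate in (server, in_file, in_link):
        if candidate:
            return candidate.lower()
    return fallback

# The server declaration wins over a conflicting declaration in the file.
print(pick_encoding("text/html; charset=UTF-8", "iso-8859-2", None, "fallback"))  # utf-8
# With no charset from the server, the declaration in the file is used.
print(pick_encoding("text/html", "ISO-8859-2", None, "fallback"))                 # iso-8859-2
```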

For XHTML, the encoding declaration looks like this: <?xml version="1.0" encoding="utf-8"?> (another encoding can be used in place of utf-8). This line must be the first line in the file.

For HTML (including XHTML 1.0 sent as text/html), the encoding declaration looks like this: <meta http-equiv="content-type" content="text/html; charset=utf-8"> (another encoding can be used in place of utf-8). This line must be placed in the head section (between <head> and </head>), preferably as the first line of that section.

For CSS, the declaration looks like this: @charset "utf-8"; (another encoding can be used in place of utf-8). This line must be the first line in the file.
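
To check which encoding a file declares, the three declaration forms above can be matched with regular expressions. The Python sketch below is a simplification under stated assumptions: the names PATTERNS and declared_encoding are invented for the example, and the patterns only cover the exact forms shown above, not everything a real parser accepts.

```python
import re

# One pattern per declaration form; "^" enforces "first line in the file".
PATTERNS = {
    "xhtml": re.compile(r'^<\?xml[^>]*\bencoding=["\']([\w.:-]+)["\']', re.I),
    "html": re.compile(r'<meta[^>]+charset=["\']?([\w.:-]+)', re.I),
    "css": re.compile(r'^@charset "([\w.:-]+)";'),
}

def declared_encoding(source, kind):
    """Return the declared encoding in lowercase, or None if no declaration is found."""
    match = PATTERNS[kind].search(source)
    return match.group(1).lower() if match else None

print(declared_encoding('<?xml version="1.0" encoding="utf-8"?>\n<html/>', "xhtml"))
print(declared_encoding(
    '<head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head>',
    "html"))
print(declared_encoding('@charset "utf-8";\nbody { margin: 0 }', "css"))
```

All three calls print utf-8. A stylesheet in which @charset is not the very first thing in the file yields None, mirroring the placement rule given above.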

See also: W3C I18N Tutorial: Character sets & encodings in XHTML, HTML and CSS.

Encoding extended characters

Extended characters may be encoded in a special way. When all extended characters are encoded in this way, there is no need to declare the encoding: the encoded characters will be displayed correctly irrespective of any encoding declaration, whether sent by the server or declared in the document.
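
As a small illustration, assuming the special encoding meant here is the numeric character reference (NCR) form used for (X)HTML, Python can produce it with the xmlcharrefreplace error handler. The function name to_ncr is made up for this sketch.

```python
import html

def to_ncr(text):
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

encoded = to_ncr("café – ©")
print(encoded)                  # caf&#233; &#8211; &#169;

# The result is pure ASCII, so it survives any code page; decoding the
# references gives back the original text.
print(encoded.isascii())        # True
print(html.unescape(encoded))   # café – ©
```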

Common errors

Using extended characters without declaring character encoding is the most frequent reason for mojibake. Other reasons for mojibake include:

Examples of NCRs, entities and escaped sequences

Character | NCR (dec) | NCR (hex) | Named entity | (X)HTML incorrect       | JavaScript correct | JavaScript incorrect | CSS
–         | &#8211;   | &#x2013;  | &ndash;      | &#150; &#x96; &endash;  | \u2013             | \x96 \u0096          | \002013
—         | &#8212;   | &#x2014;  | &mdash;      | &#151; &#x97; &emdash;  | \u2014             | \x97 \u0097          | \002014
“         | &#8220;   | &#x201C;  | &ldquo;      | &#147; &#x93;           | \u201C             | \x93 \u0093          | \00201C
”         | &#8221;   | &#x201D;  | &rdquo;      | &#148; &#x94;           | \u201D             | \x94 \u0094          | \00201D
‘         | &#8216;   | &#x2018;  | &lsquo;      | &#145; &#x91;           | \u2018             | \x91 \u0091          | \002018
’         | &#8217;   | &#x2019;  | &rsquo;      | &#146; &#x92;           | \u2019             | \x92 \u0092          | \002019
…         | &#8230;   | &#x2026;  | &hellip;     | &#133; &#x85;           | \u2026             | \x85 \u0085          | \002026
‰         | &#8240;   | &#x2030;  | &permil;     | &#137; &#x89;           | \u2030             | \x89 \u0089          | \002030
•         | &#8226;   | &#x2022;  | &bull;       | &#149; &#x95;           | \u2022             | \x95 \u0095          | \002022
‹         | &#8249;   | &#x2039;  | &lsaquo;     | &#139; &#x8B;           | \u2039             | \x8B \u008B          | \002039
›         | &#8250;   | &#x203A;  | &rsaquo;     | &#155; &#x9B;           | \u203A             | \x9B \u009B          | \00203A
†         | &#8224;   | &#x2020;  | &dagger;     | &#134; &#x86;           | \u2020             | \x86 \u0086          | \002020
‡         | &#8225;   | &#x2021;  | &Dagger;     | &#135; &#x87;           | \u2021             | \x87 \u0087          | \002021
©         | &#169;    | &#xA9;    | &copy;       |                         | \u00A9 \xA9        |                      | \0000A9
®         | &#174;    | &#xAE;    | &reg;        |                         | \u00AE \xAE        |                      | \0000AE
™         | &#8482;   | &#x2122;  | &trade;      | &#153; &#x99;           | \u2122             | \x99 \u0099          | \002122
€         | &#8364;   | &#x20AC;  | &euro;       | &#128; &#x80;           | \u20AC             | \x80 \u0080          | \0020AC
¢         | &#162;    | &#xA2;    | &cent;       |                         | \u00A2 \xA2        |                      | \0000A2
£         | &#163;    | &#xA3;    | &pound;      |                         | \u00A3 \xA3        |                      | \0000A3
<         | &#60;     | &#x3C;    | &lt;         |                         | \u003C \x3C        |                      | \00003C
>         | &#62;     | &#x3E;    | &gt;         |                         | \u003E \x3E        |                      | \00003E
«         | &#171;    | &#xAB;    | &laquo;      |                         | \u00AB \xAB        |                      | \0000AB
»         | &#187;    | &#xBB;    | &raquo;      |                         | \u00BB \xBB        |                      | \0000BB
"         | &#34;     | &#x22;    | &quot;       |                         | \" \u0022 \x22     |                      | \" \000022
'         | &#39;     | &#x27;    | (&apos;) ¹)  | (&apos;) ¹)             | \' \u0027 \x27     |                      | \' \000027
&         | &#38;     | &#x26;    | &amp;        |                         | ²)                 |                      | ²)

_________
¹) The named entity “&apos;” exists only in XHTML and is not supported by Internet Explorer (which does not support XHTML). It must not be used in HTML 4 documents.

²) ampersand does not require escaping in JS or CSS.
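
The correct forms in the table follow mechanically from a character's Unicode code point. The Python sketch below derives them; the function name escapes and the dictionary keys are invented for the illustration, and named entities are looked up in the HTML 4 entity list shipped with Python's standard library.

```python
from html.entities import codepoint2name

def escapes(ch):
    """Return the (X)HTML, JavaScript and CSS escapes for a single character."""
    cp = ord(ch)
    name = codepoint2name.get(cp)   # HTML 4 named entities only
    return {
        "ncr_dec": f"&#{cp};",
        "ncr_hex": f"&#x{cp:X};",
        "named": f"&{name};" if name else None,
        # \uXXXX covers code points up to U+FFFF; beyond that, ES2015 \u{...} is used.
        "js": f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\u{{{cp:X}}}",
        "css": f"\\{cp:06X}",
    }

print(escapes("\u2013"))
# {'ncr_dec': '&#8211;', 'ncr_hex': '&#x2013;', 'named': '&ndash;', 'js': '\\u2013', 'css': '\\002013'}

# 0x96 is the position of the en dash in Windows-1252, not its Unicode code point,
# which is why &#150; and \u0096 appear in the "incorrect" columns.
print(bytes([0x96]).decode("windows-1252") == "\u2013")   # True
```

For the apostrophe, the lookup returns no named entity, which agrees with footnote ¹) about HTML 4.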

Software with encoding support

The list is by no means exhaustive and includes only open source and free software. Unless indicated otherwise, all these programs will save extended characters as NCRs or named entities if these characters cannot be encoded in the selected encoding. This includes characters that are present in Windows-1252 but absent from ISO-8859-1 when ISO-8859-1 is selected as the encoding for the file being saved.
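
The saving behaviour described above can be imitated in Python. This sketch only mimics the effect and says nothing about how any particular program implements it; the function name save_as and the sample text are assumptions.

```python
def save_as(text, encoding):
    """Encode text for saving; characters missing from `encoding` become NCRs."""
    return text.encode(encoding, errors="xmlcharrefreplace")

text = "Price: 5 € (café)"
# The euro sign is absent from ISO-8859-1, so it is written as an NCR.
print(save_as(text, "iso-8859-1"))     # b'Price: 5 &#8364; (caf\xe9)'
# Windows-1252 has the euro sign at position 0x80, so no NCR is needed.
print(save_as(text, "windows-1252"))   # b'Price: 5 \x80 (caf\xe9)'
```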

Allan Wood’s Unicode and multilingual support in HTML, fonts, Web browsers and other applications – includes a much more comprehensive list of software.
