Quick Introduction to HTML
Internationalization

Introduction

Most text on the World Wide Web is in English, because at this time, that is the most widely understood language. However, there are many reasons you may wish to insert text from some other language in an English web page, or create a web page completely in some language other than English.

The main internationalization issue for the Web is character encoding, which is the way the data in the document is interpreted as, say, letters in an alphabet. An encoding is necessary because, without further information, a computer file is just a sequence of bits and bytes. One must specify further that a file is, say, ASCII text or Russian text in order to interpret these bits and bytes.
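The point that bytes mean nothing without an encoding can be checked outside the browser. The following is a minimal Python sketch (Python is not part of the original tutorial, and the three byte values are chosen purely for illustration) that reads the same bytes under three of the ISO encodings discussed later:

```python
# The same three bytes, interpreted under three different encodings.
# The byte values are arbitrary illustrative choices.
raw = bytes([0xC1, 0xE2, 0xE5])

as_latin1 = raw.decode("iso-8859-1")    # Western European
as_greek = raw.decode("iso-8859-7")     # Greek
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic

for label, text in [("Latin-1", as_latin1), ("Greek", as_greek), ("Cyrillic", as_cyrillic)]:
    print(label, text)
```

Each line of output shows different letters, although the underlying file contents never changed.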

Another unfortunate historical fact is that some common punctuation marks in English (curly quotes and dashes, particularly) are encoded differently on different computer systems. If you use these, you should indicate the character encoding in your document. Otherwise, on someone else’s machine they may appear as garbage.

It is also important to indicate in the web page what language the text is meant to represent, and if the text is a mixture of languages, which text is in one language and which is in another.

Character entities

If you just want to include in your document a single non-English character, such as an accented vowel, often the easiest way is to use HTML Character Entities. These include most of the letters from the Western European alphabets. Their use side-steps the issue of encodings.

A good use of this is to insert non-English names in an English document:

HTML code           Rendered
Garc&iacute;a       García
Eug&egrave;ne       Eugène
Sch&ouml;nberg      Schönberg
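If you want to check what an entity name stands for without opening a browser, Python's standard html module knows the same names. This is only a sketch for checking entities; the variable names are illustrative.

```python
import html
from html.entities import name2codepoint

# Entity-encoded names, as they would be typed into the HTML source.
names = ["Garc&iacute;a", "Eug&egrave;ne", "Sch&ouml;nberg"]
decoded = [html.unescape(n) for n in names]
print(decoded)

# Each entity name is simply a label for one Unicode code point.
iacute = chr(name2codepoint["iacute"])
print(iacute)
```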

It is also possible, but not recommended, to insert Greek this way:

HTML code                                             Rendered
&Alpha;&pi;&omicron;&lambda;&lambda;&omicron;         Απολλο

The last example shows one of the weaknesses of this approach—the HTML is big and ugly and looks nothing like the finished product. If you want to write a whole document in an alphabet other than that of English, this is not the way to go. We will discuss better alternatives below.

The support for Greek in the HTML Character Entities is accidental. The Greek entities were meant for writing math formulas, not for displaying Greek text. For the same reason, there is an HTML entity for one Hebrew letter (Aleph). But there are no entities for Russian, Arabic, or Chinese.

It is also possible to specify a single character by its numerical Unicode encoding. With this, one can in principle specify any character from any of the world’s writing systems:

HTML code      Rendered
&#1588;        ش
&#3602;        ฒ
&#20004;       两
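A numeric reference is just the character's Unicode code point written between &# and a semicolon, so it can be computed rather than looked up in a table. The sketch below assumes the decimal form; the helper name numeric_reference is made up for this example.

```python
import html

def numeric_reference(ch):
    """Return the decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"

# Arabic sheen, Thai tho phuthao, and a Chinese character, given as escapes.
characters = ["\u0634", "\u0e12", "\u4e24"]
refs = [numeric_reference(ch) for ch in characters]
print(refs)

# The hexadecimal form (&#x...;) names the same code point.
same_char = html.unescape("&#x4E24;") == html.unescape("&#20004;")
print(same_char)
```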

Of course, the HTML code in this approach is even less readable than with the HTML Character Entities. One has to have a Unicode table to read it at all.

Furthermore, since HTML Entities were really meant for display of single characters, the browser may have a hard time displaying words in languages such as Arabic, where the shape of one letter depends on the surrounding letters. See below for a solution to this problem.

Character encodings

For historical and technical reasons, there are numerous character encodings, some to accommodate different languages, some invented by different companies.

If you produce your document in one encoding, and your user’s browser interprets it as another, it is likely to appear as gibberish, or at the very least to contain some misinterpreted characters.
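The mismatch can be reproduced in a few lines of Python. This is a sketch of the general effect, not of what any particular browser does; the sample word is an arbitrary choice.

```python
text = "café"

# Written as UTF-8 but read as Latin-1: the two bytes of "é" become two characters.
misread = text.encode("utf-8").decode("iso-8859-1")

# Written as Latin-1 but read as UTF-8: the lone byte 0xE9 is not valid UTF-8,
# so it is replaced with U+FFFD, the replacement character.
damaged = text.encode("iso-8859-1").decode("utf-8", errors="replace")

print(misread)
print(damaged)
```

Note that the plain ASCII letters survive in both directions; only the accented character is damaged. This is why a page with a wrong encoding often looks almost right.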

Some important encodings include:

The Unicode (UTF) encodings represent the ultimate solution to the problem of encoding all human written languages. However, to date, not all of Unicode is completely implemented on any browser, or any platform. Whole languages are missing, fonts aren’t complete, and some platforms don’t support it at all.

The International Organization for Standardization (ISO) series of encodings, iso-8859, represents a fairly safe intermediate step to Unicode. For Western European languages, iso-8859-15 (Latin-9) does almost everything. The other iso-8859 encodings provide support for mixtures of Western and Eastern European languages, and a few other alphabetic writing systems, such as Arabic, Hebrew, and Thai.

Before the iso-8859 encodings, and before Unicode, many other character encodings were invented by different groups for special purposes. There are several encodings for Chinese and Japanese, and several that permit Russian, Arabic, Hebrew, Thai, etc. to be mixed with Western languages.

Specify an encoding

The de facto default encoding on the Web is Latin-1 (iso-8859-1). That means, if the user’s browser gets no other information about a page’s encoding, it will assume Latin-1 (most browsers also allow the user to alter this default, but to do so is not usually a good idea).

Unless you are very sure your document will end up as Latin-1, you should always inform your user’s browser as to which encoding you intend. Since different HTML files may have different encodings, it is best for this information to be contained in the document.

The HTML way is to place a special meta tag in the head section of the document. For example, for mixed Greek and English text, the encoding iso-8859-7 is a good choice.

<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-7" />
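The meta tag can do its job because the declaration itself is plain ASCII, which reads the same in all the encodings mentioned here. A program can therefore find it before knowing the encoding of the rest of the file. The following Python sketch shows the idea; it is a deliberate simplification (real browsers follow a more elaborate algorithm), and the function name, the 1024-byte limit, and the Latin-1 fallback are assumptions made for this example.

```python
import re

# Simplified pattern: real browsers follow a more elaborate prescan algorithm.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def sniff_meta_charset(data, default="iso-8859-1"):
    """Guess the declared charset from the first 1024 bytes of an HTML file.

    Hypothetical helper for illustration; not a browser or library API.
    """
    match = CHARSET_RE.search(data[:1024])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()

page = b'<head><meta http-equiv="Content-Type" content="text/html;charset=iso-8859-7" /></head>'
declared = sniff_meta_charset(page)
fallback = sniff_meta_charset(b"<head><title>No declaration</title></head>")
print(declared)
print(fallback)
```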

If your document is XHTML rather than HTML, it is best to put the information in the XML declaration on the first line of the document:

<?xml version="1.0" encoding="iso-8859-15"?>

For best results in XHTML, use both the XML declaration and the meta tag.

Specify language

It is considered polite to specify which languages are being used in a document. Some search engines can filter documents based on this information. For example, with the following HTML meta tag in the head section of the document, we can explain that the document contains U.S. English (en-US) and Greek (el):

<meta http-equiv="Content-Language" content="en-US,el" />

It is very useful to specify the language you mean a particular piece of text to represent. You can do this for the text in any HTML element by setting the value of its lang attribute to the code for that language. For example:

Our friend Miguel says "<span lang="es">¡Hola!</span>"

is rendered as

Our friend Miguel says "¡Hola!"

Reasons to specify the language include:

If you mean the whole HTML document to be in a specific language, you can provide that information in the lang attribute of the html tag for the document.
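To see how a program such as a search engine or spell checker might pick up this information, here is a small Python sketch that lists every lang attribute in a page. The class name LangCollector and the sample page are invented for this example.

```python
from html.parser import HTMLParser

class LangCollector(HTMLParser):
    """Record (tag, language) for every element that carries a lang attribute."""

    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "lang":
                self.langs.append((tag, value))

# Sample page: English overall, with one Greek phrase marked up separately.
page = (
    '<html lang="en-US"><body>'
    '<p>Good morning is <span lang="el">\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1</span>.</p>'
    '</body></html>'
)

collector = LangCollector()
collector.feed(page)
print(collector.langs)
```

The html element supplies the language of the whole document, and the span overrides it for just the text it encloses.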

Note that XHTML changed the way language is specified.

In HTML5, the correct language attribute is lang. With XHTML 1.0, you can use both lang and xml:lang, but in XHTML 1.1, only xml:lang is acceptable.

Encodings and servers

If you don’t specify a character encoding for your document, your web server will. The usual default encoding is iso-8859-1 (Latin-1), although many servers are now switching to UTF-8 (Unicode).

The web server at a site can also specify a default encoding. This is important if most of the files at the site are written in a non-Latin alphabet.
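The server announces its choice in the HTTP Content-Type header that accompanies each page, for example Content-Type: text/html; charset=UTF-8. The Python sketch below shows one way to read the charset out of such a header value. The helper name is made up, and the Latin-1 fallback simply mirrors the default described above.

```python
from email.message import Message

def charset_from_header(content_type, default="iso-8859-1"):
    """Extract the charset parameter from a Content-Type header value.

    Hypothetical helper; the default is an assumption for this sketch.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default

with_charset = charset_from_header("text/html; charset=UTF-8")
without_charset = charset_from_header("text/html")
print(with_charset)
print(without_charset)
```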

Fonts and encodings

A computer font is a collection of glyphs, which are the graphical representations of characters in an encoding. It is this glyph that is displayed in the user’s browser window.

For the browser to display the glyph of a character in a certain encoding, a font that includes that glyph must be installed on the user’s system. Nowadays, most computer systems come with at least one fairly complete Unicode font, so for most languages there is a good chance that your viewer will see the correct glyphs.

Note that it is not the business of your web page to specify the font for a language encoding. The user’s browser is meant to find a font on their system that best represents the encoding of the text.

The correct behaviour of a web browser is to identify and use a font containing the requested glyph, regardless of the font currently being used for text display. For example, if a run of text is being displayed in one font, and a character that is not supported by the font appears in the text, the browser should locate another font which does support that character, and use that font to display the character. Only when no font can be found that supports the character should the browser insert a placeholder glyph in place of the unsupported character.
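The fallback rule just described can be sketched in a few lines of Python. Everything in this example is hypothetical: the font names, their character coverage, and the function names are invented to illustrate the rule, and real browsers consult the fonts actually installed on the system.

```python
# Hypothetical fonts: each made-up name maps to the characters it covers.
FONTS = {
    "BodySerif": set("abcdefghijklmnopqrstuvwxyz "),
    "GreekSans": set("\u03b1\u03b2\u03b3\u03b4\u03b5\u03bb\u03bf\u03c0"),
}
PLACEHOLDER = "\ufffd"  # shown only when no font at all has the glyph

def pick_font(ch, current, fonts=FONTS):
    """Prefer the current font; otherwise use any font that covers ch."""
    if ch in fonts[current]:
        return current
    for name, coverage in fonts.items():
        if ch in coverage:
            return name
    return None

def render_plan(text, current="BodySerif"):
    """Return (character, font) pairs, with a placeholder for unsupported characters."""
    plan = []
    for ch in text:
        font = pick_font(ch, current)
        plan.append((ch, font) if font else (PLACEHOLDER, current))
    return plan

print(render_plan("a \u03c0"))   # Latin letter, space, Greek pi
print(render_plan("\u05d0"))     # Hebrew alef: covered by neither font
```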

The writing systems of some languages, such as Cherokee and Myanmar, have appeared in only a few Unicode fonts. At this time, the only solution is to ask your users to install the proper font if they don’t see the characters displayed properly.

Curly quotes and dashes

Users of Microsoft Windows: beware! Many Windows programs for typing text automatically make quote marks and apostrophes “curly”, so

How's "this"?

becomes

How’s “this”?

This is all well and good, but unless the document’s character encoding is properly set, these characters will not appear correctly on other computer systems. Unfortunately, many Windows applications fail to set it (especially older ones, and including Microsoft Word).

A good encoding for documents in English (and other Western languages) made in Windows is “Code page 1252 - West European Latin”. The meta tag for this would be:

<meta http-equiv="Content-Type" content="text/html;charset=windows-1252" />
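The reason the declaration matters can be seen by looking at the bytes. In this Python sketch (the sample sentence is the one used above, written with escapes so the code is unambiguous), the curly marks are stored using windows-1252 and then read back as Latin-1:

```python
curly = "How\u2019s \u201cthis\u201d?"

# In windows-1252 the curly marks occupy bytes 0x92, 0x93 and 0x94.
encoded = curly.encode("windows-1252")
print(encoded)

# Read as Latin-1, those bytes become invisible control characters.
misread = encoded.decode("iso-8859-1")
print(repr(misread))

# Latin-1 itself has no curly quotes at all, so encoding to it fails.
try:
    curly.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)
```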

Other gotchas

Internet Explorer only recognizes the Unicode character encoding UTF-8 when it appears in all capitals in the meta tag:

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />

Also, while other browsers will recognize XHTML documents and use the XML declaration’s encoding

<?xml version="1.0" encoding="utf-8"?>

Internet Explorer 7 ignores the XML declaration if the server tells it that the document is HTML. At the time of this writing, this is almost always the case.

In summary, if you use Unicode encoding, and you want Internet Explorer users to be able to read your document, always include the meta tag above, and be careful to capitalize the UTF-8.