Encodings, web pages, and Linux

character encoding

A character encoding relates the bytes in a document to characters in a human language writing system. For historical reasons, there are many character encodings, mostly specialized for certain groups of languages.

Examples include the venerable ASCII, which is specialized for American English; the ISO-8859 series, which aims to cover large groups of alphabetic languages (always including English); VISCII, which is specialized for Vietnamese; and Unicode, the big whopper of encodings, which aims to cover all the world’s writing systems, and other character systems too.

ASCII: for info in Linux, type
man ascii
ISO-8859-*: for info in Linux, type
man iso-8859-15
Also see ISO 8859 Family of Character Sets.
*=1,15 are Western European writing systems
*=5 is Cyrillic (e.g. Russian)/Latin
*=7 is Greek/Latin
etc.
Unicode (UTF-8): the encoding to end all encodings. See the Unicode site, especially the Unicode code charts.
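
As a small illustration of how an encoding ties bytes to characters, here is a Python sketch that encodes one Greek word under two encodings and then decodes the ISO-8859-7 bytes with the wrong table. The sample word and the variable names are illustrative choices only.

```python
# Illustrative sketch: the same text becomes different bytes under
# different encodings, and the same bytes become different text when
# decoded with the wrong encoding.
word = "Ελλάδα"  # sample Greek word, chosen only for illustration

greek_bytes = word.encode("iso-8859-7")  # one byte per character
utf8_bytes = word.encode("utf-8")        # two bytes per Greek character

# Decoding with the matching table recovers the word
round_trip = greek_bytes.decode("iso-8859-7")

# Decoding the same bytes as ISO-8859-1 yields Western European letters
misread = greek_bytes.decode("iso-8859-1")

print(len(greek_bytes), len(utf8_bytes))
print(round_trip)
print(misread)
```

The misread string is the kind of garbage a reader sees when a page is saved in one encoding and displayed in another.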

encodings for Web pages

Browsers let the user specify an encoding other than what the web page or server has advertised. In Firefox, this is the menu View→Character Coding.
Put a meta tag in the head of the HTML file to indicate the encoding of the page. This will give your reader’s browser a clue as to how to interpret and display the bytes in the HTML file.

For Greek/Latin,
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-7">
For Unicode,
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
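
The meta tag only helps if it agrees with the bytes actually saved in the file. The Python sketch below shows one way to keep the two in step. build_page is a hypothetical helper invented for this illustration, not part of any library.

```python
def build_page(body_html: str, charset: str) -> bytes:
    """Return a minimal HTML page whose meta tag matches its byte encoding.

    build_page is a hypothetical helper for illustration only.
    """
    page = (
        "<html><head>\n"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">\n'
        "</head><body>\n"
        f"{body_html}\n"
        "</body></html>\n"
    )
    # Encode with the same charset that the meta tag advertises
    return page.encode(charset)


# The same markup saved as Greek/Latin and as UTF-8 (sample text is illustrative)
greek_page = build_page('<span lang="el">Ελλάδα</span>', "iso-8859-7")
unicode_page = build_page('<span lang="el">Ελλάδα</span>', "utf-8")
```

Because the charset name is passed once and used for both the tag and the encoding step, the declaration cannot drift away from the real encoding.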
Use the HTML lang attribute to indicate that text has changed languages: e.g.
<span lang="el">Χωριάτικος</span>
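The two-letter language codes are standardized by ISO 639. They are not the same as the ISO 3166 country codes that you recognize from Internet domain names: Greek the language is el, while Greece the country is gr.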
It goes without saying: don’t assume your reader has a particular font on their machine. So don’t format your page so it looks good in a particular, weird font with some screwy encoding, and fail to inform the reader of what encoding you’re using.

encodings in Linux

Terminals

Until recently, xterms could only handle 8-bit encodings (NOT Unicode), but modern terminals are Unicode enabled. (Even with this, rendering problems remain for complex scripts.)
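
To see which encoding a given terminal session advertises, a short Python check along these lines may help. It is only a sketch: it reports what Python infers from the environment, which is not a guarantee of what the terminal can actually render.

```python
import locale
import sys

# Encoding Python will use when printing to this terminal
# (may be None when output is redirected to an object without one)
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Encoding implied by the locale settings (LANG, LC_ALL, LC_CTYPE)
locale_encoding = locale.getpreferredencoding(False)

print("stdout:", stdout_encoding)
print("locale:", locale_encoding)
```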
Text

KDE’s KWord and Gnome’s GEdit both handle Unicode. However, their approaches are quite different: GEdit assumes everything is Unicode, while KWord has you specify the encoding both when opening and when saving the file.
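
The "specify the encoding when opening and when saving" approach can be sketched in Python as follows. transcode is a hypothetical helper, and the file names and sample text are made up for the demonstration.

```python
import os
import tempfile


def transcode(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Read text in one encoding and save it in another (hypothetical helper)."""
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding) as dst:
        dst.write(text)
    return text


# Demonstration with throwaway files
workdir = tempfile.mkdtemp()
legacy_path = os.path.join(workdir, "legacy.txt")
utf8_path = os.path.join(workdir, "utf8.txt")

# Create a small file in the Greek/Latin encoding
with open(legacy_path, "wb") as f:
    f.write("Ελλάδα".encode("iso-8859-7"))

text = transcode(legacy_path, utf8_path, "iso-8859-7")

with open(utf8_path, "rb") as f:
    converted = f.read()
```

Naming the encoding explicitly on both sides avoids depending on whatever default the system happens to use.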
E-mail

The Gnome e-mail client Balsa is Unicode-enabled. It is also somewhat Pine-friendly.

encodings and fonts

To display a character from a writing system, a font must contain the corresponding glyph, which is a graphical representation of the character.

Many fonts support part or most of the ISO-8859 series. A very few support most of Unicode. If you run
xfontsel
you can get an idea. Set “rgstry” to “iso10646” for Unicode, then look at the available “fndry”s and “fmly”s.

There are nice open-source efforts to produce Unicode fonts, especially DejaVu, based on Bitstream’s Vera family.

Microsoft used to distribute “Arial Unicode MS”, which was one of the best. Bitstream used to distribute “Cyberbit”.

Firefox’s Preferences lets you associate a font with an encoding. Several other programs give this kind of control.