A character encoding relates the bytes in a document to characters in a human language writing system. For historical reasons, there are many character encodings, mostly specialized for certain groups of languages.
Examples include the venerable ASCII, which is specialized for American English; the ISO-8859 series, which aims to cover large groups of alphabetic languages (always including English); VISCII, which is specialized for Vietnamese; and Unicode, the big whopper of encodings, which aims to cover all the world’s writing systems, and other character systems too.
ASCII: for info in Linux, type
man ascii
ISO-8859-*: for info in Linux, type
man iso-8859-15
Also see ISO 8859 Family of Character Sets
*=1,15 are Western European writing systems
*=5 is Russian/Latin
*=7 is Greek/Latin
etc.
Unicode (UTF-8): the encoding to end all encodings. See the Unicode site, especially the Unicode code charts.
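To make the differences between these encodings concrete, here is a minimal Python sketch that encodes the same short strings under several of them. The sample strings and the helper name `try_encode` are illustrative assumptions, not part of any standard.

```python
# Encode the same text under several encodings and compare the resulting bytes.
samples = {"English": "cafe", "Greek": "αβγ"}
encodings = ["ascii", "iso-8859-1", "iso-8859-7", "utf-8"]


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


for label, text in samples.items():
    for enc in encodings:
        # None means the encoding has no byte sequence for at least one character.
        print(label, enc, try_encode(text, enc))
```

Plain English text produces the same bytes in all four encodings, while the Greek sample can only be represented by ISO-8859-7 (one byte per letter) and UTF-8 (two bytes per letter).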
In Firefox, the encoding can be changed through the menu View→Character Coding. This lets the user specify an encoding other than what the web page or server has advertised.
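A minimal Python sketch of why such an override matters, assuming a page whose bytes are ISO-8859-7 but which gets read as ISO-8859-1. The sample word is an illustrative assumption, not taken from any particular page.

```python
# Bytes as a server might send them: Greek text stored in ISO-8859-7.
raw = "Ελληνικά".encode("iso-8859-7")

# A reader that wrongly assumes ISO-8859-1 sees accented Latin letters instead.
garbled = raw.decode("iso-8859-1")

# Choosing the right encoding by hand recovers the intended text.
fixed = raw.decode("iso-8859-7")

print(garbled)
print(fixed)
```

The bytes never change; only the table used to interpret them does, which is exactly what the browser menu selects.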
Put a meta tag in the HTML file header to indicate the encoding of the page. This will give your reader’s browser a clue as to how to interpret and display the bytes in the HTML file.
For Greek/Latin,
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-7">
For Unicode,
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
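As a rough illustration of how a program might read this clue, the following Python sketch pulls the charset out of a Content-Type meta tag using the standard library's `html.parser`. The class `MetaCharsetFinder` and the function `find_charset` are hypothetical helpers written for this example; real browsers follow a more elaborate detection procedure.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the charset advertised by the first Content-Type meta tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "content-type":
            return
        # The content value looks like "text/html; charset=iso-8859-7".
        for part in (attr.get("content") or "").split(";"):
            name, _, value = part.strip().partition("=")
            if name.lower() == "charset" and value:
                self.charset = value.strip().lower()


def find_charset(html_text):
    """Return the advertised charset, or None if no such meta tag is present."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset


page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=iso-8859-7"></head></html>')
print(find_charset(page))
```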
Use the lang attribute to indicate that text has changed languages, e.g. <span lang="el">Χωριάτικος</span>
Terminals
Until recently, xterms could only handle 8-bit encodings (NOT Unicode), but modern terminals are Unicode enabled. (Even with this, rendering problems remain for complex scripts.)
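One way to see what encoding a terminal session reports is to ask Python, as in this small sketch. The helper `can_encode` is a hypothetical name. It only checks whether the encoding can represent the text, and says nothing about whether the terminal's font can actually render it.

```python
import locale
import sys

# Encoding Python believes the terminal expects, and the locale's preference.
print("stdout encoding:", sys.stdout.encoding)
print("locale encoding:", locale.getpreferredencoding(False))


def can_encode(text, encoding):
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(encoding or "ascii")  # fall back to ASCII if none is reported
        return True
    except (UnicodeEncodeError, LookupError):
        return False


print(can_encode("αβγ", sys.stdout.encoding))
```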
Text
KDE’s KWord and Gnome’s GEdit both handle Unicode. However, their approaches are quite different: GEdit assumes everything is Unicode, while KWord has you specify the encoding, both when opening and when saving the file.
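The "specify the encoding when opening and saving" approach can be sketched in Python, where `open` takes an explicit `encoding` argument. The file names and sample text below are assumptions made for the example, and the files live in a temporary directory.

```python
import os
import tempfile

text = "Ελληνικά"

with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "legacy.txt")
    utf8_path = os.path.join(tmp, "utf8.txt")

    # Save the text in a legacy 8-bit encoding.
    with open(legacy_path, "w", encoding="iso-8859-7") as f:
        f.write(text)

    # Open it again, stating the encoding explicitly.
    with open(legacy_path, "r", encoding="iso-8859-7") as f:
        loaded = f.read()

    # Save a Unicode (UTF-8) copy of the same text.
    with open(utf8_path, "w", encoding="utf-8") as f:
        f.write(loaded)

    legacy_size = os.path.getsize(legacy_path)
    utf8_size = os.path.getsize(utf8_path)

print(loaded, legacy_size, utf8_size)
```

The text is identical after the round trip, but the two files differ in size because UTF-8 uses two bytes for each Greek letter where ISO-8859-7 uses one.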
The Gnome e-mail client Balsa is Unicode-enabled. It is also somewhat Pine-friendly.
In order to display a character from a writing system, a font must contain the corresponding glyph, which is a graphical representation of the character.
Many fonts support part or most of the ISO-8859 series.
A very few support most of Unicode.
If you run
xfontsel
you can get an idea. Set “rgstry” to “ISO10646” for Unicode,
then look at the available “fndry”s and “fmly”s.
There are nice open-source efforts to produce Unicode fonts, especially DejaVu, based on Bitstream’s Vera family.
Microsoft used to distribute “MS Arial Unicode”, which was one of the best. Bitstream used to distribute “Cyberbit”.
Firefox’s Preferences lets you associate a font with an encoding. Several other programs give this kind of control.