Text in HTML...
= Index DOT Html by Brian Wilson [bloo@blooberry.com] =

ISO 8859 character sets
ISO 8859 is a set of 10 different 256-character sets used to represent many of the alphabetic languages used in the West. It does not address Far East languages at all. These sets were designed by the standards group ECMA (European Computer Manufacturers Association) and are included in the Internet charset registry for use with MIME identification.
The ISO-8859-1 Character Set Positions:
000-031 | 032-064 | 065-096 | 097-126
127-159 | 160-191 | 192-223 | 224-255

Why is ISO 8859 important, you might ask? ISO 8859-1 (also called ISO Latin-1) is the default character set for HTTP (the transport protocol for web documents) and is also used in the creation of HTML documents. It contains the characters needed to write all major West European languages and is the preferred encoding on the Internet. The following languages are supported by the ISO 8859-1 character set:
Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish
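As a rough check of what the character set covers, the sketch below tries to encode a few sample words with Python's built-in ISO 8859-1 codec. The helper name `fits_latin1` and the sample words are illustrative assumptions, not something taken from this page.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

# West European words fit; Polish letters and the euro sign do not.
samples = ["façade", "Straße", "señor", "Łódź", "€"]
for word in samples:
    print(word, fits_latin1(word))
```

The last two samples are there only to show the failure case: characters outside the 256 positions of ISO 8859-1 cannot be stored as a single octet in this set.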
ISO 8859-1 Characters - To Encode or Not To Encode?
It is acceptable to leave all ISO 8859-1 characters as unencoded character octets, but there is no guarantee that all destination systems will understand all of the characters. To make the entire character set more portable and viewable, HTML additionally offers alternative versions of all ISO 8859-1 characters using coded entity representations. A special syntax is used to represent these Character Entities using either a numeric reference or a shorthand mnemonic word.
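One way to see the effect of numeric references is to let Python produce them. The sketch below relies on the standard `xmlcharrefreplace` error handler; the function name `to_ascii_entities` is a hypothetical helper chosen for illustration.

```python
def to_ascii_entities(text: str) -> bytes:
    """Replace every non-ASCII character with a &#number; reference."""
    return text.encode("ascii", errors="xmlcharrefreplace")

encoded = to_ascii_entities("Café ©")
print(encoded)  # b'Caf&#233; &#169;'
```

Note that this handler only touches characters outside ASCII. It does not escape the reserved HTML characters, which need separate treatment.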

These 'safe' entities are created using characters from the US ASCII (ISO 646) character set. Interestingly enough, the first half of the ISO 8859-1 set (character positions 000-127) is identical to the US ASCII set. In fact, ASCII always occupies character positions 0-127 in ALL ISO 8859 character sets. If safe character entity references are built from this safe portion of the ISO 8859-1 set, which characters in the set need to be encoded, and which format should be used?

Positions 000-031 and 127-159:
The characters in the first range are non-printing characters in the HTML context and are not of any real interest to the discussion of HTML. The second range is earmarked for extended control characters and is not used for encoding characters in HTML. The reason is interoperability with 7-bit devices and with faulty software that strips the 8th bit: a character at positions 128-159 would then land on a control character at positions 000-031. Some operating systems or code pages may use this special range for text characters, but this cannot be relied upon.
Positions 032-064:
Includes common English punctuation and the digits 0-9. This range does not need to be encoded except for the four reserved HTML characters (quotation mark, ampersand, less than and greater than).
Positions 065-126:
Includes uppercase and lowercase letters (A-Z and a-z) as well as common English punctuation. These characters do not need to be encoded.
Positions 160-191:
These represent special symbols. It is always safest to encode this range as character entities (numbered or named) to ensure better portability. This range has only recently gained Named Entity support for most of its characters, so using Numbered Entities is recommended.
Positions 192-255:
These represent special upper and lower case accented national characters. It is always safest to encode this range as character entities (numbered or named) to ensure better portability. The HTML specifications suggest encoding this range as named entities.
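The per-range advice above can be sketched as a small encoder. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `encode_latin1_for_html` and the `RESERVED` table are hypothetical, and the named entities come from Python's standard `html.entities.codepoint2name` table.

```python
from html.entities import codepoint2name

# The four reserved HTML characters and their named entities.
RESERVED = {'"': "&quot;", "&": "&amp;", "<": "&lt;", ">": "&gt;"}

def encode_latin1_for_html(text: str) -> str:
    """Apply the range-by-range encoding advice to a string."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in RESERVED:
            out.append(RESERVED[ch])
        elif 160 <= cp <= 191:
            # Special symbols: numbered entities are recommended.
            out.append(f"&#{cp};")
        elif 192 <= cp <= 255:
            # Accented letters: named entities, numeric as a fallback.
            name = codepoint2name.get(cp)
            out.append(f"&{name};" if name else f"&#{cp};")
        else:
            # Plain ASCII (and anything outside this discussion) is left alone.
            out.append(ch)
    return "".join(out)

print(encode_latin1_for_html("Señor <b> ©"))
# Se&ntilde;or &lt;b&gt; &#169;
```

Control characters and characters beyond position 255 pass through unchanged here, since the ranges above give no encoding advice for them.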
Character Entity Formats - Pros and Cons
Included in the Character Entity domain are both numbered and named entities:
Numbered Entity Syntax: &#charnumber;
Where charnumber is a distinct integer from 0-255.
Named Entity Syntax: &charname;
Where charname is a unique mnemonic shorthand of the character to be represented.
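To illustrate that both syntaxes name the same character, the sketch below builds the numbered form of one entity from Python's standard `html.entities.name2codepoint` table and decodes both with `html.unescape`. The choice of `eacute` is just an example.

```python
import html
from html.entities import name2codepoint

named = "&eacute;"
# Look up the character position behind the mnemonic name.
numbered = f"&#{name2codepoint['eacute']};"

print(numbered)                                         # &#233;
print(html.unescape(named) == html.unescape(numbered))  # True
```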

Why would an author wish to use one method over the other?

Using &entityname;
Pros:
Cons:
Using &#number;
Pros:
Cons:
Special Character Cases:
HTML Reserved characters: (Less than, Greater than, Ampersand and Quotation mark)
Newer commonly used entities:

The Unicode Solution
There is a shift occurring in computer text representation. Traditionally, each character of text is represented at its lowest level by a single byte (8 bits) of data. This allows for 256 possible distinct characters. In languages where the entire character set exceeds this range (such as Far East languages), two bytes are used to represent a single character. Many Far East languages use their own standard sets of double-byte encodings, which compounds the problem and makes transporting characters and documents between language locales even more difficult. This diversity of character sets can also lead to significant problems in the programmatic handling of character data.

The Unicode standard was developed to greatly reduce this fracturing of languages into conflicting character sets. Like the Far East double-byte encodings, Unicode uses 16 bits of data to represent each character. With 16 bits (twice the size of a single byte), 65536 (256*256) distinct encodings are possible. All major character sets of the world (including Far East languages, symbols and dingbats) can be represented using only about 35,000 of these code points. Even though not all possible code points are currently used in Unicode, many obscure characters and dead-language writing systems are not included in the set. Including ALL known languages, variations and symbols would be a never-ending task. A mechanism does exist, however, to expand the number of possible code points in Unicode into the millions should the need arise.
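The arithmetic above is easy to verify. The sketch below also assumes that the expansion mechanism mentioned is UTF-16 surrogate pairs, in which two reserved 16-bit ranges are combined to address code points beyond the first 65536; that identification is an assumption, since the text does not name the mechanism.

```python
single_byte = 2 ** 8    # 256 distinct values in one byte
double_byte = 2 ** 16   # 65536 distinct values in 16 bits

# Assumed mechanism: surrogate pairs (one "high" unit plus one "low" unit).
high_units = 0xDBFF - 0xD800 + 1   # 1024 high surrogates
low_units = 0xDFFF - 0xDC00 + 1    # 1024 low surrogates
extra = high_units * low_units     # additional code points reachable

print(single_byte, double_byte, extra)  # 256 65536 1048576

# The first code point beyond 16 bits, written as two 16-bit units.
pair = "\U00010000".encode("utf-16-be").hex()
print(pair)  # d800dc00
```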

Most current software uses 7-bit or 8-bit encoding of characters, while Unicode uses 16 bits. What would happen if such a system read Unicode? The results could be quite nasty, so there is a work-around: Unicode can be translated into sequences of 7-bit or 8-bit encodings that allow many current and older systems to interchange or transparently pass these documents without loss of content. The most popular version of this translation mechanism is UTF-8 (UCS Transformation Format, 8-bit form). This format represents each Unicode code point as a variable-length sequence of ordinary single bytes.
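The variable-length behaviour of UTF-8 can be observed directly with Python's built-in codec. The helper name `utf8_hex` and the three sample characters are illustrative choices.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as a hex string."""
    return ch.encode("utf-8").hex()

for ch in ["A", "é", "€"]:
    encoded = utf8_hex(ch)
    # Two hex digits per byte, so the byte count is half the string length.
    print(f"U+{ord(ch):04X}", len(encoded) // 2, encoded)
# U+0041 1 41
# U+00E9 2 c3a9
# U+20AC 3 e282ac

# Pure ASCII text is byte-for-byte identical in UTF-8.
print("HTML".encode("utf-8") == "HTML".encode("ascii"))  # True
```

The last line shows why older systems can pass such documents along: text that stays within ASCII is unchanged, and every other character becomes a run of bytes with the 8th bit set.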

The number of operating systems and applications that understand Unicode character encoding is growing, and it is the intended successor to ISO 8859-1 as the base character set used in HTML.
HTML 4.0 - Unicode instead of ISO 8859-1
HTML 4.0 uses Unicode as its base character set. With this change, a whole new set of officially named and numbered character entities is introduced. Occasionally there may be some overlap, where one Unicode position represents the same displayed character as another Unicode position.

Unicode Character Entities
Arrows: Arrow shapes
Greek Capitals: Greek capital characters
Greek Smalls: Greek 'lower case' characters
Math Symbols: Characters commonly used in mathematics
Miscellaneous letters: Latin Extended-A and B characters and Letter-like Symbols
Miscellaneous shapes: Playing card suit symbols and other graphical symbols
Miscellaneous technical symbols: Characters used in various technical disciplines
Bi-directional and spacing characters: Characters used to control bi-directional text and text spacing
General punctuation set 1: Commonly used punctuation characters
General punctuation set 2: More commonly used punctuation characters
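A few of these entity groups can be sampled from Python's standard `html.entities.name2codepoint` table, which carries the HTML 4 entity names. The particular names picked below (one or two per group) are an illustrative selection, not a complete list.

```python
import html
from html.entities import name2codepoint

# One sample each from arrows, Greek capitals, Greek smalls,
# math symbols, shapes, and general punctuation.
for name in ["rarr", "Omega", "alpha", "sum", "spades", "mdash"]:
    cp = name2codepoint[name]
    # Show the named form, the numbered form, and the character itself.
    print(f"&{name};  &#{cp};  {chr(cp)}")

decoded = html.unescape("&alpha;&rarr;&Omega;")
print(decoded)  # α→Ω
```

Unlike the ISO 8859-1 entities, these numbered forms go well beyond 255, because they are Unicode positions.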



Related Sites
Official References
http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1866.txt
RFC 1866: The HTML 2.0 specification (plain text). The appendix contains the Character Entity table.
http://www.w3.org/MarkUp/html-spec/html-spec_13.html
The web version of the HTML 2.0 (RFC 1866) Character Entity table
http://www.w3.org/MarkUp/Wilbur/
The HTML 3.2 (Wilbur) recommendation
[This includes all character entities listed in HTML 2.0 plus new named entities covering the ISO 8859-1 160-191 range.]
http://www.w3.org/TR/REC-html40/
The HTML 4.0 Recommendation
[Includes new Unicode character entities]
http://www.w3.org/International/O-HTML.html
The W3C HTML Internationalization area
http://unicode.org
The Unicode Consortium site

Other Related Links
(These sites provided many of the topics and ideas for this page)
http://www.uni-passau.de/~ramsch/iso8859-1.html
Excellent resource with good pointers on ISO-8859 issues
http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/iso8859-pointers.html
Alan Flavell's excellent document of pointers to information about ISO-8859
http://ppewww.ph.gla.ac.uk/%7Eflavell/iso8859/character-faq.txt
Alan Flavell's brief FAQ document regarding ISO-8859 issues in HTML
http://www.bbsinc.com/iso8859.html
Kevin J Brewer's page with MANY links regarding character set issues.

