Unicode and HTML

The relationship between Unicode and HTML tends to be a difficult topic for many computer professionals, document authors, and World Wide Web users alike. The accurate representation of text, in web pages, from different natural languages and writing systems is complicated by the details of character encoding, markup language syntax, and varying levels of support by web browsers.

= HTML document characters =

Web pages are typically HTML documents. HTML uses Unicode as its primary document character set. That is, an HTML document is, and must be, composed of a sequence of Unicode characters.

When stored on a file system or transmitted over a computer network, these characters are encoded as a sequence of octets (bytes) according to a particular character encoding. This encoding may either be a Unicode Transformation Format, like UTF-8, that can directly encode any Unicode character, or a legacy encoding, like Windows-1252, that cannot.
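As a rough illustration of the difference between the two kinds of encoding, the following Python sketch checks whether a string can be encoded at all. The helper name `can_encode` is hypothetical, and `cp1252` is Python's codec name for Windows-1252.

```python
def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# UTF-8 can encode any Unicode character; Windows-1252 covers only 8-bit text.
sample = "\u8449"                          # the Chinese character used later in this article
utf8_ok = can_encode(sample, "utf-8")      # representable
legacy_ok = can_encode(sample, "cp1252")   # not representable

# The same character can also map to different octets in different encodings.
e_acute_utf8 = "\u00e9".encode("utf-8")    # two octets
e_acute_1252 = "\u00e9".encode("cp1252")   # one octet
```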

== Numeric character references ==

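In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a numeric character reference: a sequence of characters that spells out the Unicode code point of the character being represented. A character reference takes the form &#N;, where N is either a decimal number for the code point or a hexadecimal number, in which case it must be prefixed by x. The characters that make up the numeric character reference are representable in every encoding approved for use on the Internet.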

For example, a Unicode code point like 33865 (decimal), which corresponds to a particular Chinese character, has to be preceded by &# and followed by ;, like this: &#33865;, which produces this: 葉 (if it doesn't look like a Chinese character, your browser or installed fonts may not support it).

The support for hexadecimal in this context is more recent, so older browsers might have problems displaying characters referenced with hexadecimal numbers, although they will probably have a problem displaying Unicode characters above code point 255 anyway. It is still a common practice to convert the hexadecimal code point into a decimal value (for example &#9824; instead of &#x2660;).
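The conversion between a character and its decimal or hexadecimal reference can be sketched in Python as follows. The function name `to_ncr` is hypothetical; `html.unescape` and the `xmlcharrefreplace` error handler are part of the Python standard library.

```python
import html

def to_ncr(ch, hexadecimal=False):
    """Build a numeric character reference for a single character."""
    code_point = ord(ch)
    if hexadecimal:
        return f"&#x{code_point:X};"
    return f"&#{code_point};"

decimal_ref = to_ncr("\u2660")                # decimal form for the spade suit
hex_ref = to_ncr("\u2660", hexadecimal=True)  # hexadecimal form for the same character

# Both forms decode back to the same character.
same = html.unescape(decimal_ref) == html.unescape(hex_ref) == "\u2660"

# Encoding to a legacy charset while falling back to decimal references:
ascii_safe = "\u8449".encode("ascii", "xmlcharrefreplace")
```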

== Named character entities ==

In HTML, there is a standard set of 252 named character entities for characters — some common, some obscure — that are either not found in certain character encodings or are markup sensitive in some contexts (for example angle brackets and quotation marks). Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities instead, where possible, as they are less cryptic and more widely supported by older browsers.

Character entities can be included in an HTML document via the use of entity references, which take the form &EntityName;, where EntityName is the name of the entity. For example, &mdash;, much like &#8212; or &#x2014;, represents U+2014, the em dash character (like this: —), even if the character encoding used doesn't contain that character.
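The equivalence of the named, decimal, and hexadecimal forms can be checked with Python's standard `html` module. This is only a small sketch; the variable names are illustrative.

```python
import html
from html.entities import name2codepoint

# Look up the code point behind the entity name "mdash".
mdash_code_point = name2codepoint["mdash"]

# Three ways of writing the same character in HTML source.
forms = ["&mdash;", "&#8212;", "&#x2014;"]
decoded = [html.unescape(form) for form in forms]

# Markup-sensitive characters are usually written as named entities too.
escaped = html.escape('<a href="x">')
```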

=Character encoding determination=

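In order to correctly process HTML, a web browser must ascertain which Unicode characters are represented by the encoded form of an HTML document. In order to do this, the web browser must know what encoding was used. When a document is transmitted via HTTP, the encoding may be declared in the charset parameter of the Content-Type header; it may also be declared within the document itself, in a META element. When no declaration is present, the browser falls back on a default that depends on its localisation. For a browser from a location where multibyte character encodings are the norm, some form of autodetection is likely to be applied.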
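One place an encoding label commonly appears is the charset parameter of an HTTP Content-Type header. The sketch below reads that parameter using Python's standard `email.message` module, which understands MIME-style headers; the function name `declared_charset` and the sample header values are assumptions for illustration.

```python
from email.message import Message

def declared_charset(content_type, default=None):
    """Return the lowercased charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)

labelled = declared_charset("text/html; charset=ISO-8859-1")
unlabelled = declared_charset("text/html")  # no charset parameter present
```

When the parameter is absent, the caller still has to choose a default or attempt detection, which is the situation the paragraph above describes.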

Because of the legacy of 8-bit text representations in programming languages and operating systems, and the desire to avoid burdening users with needing to understand the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk, and often do not even allow input of characters beyond a very limited range. Consequently, many HTML authors are unaware of encoding issues and may not have any idea what encoding their documents actually use. It is also a common misunderstanding that the encoding declaration effects a change in the actual encoding, whereas it is actually just a label that could be inaccurate.
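The point that a declaration is only a label can be made concrete with a short Python sketch: the stored octets stay the same whatever the label says, and a wrong label simply yields wrong characters. The sample word is an arbitrary choice.

```python
original = "caf\u00e9"

# The author's editor saved the file as UTF-8...
stored = original.encode("utf-8")

# ...but the document is labelled (and therefore decoded) as Windows-1252.
misread = stored.decode("cp1252")

# Decoding with the encoding that was actually used recovers the text.
correct = stored.decode("utf-8")
```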

Many HTML documents are served with inaccurate encoding declarations, or no declarations at all. In order to determine the encoding in such cases, many browsers allow the user to manually select one from a list. They may also employ an encoding autodetection algorithm that works in concert with the manual override. The manual override may apply to all documents, or only those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. This has been addressed somewhat by XHTML, which, being XML, requires that encoding declarations be accurate and that no workarounds be employed when they are found to be inaccurate.
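A very simplified version of such a fallback strategy might look like the sketch below. It is not how any particular browser works: the function name `guess_decode`, the order of candidates, and the Windows-1252 fallback are all assumptions made for illustration.

```python
def guess_decode(data, declared=None, fallback="cp1252"):
    """Try the declared encoding, then UTF-8, then a lenient legacy fallback.

    Returns a (text, encoding_used) pair.
    """
    candidates = [declared] if declared else []
    candidates.append("utf-8")
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown label: move on to the next candidate.
            continue
    return data.decode(fallback, errors="replace"), fallback

utf8_result = guess_decode("\u8449".encode("utf-8"))
legacy_result = guess_decode(b"caf\xe9")                        # not valid UTF-8
mislabelled_result = guess_decode(b"caf\xe9", declared="utf-8")
```

Strict UTF-8 decoding is tried before the legacy fallback because legacy 8-bit text that contains non-ASCII bytes is rarely valid UTF-8, whereas a single-byte codec such as Latin-1 accepts any byte sequence and so can never signal a mismatch.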

=Web browser support=

Many browsers are only capable of displaying a small subset of the full Unicode repertoire. Here is how your browser displays various Unicode code points:

Some web browsers, such as Mozilla Firefox, Opera, and Safari, are able to display multilingual web pages by intelligently choosing a font to display each individual character on the page. They will correctly display any mix of Unicode blocks, as long as appropriate fonts are present in the operating system.

Internet Explorer is capable of displaying the full range of Unicode characters, but can't automatically make the necessary font choice. Web page authors must guess which appropriate fonts might be present on users' systems, and manually specify them for each block of text with a different language or Unicode range. A user may have another font installed which would display some characters, but if the web page author hasn't specified it, then Explorer will fail to display them, and show placeholder squares instead.

Older browsers, such as Netscape Navigator 4.77, can only display text supported by the current font associated with the character encoding of the page, and may misinterpret numeric character references as being references to code values within the current character encoding, rather than references to Unicode code points. When you are using such a browser, it is unlikely that your computer has all of those fonts, or that the browser can use all available fonts on the same page. As a result, the browser will not display the text in the examples above correctly, though it may display a subset of them. Because they are encoded according to the standard, though, they will display correctly on any system that is compliant and does have the characters available. Further, those characters given names for use in named entity references are likely to be more commonly available than others.
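The misinterpretation of numeric character references mentioned above can be illustrated with the number 150. The choice of 150 and of Windows-1252 as the page encoding are assumptions made for the example.

```python
number = 150  # as it would appear in the reference &#150;

# Standard reading: the number is a Unicode code point (here a C1 control character).
as_unicode_code_point = chr(number)

# Legacy misreading: the number is treated as a byte value in the page's encoding.
as_windows_1252_byte = bytes([number]).decode("cp1252")
```

Under the misreading the reference shows up as an en dash (U+2013), which is why pages written for such browsers sometimes contain references in the range 128 to 159.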

For displaying characters outside the Basic Multilingual Plane, like the Gothic letter faihu in the table above, some systems (like Windows 2000) need manual adjustments of their settings.
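To see what being outside the Basic Multilingual Plane means in practice, the sketch below inspects the Gothic letter faihu, assumed here to be code point U+10346. The variable names are illustrative.

```python
import unicodedata

faihu = "\U00010346"
code_point = ord(faihu)

# The Basic Multilingual Plane ends at U+FFFF.
outside_bmp = code_point > 0xFFFF

# The decimal numeric character reference for the letter.
reference = f"&#{code_point};"

# In UTF-16 the character needs two 16-bit units (a surrogate pair).
utf16_units = faihu.encode("utf-16-be")

name = unicodedata.name(faihu)
```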

= External links =

    *[http://www.w3.org/TR/unicode-xml/ Unicode in XML and other Markup Languages] - a W3C & Unicode Consortium joint publication that describes issues and provides guidelines relating to Unicode in markup languages
    *[http://www.w3.org/TR/REC-html40/HTMLlat1.ent Latin-1], [http://www.w3.org/TR/REC-html40/HTMLspecial.ent Special], and [http://www.w3.org/TR/REC-html40/HTMLsymbol.ent Mathematical, Greek and Symbolic] named character entity definitions for HTML 4.01
    *[http://scripts.sil.org/cms/scripts/page.phpsite_id=nrsi&id= SIL's freeware fonts, editors and documentation]
    *[http://www.alanwood.net/unicode/ Alan Wood's Unicode Resources] - Unicode fonts and information
    *[http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm The International Phonetic Alphabet in Unicode]
    *[http://www.alanwood.net/unicode/cjk_compatibility_ideographs.html CJK Compatibility Ideographs]
    *[http://www.unicode.org/charts/ Unicode character charts] - hexadecimal numbers only; PDF files showing all characters independent of browser capabilities
    *[http://unicode.coeurlumiere.com/ Table of Unicode characters from 1 to 65535] - shows how they look in one's browser
    *[http://www.pinyin.info/tools/converter/chars2uninumbers.html Web tool that converts special characters (such as Chinese characters) to Unicode numeric character references]