Character Data in HTML
When you write the text of your HTML page, the characters you use must belong to a known character set. This character set is specified by the "charset" parameter of the "text/html" media type, and it is usually "ISO-8859-1", although it could also be the more restricted "US-ASCII". The character set (charset) ISO-8859-1 is also known as Latin Alphabet No. 1, or simply Latin-1. To specify this charset you should include the following in the HEAD section of the page:
<META HTTP-EQUIV="content-type" CONTENT=
"text/html; charset=ISO-8859-1">
Latin-1 includes characters from most Western European languages, as well as a number of control characters. Control characters are non-printable characters that are typically used for communication and device control, as format markers, and as data delimiters.
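The one-byte-per-character nature of Latin-1 can be illustrated with a short sketch. Python is used here only for illustration and is not part of the original tutorial; the helper name fits_latin1 is hypothetical.

```python
# In ISO-8859-1 every character is stored as a single byte
# whose value equals the character's code.
text = "señor"
encoded = text.encode("iso-8859-1")
print(list(encoded))  # [115, 101, 241, 111, 114]


def fits_latin1(s: str) -> bool:
    """Return True if every character of s exists in ISO-8859-1."""
    try:
        s.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False
```

A character for which fits_latin1 returns False cannot be typed directly into a page declared as ISO-8859-1 and would have to be written as a character reference instead.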
Control characters
In HTML the use of control characters is limited in order to maximize the chance of successful interchange over heterogeneous networks and operating systems. Only three control characters are used: Horizontal Tab (HT, encoded as 9 decimal in US-ASCII and ISO-8859-1), Line Feed (LF, encoded as 10 decimal), and Carriage Return (CR, encoded as 13 decimal).
Horizontal Tab is interpreted as a space in all contexts except pre-formatted text. Within pre-formatted text, the tab should be interpreted to shift the horizontal column position to the next position which is a multiple of 8 on the same line.
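The tab rule above can be sketched with Python's built-in str.expandtabs, which advances to the next multiple of the given tab size. The two function names are hypothetical and only illustrate what a browser is expected to do; they are not taken from any browser implementation.

```python
def render_pre_line(line: str, tab_width: int = 8) -> str:
    """Inside pre-formatted text: each tab moves to the next multiple of tab_width."""
    return line.expandtabs(tab_width)


def render_normal_text(text: str) -> str:
    """Outside pre-formatted text: a tab is treated like a space."""
    return text.replace("\t", " ")


# "ab" occupies columns 0-1, so the tab pads with six spaces up to column 8.
print(repr(render_pre_line("ab\tc")))
```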
Carriage Return and Line Feed are conventionally used to represent end of line. For media types defined as "text/*", the sequence CR LF is used to represent an end of line. In practice, text/html documents are frequently represented and transmitted using an end-of-line convention that depends on the source of the document; that representation may consist of CR only, LF only, or a CR LF combination. Documents prepared on computers running Microsoft's operating systems mark each end of line with both characters, those prepared on Unix systems (including current macOS) use only LF, and those prepared on classic Mac OS use only CR.
In HTML, end of line in any of its variations is interpreted as a space in all contexts except pre-formatted text. Within pre-formatted text, browsers are expected to treat any of the three common representations of end-of-line as starting a new line.
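A minimal sketch of this end-of-line handling, assuming a program that must accept all three conventions. The names split_pre_lines and eol_to_space are hypothetical.

```python
import re

# CR LF is listed first so that the pair is matched as a single end of line.
_EOL = re.compile(r"\r\n|\r|\n")


def split_pre_lines(text: str) -> list[str]:
    """Inside pre-formatted text: any end-of-line form starts a new line."""
    return _EOL.split(text)


def eol_to_space(text: str) -> str:
    """Outside pre-formatted text: any end-of-line form acts as a space."""
    return _EOL.sub(" ", text)


print(split_pre_lines("a\r\nb\rc\nd"))  # ['a', 'b', 'c', 'd']
```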
Special characters
Certain characters have special meaning in HTML documents. There are two printing characters which may be interpreted by an HTML application to have an effect on the format of the text:
Space
· Interpreted as a word space (place where a line can be broken) in all contexts except the Pre-formatted Text element.
· Interpreted as a non-breaking space within the Pre-formatted Text element.
Hyphen
· Interpreted as a hyphen symbol in all contexts
· Interpreted as a potential word space when hyphenating the document
Certain characters are part of the HTML markup, and when used in the content's text should be replaced by entity references, always prefaced with ampersand (&) and followed by a semicolon. These characters are as follows:
| Symbol | Entity Name | Description |
| < | lt | Less than sign |
| > | gt | Greater than sign |
| & | amp | Ampersand |
| " | quot | Double quote sign |
So that these characters will not be interpreted as markup, they must be represented by entity references. For example, this line of a program written in the C language,
if ( var > 125 && var < 250 ) {
when included in an HTML page, should be coded as:
if ( var &gt; 125 &amp;&amp; var &lt; 250 ) {
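When pages are generated by a program, this replacement is usually automated. As one possible illustration, Python's standard html module provides escape and unescape; the variable names below are only for the example.

```python
import html

c_source = "if ( var > 125 && var < 250 ) {"

# escape() replaces &, < and > (and, by default, quote characters)
# with the corresponding references.
escaped = html.escape(c_source)
print(escaped)  # if ( var &gt; 125 &amp;&amp; var &lt; 250 ) {

# unescape() reverses the substitution.
restored = html.unescape(escaped)
```

The ampersand is replaced first, so the ampersands introduced by the other replacements are not escaped a second time.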
Character Entities
Many of the Latin alphabet No. 1 set of printing characters may be represented within the text of an HTML document by a character entity. The reasons for using a character entity are:
· the keyboard does not provide a key for the character, such as on U.S. keyboards which do not provide European characters
· the character may be interpreted as markup, such as the ampersand (&), the double quote ("), the less-than sign (<) and the greater-than sign (>)
The HTML DTD includes a character entity for each of the printing characters in the character set Latin-1, so that one may reference them by name if it is inconvenient to enter them directly. To ensure that a string of characters is not interpreted as markup, represent all occurrences of <, >, and & by character or entity references.
The following entity names are used in HTML, always prefaced with ampersand (&) and followed by a semicolon.
Table of character entities
| Name | Symbol | Description |
| Aacute | Á | Capital A, acute accent |
| aacute | á | Small a, acute accent |
| Acirc | Â | Capital A, circumflex accent |
| acirc | â | Small a, circumflex accent |
| AElig | Æ | Capital AE diphthong (ligature) |
| aelig | æ | Small ae diphthong (ligature) |
| Agrave | À | Capital A, grave accent |
| agrave | à | Small a, grave accent |
| Aring | Å | Capital A, ring |
| aring | å | Small a, ring |
| Atilde | Ã | Capital A, tilde |
| atilde | ã | Small a, tilde |
| Auml | Ä | Capital A, dieresis or umlaut mark |
| auml | ä | Small a, dieresis or umlaut mark |
| Ccedil | Ç | Capital C, cedilla |
| ccedil | ç | Small c, cedilla |
| copy | © | Copyright |
| Eacute | É | Capital E, acute accent |
| eacute | é | Small e, acute accent |
| Ecirc | Ê | Capital E, circumflex accent |
| ecirc | ê | Small e, circumflex accent |
| Egrave | È | Capital E, grave accent |
| egrave | è | Small e, grave accent |
| ETH | Ð | Capital Eth, Icelandic |
| eth | ð | Small eth, Icelandic |
| Euml | Ë | Capital E, dieresis or umlaut mark |
| euml | ë | Small e, dieresis or umlaut mark |
| Iacute | Í | Capital I, acute accent |
| iacute | í | Small i, acute accent |
| Icirc | Î | Capital I, circumflex accent |
| icirc | î | Small i, circumflex accent |
| Igrave | Ì | Capital I, grave accent |
| igrave | ì | Small i, grave accent |
| Iuml | Ï | Capital I, dieresis or umlaut mark |
| iuml | ï | Small i, dieresis or umlaut mark |
| Ntilde | Ñ | Capital N, tilde |
| ntilde | ñ | Small n, tilde |
| Oacute | Ó | Capital O, acute accent |
| oacute | ó | Small o, acute accent |
| Ocirc | Ô | Capital O, circumflex accent |
| ocirc | ô | Small o, circumflex accent |
| Ograve | Ò | Capital O, grave accent |
| ograve | ò | Small o, grave accent |
| Oslash | Ø | Capital O, slash |
| oslash | ø | Small o, slash |
| Otilde | Õ | Capital O, tilde |
| otilde | õ | Small o, tilde |
| Ouml | Ö | Capital O, dieresis or umlaut mark |
| ouml | ö | Small o, dieresis or umlaut mark |
| reg | ® | Registered trademark |
| szlig | ß | Small sharp s, German (sz ligature) |
| THORN | Þ | Capital THORN, Icelandic |
| thorn | þ | Small thorn, Icelandic |
| trade | ™ | Trademark |
| Uacute | Ú | Capital U, acute accent |
| uacute | ú | Small u, acute accent |
| Ucirc | Û | Capital U, circumflex accent |
| ucirc | û | Small u, circumflex accent |
| Ugrave | Ù | Capital U, grave accent |
| ugrave | ù | Small u, grave accent |
| Uuml | Ü | Capital U, dieresis or umlaut mark |
| uuml | ü | Small u, dieresis or umlaut mark |
| Yacute | Ý | Capital Y, acute accent |
| yacute | ý | Small y, acute accent |
| yuml | ÿ | Small y, dieresis or umlaut mark |
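The mapping from entity names to characters can also be looked up programmatically. The sketch below uses Python's standard html.entities.name2codepoint dictionary; the helper entity_to_char is a hypothetical name introduced only for this example.

```python
import html
from html.entities import name2codepoint


def entity_to_char(name: str) -> str:
    """Return the character for an entity name given without '&' and ';'."""
    return chr(name2codepoint[name])


# Entity names are case-sensitive: Aacute and aacute are different characters.
print(entity_to_char("Aacute"), entity_to_char("aacute"))  # Á á

# unescape() resolves complete references embedded in text.
print(html.unescape("caf&eacute;"))  # café
```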
Numeric Character Entities
Numeric character entities are represented in an HTML document as entities whose name is the number sign (#) followed by a decimal number in the range 32-126 or 160-255. The HTML DTD includes a numeric character entity for each of the printing characters of the ISO-8859-1 encoding, so that one may reference them by number if it is inconvenient to enter them directly. The following entity names are used in HTML, always prefaced with ampersand (&) and followed by a semicolon.
Table of numeric character entities
| Name | Description | Symbol |
| #00-#08 | Unused | N/A |
| #09 | Horizontal tab | N/A |
| #10 | Line feed | N/A |
| #11-#12 | Unused | N/A |
| #13 | Carriage return | N/A |
| #14-#31 | Unused | N/A |
| #32 | Space | N/A |
| #33 | Exclamation mark | ! |
| #34 | Quotation mark | " |
| #35 | Number sign | # |
| #36 | Dollar sign | $ |
| #37 | Percent sign | % |
| #38 | Ampersand | & |
| #39 | Apostrophe | ' |
| #40 | Left parenthesis | ( |
| #41 | Right parenthesis | ) |
| #42 | Asterisk | * |
| #43 | Plus sign | + |
| #44 | Comma | , |
| #45 | Hyphen | - |
| #46 | Period (full stop) | . |
| #47 | Solidus (slash) | / |
| #48-#57 | Digits 0-9 | 0-9 |
| #58 | Colon | : |
| #59 | Semicolon | ; |
| #60 | Less than | < |
| #61 | Equals sign | = |
| #62 | Greater than | > |
| #63 | Question mark | ? |
| #64 | Commercial at | @ |
| #65-#90 | Letters A-Z | A-Z |
| #91 | Left square bracket | [ |
| #92 | Reverse solidus (backslash) | \ |
| #93 | Right square bracket | ] |
| #94 | Caret | ^ |
| #95 | Horizontal bar (underscore) | _ |
| #96 | Grave accent | ` |
| #97-#122 | Letters a-z | a-z |
| #123 | Left curly brace | { |
| #124 | Vertical bar | | |
| #125 | Right curly brace | } |
| #126 | Tilde | ~ |
| #127-#159 | Unused | N/A |
| #160 | Non-breaking space | N/A |
| #161 | Inverted exclamation | ¡ |
| #162 | Cent sign | ¢ |
| #163 | Pound sterling | £ |
| #164 | General currency sign | ¤ |
| #165 | Yen sign | ¥ |
| #166 | Broken vertical bar | ¦ |
| #167 | Section sign | § |
| #168 | Umlaut (dieresis) | ¨ |
| #169 | Copyright | © |
| #170 | Feminine ordinal | ª |
| #171 | Left angle quote, guillemet left | « |
| #172 | Not sign | ¬ |
| #173 | Soft hyphen | N/A |
| #174 | Registered trademark | ® |
| #175 | Macron accent | ¯ |
| #176 | Degree sign | ° |
| #177 | Plus or minus | ± |
| #178 | Superscript two | ² |
| #179 | Superscript three | ³ |
| #180 | Acute accent | ´ |
| #181 | Micro sign | µ |
| #182 | Paragraph sign | ¶ |
| #183 | Middle dot | · |
| #184 | Cedilla | ¸ |
| #185 | Superscript one | ¹ |
| #186 | Masculine ordinal | º |
| #187 | Right angle quote, guillemet right | » |
| #188 | Fraction one-fourth | ¼ |
| #189 | Fraction one-half | ½ |
| #190 | Fraction three-fourths | ¾ |
| #191 | Inverted question mark | ¿ |
| #192 | Capital A, grave accent | À |
| #193 | Capital A, acute accent | Á |
| #194 | Capital A, circumflex accent | Â |
| #195 | Capital A, tilde | Ã |
| #196 | Capital A, dieresis or umlaut mark | Ä |
| #197 | Capital A, ring | Å |
| #198 | Capital AE diphthong (ligature) | Æ |
| #199 | Capital C, cedilla | Ç |
| #200 | Capital E, grave accent | È |
| #201 | Capital E, acute accent | É |
| #202 | Capital E, circumflex accent | Ê |
| #203 | Capital E, dieresis or umlaut mark | Ë |
| #204 | Capital I, grave accent | Ì |
| #205 | Capital I, acute accent | Í |
| #206 | Capital I, circumflex accent | Î |
| #207 | Capital I, dieresis or umlaut mark | Ï |
| #208 | Capital Eth, Icelandic | Ð |
| #209 | Capital N, tilde | Ñ |
| #210 | Capital O, grave accent | Ò |
| #211 | Capital O, acute accent | Ó |
| #212 | Capital O, circumflex accent | Ô |
| #213 | Capital O, tilde | Õ |
| #214 | Capital O, dieresis or umlaut mark | Ö |
| #215 | Multiply sign | × |
| #216 | Capital O, slash | Ø |
| #217 | Capital U, grave accent | Ù |
| #218 | Capital U, acute accent | Ú |
| #219 | Capital U, circumflex accent | Û |
| #220 | Capital U, dieresis or umlaut mark | Ü |
| #221 | Capital Y, acute accent | Ý |
| #222 | Capital THORN, Icelandic | Þ |
| #223 | Small sharp s, German (sz ligature) | ß |
| #224 | Small a, grave accent | à |
| #225 | Small a, acute accent | á |
| #226 | Small a, circumflex accent | â |
| #227 | Small a, tilde | ã |
| #228 | Small a, dieresis or umlaut mark | ä |
| #229 | Small a, ring | å |
| #230 | Small ae diphthong (ligature) | æ |
| #231 | Small c, cedilla | ç |
| #232 | Small e, grave accent | è |
| #233 | Small e, acute accent | é |
| #234 | Small e, circumflex accent | ê |
| #235 | Small e, dieresis or umlaut mark | ë |
| #236 | Small i, grave accent | ì |
| #237 | Small i, acute accent | í |
| #238 | Small i, circumflex accent | î |
| #239 | Small i, dieresis or umlaut mark | ï |
| #240 | Small eth, Icelandic | ð |
| #241 | Small n, tilde | ñ |
| #242 | Small o, grave accent | ò |
| #243 | Small o, acute accent | ó |
| #244 | Small o, circumflex accent | ô |
| #245 | Small o, tilde | õ |
| #246 | Small o, dieresis or umlaut mark | ö |
| #247 | Division sign | ÷ |
| #248 | Small o, slash | ø |
| #249 | Small u, grave accent | ù |
| #250 | Small u, acute accent | ú |
| #251 | Small u, circumflex accent | û |
| #252 | Small u, dieresis or umlaut mark | ü |
| #253 | Small y, acute accent | ý |
| #254 | Small thorn, Icelandic | þ |
| #255 | Small y, dieresis or umlaut mark | ÿ |
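Because the number in a numeric character entity is simply the character's code, such references can be generated mechanically. The following is a minimal sketch in Python, assuming that only characters above code 126 need replacing; the function name to_numeric_references is hypothetical.

```python
import html


def to_numeric_references(text: str) -> str:
    """Replace every character whose code is above 126 with a decimal reference."""
    return "".join(ch if ord(ch) <= 126 else f"&#{ord(ch)};" for ch in text)


encoded = to_numeric_references("Señor ©")
print(encoded)  # Se&#241;or &#169;

# unescape() turns the references back into characters.
decoded = html.unescape(encoded)
```

This sketch leaves the markup characters <, > and & untouched, so they would still need to be replaced separately before the text is placed in a page.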