HTML and CSS Reference
In-Depth Information
expression should find most documents that do not have a lang or xml:lang attribute on their root html
element:
<html\s+((id|class|dir|xmlns)\s*=\s*("[^"]+"|'[^']+')\s*)*>
This will find most cases you need to deal with.
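As a rough illustration of how this expression might be applied, the following Python sketch runs the pattern, unchanged, against a few sample start tags. The function name root_lacks_lang and the sample strings are hypothetical and exist only for this example. It is a sketch under those assumptions, not a replacement for a real HTML parser.

```python
import re

# The pattern from the text, copied unchanged. It matches an <html> start tag
# whose attributes are limited to id, class, dir, and xmlns.
NO_LANG_HTML = re.compile(
    r"""<html\s+((id|class|dir|xmlns)\s*=\s*("[^"]+"|'[^']+')\s*)*>"""
)


def root_lacks_lang(markup: str) -> bool:
    """Return True if the pattern finds an html start tag without lang or xml:lang.

    Hypothetical helper name, introduced only for this example.
    """
    return NO_LANG_HTML.search(markup) is not None


if __name__ == "__main__":
    samples = [
        '<html xmlns="http://www.w3.org/1999/xhtml">',
        '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">',
        "<html>",
    ]
    for tag in samples:
        print(tag, "->", root_lacks_lang(tag))
```

Because the pattern requires whitespace after html (the \s+), a bare <html> tag with no attributes at all is not reported. That is one of the cases "most" leaves out, so such documents need a separate check.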
The language codes themselves are standardized in the IANA Language Subtag Registry. Where possible, you
should use the standard two-letter codes as shown in Table 6.3. This is an abbreviated list. You can find the full
list at www.iana.org/assignments/language-subtag-registry.
Table 6.3. Common Language Codes
Language      Code
Amharic       am
Arabic        ar
Czech         cs
German        de
Greek         el
English       en
Esperanto     eo
Spanish       es
French        fr
Hindi         hi
Indonesian    id
Italian       it
Japanese      ja
Korean        ko
Dutch         nl
Portuguese    pt
Russian       ru
Vietnamese    vi
Chinese       zh
Although there are many more codes than I've shown in Table 6.3, there are even more languages on the
planet (about 6,000) than there are two-letter codes. Less common languages now use three-letter codes. For
instance, Coptic has the code cop. There are also dialect subcodes you can use. For example, en-US is English
as spoken in the United States, whereas en-GB is English as spoken in Great Britain. This might matter a little to
search engines or spell checkers. However, getting this right is not nearly as important as identifying the
primary language.
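To make the relationship between a dialect subcode and its primary language concrete, here is a minimal Python sketch that extracts the primary language from a code such as en-US. The function names primary_language and same_primary_language are hypothetical, and the sketch assumes the code is already well formed, with subtags separated by hyphens. It does not validate anything against the IANA registry.

```python
def primary_language(code: str) -> str:
    """Return the first subtag of a language code, lowercased.

    Hypothetical helper for illustration. Assumes hyphen-separated subtags,
    as in en-US, and performs no validation against the registry.
    """
    return code.strip().split("-")[0].lower()


def same_primary_language(a: str, b: str) -> bool:
    """Return True if two codes share a primary language, ignoring dialect."""
    return primary_language(a) == primary_language(b)


if __name__ == "__main__":
    for code in ("en-US", "en-GB", "cop", "fr"):
        print(code, "->", primary_language(code))
    print(same_primary_language("en-US", "en-GB"))
```

A tool that cares only about the primary language, as most do, can compare en-US and en-GB this way and treat both as English.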
Although they are redundant, you should include both lang and xml:lang attributes, at least for now. Older
 