[jdom-interest] Fwd: Re: Kana symbols and UTF-8? (was Re: Kana characters?)

Michael Kay mike at saxonica.com
Wed May 23 01:30:20 PDT 2007

> To clear confusion, the symbols used in the <hiraganaSym> 
> tags are actual fonts of the UTF-8 hexadecimal value. 

I'm sorry, but that kind of language causes far more confusion than it
clears. You're using words like symbol, font, and tag quite inaccurately.

Your XML document is a sequence of bytes or octets. The encoding of the
document determines the mapping of these octets to Unicode characters, so if
the encoding is UTF-8 then a sequence of three particular octets might
represent the character whose Unicode name is "HIRAGANA LETTER HA", which is
assigned to the codepoint hexadecimal x306F (=decimal 12399). A font is a
mapping from characters to glyphs (visible representations of characters on
screen or paper). So to get from a sequence of octets in your file to
something you see on the screen, you first use the encoding to translate the
octets to characters, and you then use a font to translate the characters to
glyphs.

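As a rough illustration of the octets-to-characters step, here is a small Java
sketch. The class and method names (KanaEncoding, utf8Hex, decode) are made up
for this example. It assumes the three octets in question are E3 81 AF, which
is what UTF-8 produces for codepoint x306F, and it also shows what happens if
the same octets are decoded with the wrong encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class KanaEncoding {
    // Encode a string as UTF-8 and format each octet as two hex digits.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Map octets to characters using the given encoding.
    static String decode(byte[] octets, Charset encoding) {
        return new String(octets, encoding);
    }

    public static void main(String[] args) {
        byte[] octets = {(byte) 0xE3, (byte) 0x81, (byte) 0xAF};

        // Correct encoding: three octets become one character.
        String ha = decode(octets, StandardCharsets.UTF_8);
        int cp = ha.codePointAt(0);
        System.out.printf("U+%04X = decimal %d%n", cp, cp); // U+306F = decimal 12399
        System.out.println(Character.getName(cp));          // HIRAGANA LETTER HA

        // Wrong encoding: the same octets become three unrelated characters.
        String wrong = decode(octets, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length());                 // 3

        // Round trip: character back to octets.
        System.out.println(utf8Hex("\u306F"));              // E3 81 AF
    }
}
```

Nothing here involves a font. Whether a glyph appears when the string is
printed depends entirely on the console or editor displaying it.
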
In XML, you can always represent a character using a character reference,
for example HIRAGANA LETTER HA can be represented as &#x306F; or as
&#12399;. This is useful if you don't have a keyboard that lets you enter
the character directly, and it also has the advantage that it protects you
from errors in applying the encoding. But it doesn't help you with font
difficulties: if you use a font that has no glyph for a given character,
then it will usually be displayed in some kind of fallback representation,
for example a hollow rectangle.
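
To see that the two character references really do denote the same character,
something like the following sketch should do. It uses the JDK's built-in DOM
parser rather than JDOM only to stay self-contained. The class name
CharRefDemo and the helper parseText are invented for the example, and the
element name is borrowed from the quoted message. The document is fed to the
parser as pure ASCII, so no encoding mistake can affect the result.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class CharRefDemo {
    // Parse an ASCII-only XML string and return the root element's text.
    static String parseText(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.US_ASCII);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String hex = parseText("<hiraganaSym>&#x306F;</hiraganaSym>");
        String dec = parseText("<hiraganaSym>&#12399;</hiraganaSym>");

        // Both references expand to the same single character.
        System.out.println(hex.equals(dec));                        // true
        System.out.println(hex.equals("\u306F"));                   // true
        System.out.println(Integer.toHexString(hex.codePointAt(0))); // 306f
    }
}
```

The parser hands back the character itself. The reference is gone by the time
the application sees the text, and whether it can be displayed is still a
matter for whatever font is used later.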

Michael Kay
