[jdom-interest] XMLOutputter problems with Unicode

Mad Einstein madeinstein at hotmail.com
Tue Jul 2 09:06:59 PDT 2002


The current XMLOutputter class (Version 8) doesn't support Unicode characters with character codes of 128 and above.

I was trying to save the character \u8220 to XML using XMLOutputter. The resulting file contained one byte (93 hex) instead of two bytes, and afterwards I could neither parse the file with SAXBuilder nor open it in Internet Explorer.
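A single byte of 93 hex is what a Windows-1252 style encoding produces for U+201C, whose decimal code is 8220. The sketch below shows how the same character comes out under different encodings. It rests on two assumptions the report does not confirm: that the character meant was decimal 8220 (U+201C), and that the file was written with a Windows-1252 platform default. The names EncodingCheck and hexBytes are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JDOM.
final class EncodingCheck {

    // Encodes the text with the given charset and returns the bytes as hex.
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String quote = "\u201C"; // assumed character: decimal 8220
        System.out.println(hexBytes(quote, Charset.forName("windows-1252")));
        System.out.println(hexBytes(quote, StandardCharsets.UTF_8));
        System.out.println(hexBytes(quote, StandardCharsets.UTF_16BE));
    }
}
```

Under these assumptions the three lines printed are 93, e2 80 9c and 20 1c. If the XML declaration announces UTF-8 but the file holds the lone byte 93, a parser will reject it, which would match the SAXBuilder failure described above.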

I read about several algorithms for converting Unicode to XML and HTML, and I think this one is the best:


--------------------------------------------------------------------------------

http://czyborra.com/utf/#UTF-8
HTML's Numerical Character References
A somewhat more standardized encoding option is specified by HTML. RFC 2070 allows us to reference just any Unicode character within any HTML document of any charset by using the decimal numeric character reference &#12345; as in:

putwchar(c)
{
  if (c < 0x80 && c != '&' && c != '<') putchar(c);
  else printf ("&#%d;", c);
}

Decimal numbers for Unicode characters are also used in Windows NT's Alt-12345 input method but are still of so little mnemonic value that a hexadecimal alternative &#x1bc; is being supported by the newer standards HTML 4.0 and XML 1.0. Apart from that, hexadecimal numbers aren't that easy to memorize either. SGML has long allowed symbolic character entities for some character references like &eacute; for é and &euro; for the € but the table of supported entities differs from browser to browser. 


--------------------------------------------------------------------------------
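The quoted putwchar routine can be sketched in Java roughly as follows. This is an illustration only: the class NumericRefs and its escape method are hypothetical names, and the switch between decimal and hexadecimal output is added here to show both forms mentioned in the excerpt. Because a Java char is a 16-bit UTF-16 unit, the sketch walks the string by code point so that a character outside the Basic Multilingual Plane yields one reference instead of two surrogate values.

```java
// Hypothetical helper illustrating numeric character references.
final class NumericRefs {

    // Mirrors the quoted routine: ASCII other than '&' and '<' is copied,
    // everything else becomes a decimal or hexadecimal reference.
    static String escape(String text, boolean hex) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 0x80 && cp != '&' && cp != '<') {
                out.append((char) cp);
            } else if (hex) {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a<b \u201C", false)); // decimal form
        System.out.println(escape("\u01BC", true));      // hexadecimal form
    }
}
```

The first call prints a&#60;b &#8220; and the second prints &#x1bc;, the hexadecimal example used in the excerpt.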


I wrote this method for the conversion.

It converts these three characters (&, <, >), as well as all characters with codes of 128 and above, to decimal numeric character references of the form &#1234;. The output now works with any parser supporting XML 1.0.

/**
 * Converts a Unicode character to a decimal numeric character reference.
 * Characters with a code below 128 (decimal), apart from '>', '<' and '&',
 * need no escaping, so null is returned for them. Every other character is
 * converted to the form &#{decimal_code};
 * Example:
 * <br> \u00E9  --> &#233;
 * @param value Unicode character
 * @return the numeric character reference, or null if the character
 *         can be written unchanged
 */
  public String convertTEXTtoHTML(char value)
  {
     // A char widens to the numeric value of its UTF-16 code unit.
     int code = value;
     if ((code < 128) && (value != '&') && (value != '<') && (value != '>'))
     {
       // null tells the caller to leave the character as it is
       return null;
     }
     return "&#" + code + ";";
  }

and I changed the default case in the XMLOutputter.escapeElementEntities(String str) method:

   default :
       entity = convertTEXTtoHTML(ch);
       break;
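To show where that default case sits, here is a minimal, self-contained sketch of an escaping loop. It is not JDOM's actual source: the class EscapeSketch, the loop, and the named cases for '<', '>' and '&' are assumptions made for illustration. Only the default branch and the null convention of convertTEXTtoHTML come from the message above.

```java
// Hypothetical stand-in for the escaping loop; not JDOM's actual source.
final class EscapeSketch {

    // Same contract as the method above: null means "write the char as-is".
    static String convertTEXTtoHTML(char value) {
        if (value < 128 && value != '&' && value != '<' && value != '>') {
            return null;
        }
        return "&#" + (int) value + ";";
    }

    // Assumed shape of the loop: named entities for the three markup
    // characters, the numeric fallback for everything else.
    static String escapeElementEntities(String str) {
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            char ch = str.charAt(i);
            String entity;
            switch (ch) {
                case '<': entity = "&lt;"; break;
                case '>': entity = "&gt;"; break;
                case '&': entity = "&amp;"; break;
                default:  entity = convertTEXTtoHTML(ch); break;
            }
            buffer.append(entity != null ? entity : String.valueOf(ch));
        }
        return buffer.toString();
    }
}
```

With this sketch, escapeElementEntities("x < \u201Cy\u201D & z") returns x &lt; &#8220;y&#8221; &amp; z, which contains only ASCII bytes and so no longer depends on the encoding used to write the file.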

Maybe there is a different solution to this problem, but this one works fine.

Mad Einstein
