[jdom-interest] XMLOutputter problems with Unicode

Mad Einstein madeinstein at hotmail.com
Wed Jul 3 02:29:36 PDT 2002


I tried to do it like this:

    Element root = new Element("indexes");
    Document doc = new Document(root);   //sample JDom Document

    FileWriter fw = new FileWriter("test.xml",false);
    new XMLOutputter(" ", true, "UTF-8").output(doc,fw);

And the result was, as I said, one byte (93 hex) instead of \u8220.

Should I use a different writer? Do you know any writers that will give me
proper Unicode output?

Thanks,

Mad Einstein
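
A note on the snippet above, as a hedged sketch rather than a confirmed diagnosis: in the Java versions of that time, FileWriter encodes with the platform default charset. On a Western Windows system that is typically windows-1252, where U+201C (decimal 8220) maps to the single byte 0x93, which would match the symptom described. The example below only demonstrates that a Writer explicitly bound to UTF-8 emits three bytes for that character. The class and method names are made up for illustration, and it writes to an in-memory stream instead of test.xml. In the original code the likely equivalent would be new OutputStreamWriter(new FileOutputStream("test.xml"), "UTF-8") passed to output(doc, writer).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8WriterSketch {
    // Encodes text through a Writer explicitly bound to UTF-8,
    // instead of relying on the platform default encoding.
    static byte[] encodeUtf8(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // U+201C (decimal 8220) is the left double quotation mark.
        for (byte b : encodeUtf8("\u201C")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // prints the bytes E2 80 9C
    }
}
```

With a writer like this, the outputter's declared encoding and the bytes actually written agree, so no escaping of non-ASCII characters should be needed.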

----- Original Message -----
From: "Jason Hunter" <jhunter at servlets.com>
To: "Mad Einstein" <madeinstein at hotmail.com>
Cc: <jdom-interest at jdom.org>
Sent: Tuesday, July 02, 2002 8:06 PM
Subject: Re: [jdom-interest] XMLOutputter problems with Unicode


> Your solution is one approach.  However, if you simply leave the
> outputter's encoding as UTF-8 (the default) and pass in an output stream
> or a writer designed for UTF-8, then characters are encoded correctly
> without needing to be escaped.  That should be faster than your
> solution.  If you don't see that happening, you probably passed in an
> improper writer or changed the encoding.
>
> -jh-
>
> > Mad Einstein wrote:
> >
> > 
> > The current XMLOutputter class (Version 8) doesn't support Unicode
> > characters with a character code above 128.
> >
> > I was trying to save the character \u8220 to XML using XMLOutputter,
> > and the resulting file contained one byte (93 hex) instead of two bytes.
> > I then couldn't parse the file with SAXBuilder, nor could I open it
> > in Internet Explorer.
> >
> > I was reading about different algorithms that convert Unicode to XML
> > and HTML, and I think this one is the best:
> >
> > ----------------------------------------------------------------------
> > http://czyborra.com/utf/#UTF-8
> >
> > HTML's Numerical Character References
> >
> > A somewhat more standardized encoding option is specified by HTML. RFC
> > 2070 allows us to reference just any Unicode character within any HTML
> > document of any charset by using the decimal numeric character
> > reference &#12345; as in:
> >
> > putwchar(c)
> > {
> >   if (c < 0x80 && c != '&' && c != '<') putchar(c);
> >   else printf ("&#%d;", c);
> > }
> >
> > Decimal numbers for Unicode characters are also used in Windows NT's
> > Alt-12345 input method but are still of so little mnemonic value that
> > a hexadecimal alternative &#x1bc; is being supported by the newer
> > standards HTML 4.0 and XML 1.0. Apart from that, hexadecimal numbers
> > aren't that easy to memorize either. SGML has long allowed symbolic
> > character entities for some character references like &eacute; for é
> > and &euro; for the € but the table of supported entities differs
> > from browser to browser.
> >
> > ----------------------------------------------------------------------
> >
> > I wrote this method for the conversion
> >
> > This method converts these three characters (&, <, >) to numeric
> > entities, as well as all characters above 128, using the format
> > &#1234; Now the output works with any parser supporting XML 1.0.
> >
> > /**
> >  * Converts a Unicode character to an HTML decimal character reference.
> >  * Characters with a code below 128 (decimal), apart from '>', '<'
> >  * and '&', are left as they are. The rest are converted to the
> >  * decimal reference &#{char_code};
> >  * Supported format example:
> >  * <br> \u003F  --> &#63;
> >  * @param value Unicode character
> >  * @return the decimal character reference, or null if the character
> >  *         needs no conversion.
> >  */
> >   public String convertTEXTtoHTML(char value)
> >   {
> >      int code = value;   // numeric code of the character
> >
> >      // Plain ASCII other than '&', '<' and '>' is not converted;
> >      // null signals the caller to keep the character unchanged.
> >      if (code < 128 && value != '&' && value != '<' && value != '>')
> >         return null;
> >
> >      return "&#" + code + ";";
> >   }
> >
> > and I changed XMLOutputter.escapeElementEntities(String str) method
> >
> >    default :
> >        entity = convertTEXTtoHTML(ch);
> >        break;
> >
> > Maybe there is a different solution for this problem, but this one
> > works fine.
> >
> > Mad Einstein
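
For comparison, here is a minimal standalone sketch of the same numeric character reference idea applied to a whole string. It is not the patched XMLOutputter code; the class name and the escape method are hypothetical. One deliberate difference from the per-char method above is that it walks code points rather than chars, on the assumption that a character outside the Basic Multilingual Plane should become one reference rather than two surrogate references.

```java
public class NumericRefSketch {
    // Escapes '&', '<', '>' and every code point of 128 or above as a
    // decimal numeric character reference; other ASCII passes through.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 128 && cp != '&' && cp != '<' && cp != '>') {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: a&#60;b &#8220;q&#8221;
        System.out.println(escape("a<b \u201Cq\u201D"));
    }
}
```

Output escaped this way is pure ASCII, so it survives a writer with the wrong encoding, at the cost of larger files than properly encoded UTF-8.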