[jdom-interest] XMLOutputter problems with Unicode

Ian Lea ian at digimem.net
Wed Jul 3 03:04:42 PDT 2002


FileWriter uses the default encoding.  Try using
OutputStreamWriter and FileOutputStream.
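
A minimal sketch of that suggestion, using only the standard library. The class and method names (Utf8WriterSketch, encodeToHex) are made up for illustration, and the XMLOutputter call from the quoted code appears only as a comment. A byte array stands in for the file so the encoded bytes can be printed. The windows-1252 line assumes that was the poster's platform default, which would explain the single 93 hex byte.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8WriterSketch {

    /** Writes text through an OutputStreamWriter with an explicit charset; returns the bytes as hex. */
    public static String encodeToHex(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // For a real file: new OutputStreamWriter(new FileOutputStream("test.xml"), charset)
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            // In the quoted code, the outputter call would go here instead:
            // new XMLOutputter(...).output(doc, writer);
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and flushed) at this point, so all bytes are present.
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes.toByteArray()) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        String quote = "\u201C"; // LEFT DOUBLE QUOTATION MARK, decimal 8220
        System.out.println("UTF-8:        " + encodeToHex(quote, StandardCharsets.UTF_8));
        // Assumption: the original platform default was windows-1252.
        if (Charset.isSupported("windows-1252")) {
            System.out.println("windows-1252: " + encodeToHex(quote, Charset.forName("windows-1252")));
        }
    }
}
```

The point is that the charset is named explicitly when the writer is created, so the bytes no longer depend on the platform default that FileWriter picks up.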


--
Ian.
ian at digimem.net


> madeinstein at hotmail.com (Mad Einstein) wrote 
>
> I tried to do it like this:
> 
>     Element root = new Element("indexes");
>     Document doc = new Document(root);   //sample JDom Document
> 
>     FileWriter fw = new FileWriter("test.xml",false);
>     new XMLOutputter(" ", true, "UTF-8").output(doc,fw);
> 
> And the result was, as I said, one byte (93 hex) instead of \u201C (decimal 8220).
> 
> Should I use a different writer? Do you know any writers that will give me
> proper Unicode output?
> 
> Thanks,
> 
> Mad Einstein
> 
> ----- Original Message -----
> From: "Jason Hunter" <jhunter at servlets.com>
> To: "Mad Einstein" <madeinstein at hotmail.com>
> Cc: <jdom-interest at jdom.org>
> Sent: Tuesday, July 02, 2002 8:06 PM
> Subject: Re: [jdom-interest] XMLOutputter problems with Unicode
> 
> 
> > Your solution is one approach.  However, if you simply leave the
> > outputter's encoding as UTF-8 (the default) and pass in an output stream
> > or a writer designed for UTF-8, then characters are encoded correctly
> > without needing to be escaped.  That should be faster than your
> > solution.  If you don't see that happening, you probably passed in an
> > improper writer or changed the encoding.
> >
> > -jh-
> >
> > > Mad Einstein wrote:
> > >
> > > ???
> > > > The current XMLOutputter class (Version 8) doesn't support Unicode
> > > > characters with code points above 127.
> > >
> > > > I was trying to save the character \u201C (decimal 8220) to XML using
> > > > XMLOutputter, and as a result the file contained one byte (93 hex)
> > > > instead of two bytes. After that I couldn't parse the file with
> > > > SAXBuilder, and I couldn't open it in Internet Explorer either.
> > >
> > > > I was reading about different algorithms that convert Unicode to XML
> > > > and HTML, and I think this one is the best:
> > >
> > > ----------------------------------------------------------------------
> > > http://czyborra.com/utf/#UTF-8
> > >
> > > HTML's Numerical Character References
> > >
> > > A somewhat more standardized encoding option is specified by HTML. RFC
> > > 2070 allows us to reference just any Unicode character within any HTML
> > > document of any charset by using the decimal numeric character
> > > > reference &#12345; as in:
> > >
> > > putwchar(c)
> > > {
> > >   if (c < 0x80 && c != '&' && c != '<') putchar(c);
> > >   else printf ("&#%d;", c);
> > > }
> > >
> > > Decimal numbers for Unicode characters are also used in Windows NT's
> > > Alt-12345 input method but are still of so little mnemonic value that
> > > > a hexadecimal alternative &#x1bc; is being supported by the newer
> > > standards HTML 4.0 and XML 1.0. Apart from that, hexadecimal numbers
> > > aren't that easy to memorize either. SGML has long allowed symbolic
> > > > character entities for some character references like &eacute; for é
> > > > and &euro; for the €, but the table of supported entities differs
> > > from browser to browser.
> > >
> > > ----------------------------------------------------------------------
> > >
> > > > I wrote this method for the conversion.
> > > >
> > > > It converts these 3 characters (&, <, >), as well as all characters
> > > > above 127, to decimal character references of the form &#NNN;. Now
> > > > the output works with any parser supporting XML 1.0.
> > >
> > > /**
> > >  * Converts Unicode Character to HTML Decimal Entity.
> > >  * All Characters with hashcode less than 128(decimal) apart from
> > >  * '>','<' and '&' are the same.. The rest is converted to decimal
> > > entity &#{char_hashcode};
> > >  * Supported formats examples:
> > >  * <br> /u003F  --> ?
> > >  * @param value Unicode Character
> > >  * @return Converted HTML Character or Entity.
> > >  */
> > >   public String convertTEXTtoHTML(char value)
> > >   {
> > >      String temp = null;
> > >      char b[] = new char[1];
> > >      int bint = new Character(value).hashCode();
> > >
> > >
> if((bint<128)&&(bint!="&".hashCode())&&(bint!="<".hashCode())&&(bint!=">".ha
> shCode()))
> > >      {
> > > //       b[0] = value;
> > > //       temp = new String(b);
> > >        temp = null;
> > >      }
> > >      else
> > >       temp = "&#"+ bint +";";
> > >      return temp;
> > >   }
> > >
> > > > and I changed the default case in the
> > > > XMLOutputter.escapeElementEntities(String str) method:
> > >
> > >    default :
> > >        entity = convertTEXTtoHTML(ch);
> > >        break;
> > >
> > > > Maybe there is a different solution for this problem, but it works
> > > > fine.
> > >
> > > Mad Einstein
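
For comparison, the escaping approach described in the original message can be sketched as a standalone method. This is not JDOM code: the class name NumericReferenceSketch and the method escape are hypothetical, and iterating by code point (rather than by char, as the quoted method does) is a substitution made here so that characters outside the Basic Multilingual Plane come out as a single reference.

```java
public class NumericReferenceSketch {

    /**
     * Escapes '&', '<', '>' and every non-ASCII code point as a decimal
     * character reference of the form &#NNN;. Everything else is copied.
     */
    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt joins a surrogate pair into one code point.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 128 && cp != '&' && cp != '<' && cp != '>') {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("\u201Ca < b\u201D"));
    }
}
```

As noted earlier in the thread, this is only needed when the output encoding cannot represent the characters. With a writer that really produces UTF-8, no escaping beyond the markup characters is required.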


