[jdom-interest] Building from file with UTF-8 extended characters

Jason Hunter jhunter at acm.org
Mon Nov 19 21:04:46 PST 2001


Count the characters using getText().length().  That will let you know
if it's being stored as a single Unicode char or not.
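
As a rough sketch of that check, using only the JDK rather than JDOM:
the two bytes from the original file are decoded as UTF-8 and the
resulting String is measured. The class and method names here are made
up for illustration; with JDOM the String would come from getText() on
the element instead.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthCheck {
    // Decode raw bytes as UTF-8, which is what a parser does for a
    // document declared with encoding="UTF-8".
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The two-byte sequence from the question: C5 95.
        byte[] raw = { (byte) 0xC5, (byte) 0x95 };
        String text = decodeUtf8(raw);

        // One char means the sequence was decoded into a single Unicode char.
        System.out.println("length = " + text.length());   // length = 1
        System.out.println("char   = U+"
                + String.format("%04X", (int) text.charAt(0)));   // char   = U+0155
    }
}
```

A length of 2 would instead suggest the bytes were read with a
single-byte encoding somewhere along the way.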

On output the document is encoded as UTF-8 again, so the char is written
back as the 2-byte sequence. Seeing two bytes there doesn't mean the
parse failed.
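
A minimal sketch of why two bytes on output are expected, JDK-only and
with hypothetical names. It also shows one way the output can look like
two separate characters: ByteArrayOutputStream.toString() with no
argument decodes with the platform default charset, and ISO-8859-1 is
used here only as a stand-in for a non-UTF-8 default.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Encode a String as UTF-8, as happens when a UTF-8 document is written.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Number of chars seen when the bytes are decoded as UTF-8.
    static int lengthAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).length();
    }

    // Number of chars seen when the same bytes are decoded as ISO-8859-1
    // (assumed here as an example of a non-UTF-8 platform default).
    static int lengthAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1).length();
    }

    public static void main(String[] args) {
        byte[] out = encodeUtf8("\u0155");   // the single char U+0155

        System.out.println("bytes written    = " + out.length);          // 2
        System.out.println("chars as UTF-8   = " + lengthAsUtf8(out));   // 1
        System.out.println("chars as Latin-1 = " + lengthAsLatin1(out)); // 2
    }
}
```

If the bytes need to go back into a String for inspection,
baos.toString("UTF-8") names the charset explicitly instead of relying
on the platform default.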

-jh-

Fred Clewis wrote:
> 
> I have an XML file with encoding="UTF-8" that is mostly one byte ASCII but
> has one element text value that is a two-byte UTF-8 character  X"C595".
> When I use "jdom b7 (+ recent CVS update), xerces 2.0 beta 2" to read the
> file in and build a document with code like:
> 
> SAXBuilder builder = new SAXBuilder();
> builder.setFeature("http://apache.org/xml/features/allow-java-encodings",
> true);
> builder.setValidation(false);
> Document doc = builder.build(new FileInputStream(xmlFile));
> 
> I would expect the parser to change the 2 byte UTF-8 character, X"C595", to
> its Unicode equivalent, X"0155".
> Is that right?  How could I verify its Unicode value in Java?  I need to
> build an MQSeries message with it.  At the moment, I am not sure if I want
> the MQSeries to be in Unicode or UTF-8 form, but I think I am not parsing
> it in correctly.
> 
> Attempts to write it out before sending with:
> 
> 1.  doc.toString()
> or
> 2. ByteArrayOutputStream baos = new ByteArrayOutputStream();
>   XMLOutputter xmlOut = new XMLOutputter();
>   xmlOut.output(doc, baos);
>   return baos.toString();
> 
> seem to indicate it is still treated as two separate bytes "C5" "95"
> 
> thanks for any ideas,


