[jdom-interest] Building from file with UTF-8 extended characters

Fred Clewis clewisf at us.ibm.com
Thu Nov 15 07:34:31 PST 2001


I have an XML file with encoding="UTF-8" that is mostly one-byte ASCII but
has one element text value that is a two-byte UTF-8 character, X"C595".
When I use JDOM b7 (plus a recent CVS update) with Xerces 2.0 beta 2 to
read the file in and build a document with code like:

SAXBuilder builder = new SAXBuilder();
builder.setFeature("http://apache.org/xml/features/allow-java-encodings",
                   true);
builder.setValidation(false);
Document doc = builder.build(new FileInputStream(xmlFile));

I would expect the parser to convert the two-byte UTF-8 character, X"C595",
to its Unicode equivalent, X"0155".
Is that right?  How could I verify its Unicode value in Java?  I need to
build an MQSeries message with it.  At the moment, I am not sure whether I
want the MQSeries message to be in Unicode or UTF-8 form, but I suspect I am
not parsing the character in correctly.
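
A minimal sketch of one way to check the Unicode value in Java, assuming a
JDK recent enough to have java.nio.charset.StandardCharsets. The class and
method names (Utf8Check, decodeUtf8, hexCodeUnits) are hypothetical. It
decodes the two bytes as UTF-8 and prints each resulting char as hex; the
same hexCodeUnits helper could be given the String returned by getText() on
the element in question.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Decode raw UTF-8 bytes into a Java String (UTF-16 code units internally).
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Render each UTF-16 code unit of the string as four hex digits.
    public static String hexCodeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xC5, (byte) 0x95 };
        String text = decodeUtf8(raw);
        System.out.println(text.length());      // prints 1
        System.out.println(hexCodeUnits(text)); // prints 0155
    }
}
```

If the parsed text has length 1 and prints 0155, the character was decoded
as expected; a length of 2 with 00C5 0095 would suggest the bytes were read
as a single-byte encoding instead.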

Attempts to write it out before sending with:

1. doc.toString()
or
2. ByteArrayOutputStream baos = new ByteArrayOutputStream();
   XMLOutputter xmlOut = new XMLOutputter();
   xmlOut.output(doc, baos);
   return baos.toString();

seem to indicate it is still treated as two separate bytes, "C5" and "95".
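
A hedged sketch of why the output can look like two bytes even when the
character is held correctly in memory: U+0155 encoded as UTF-8 is exactly
C5 95, and ByteArrayOutputStream.toString() with no argument decodes those
bytes using the platform default charset. The example below does not use
JDOM; it assumes the serialized bytes are UTF-8 and simulates them directly.
The names OutputCheck, writeUtf8, decode and hexBytes are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class OutputCheck {
    // Stand-in for serialized output: writes the string as UTF-8 bytes.
    public static ByteArrayOutputStream writeUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        baos.write(bytes, 0, bytes.length);
        return baos;
    }

    // Decode the buffered bytes with an explicitly named charset.
    public static String decode(ByteArrayOutputStream baos, String charsetName) {
        return new String(baos.toByteArray(), Charset.forName(charsetName));
    }

    // Render bytes as uppercase hex pairs.
    public static String hexBytes(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ByteArrayOutputStream baos = writeUtf8("\u0155");
        System.out.println(hexBytes(baos.toByteArray()));        // prints C595
        System.out.println(decode(baos, "ISO-8859-1").length()); // prints 2
        System.out.println(decode(baos, "UTF-8").length());      // prints 1
    }
}
```

Under these assumptions, the bytes C5 95 in the stream are the expected
UTF-8 form of X"0155"; they only turn into two characters when the bytes are
decoded with a single-byte charset, so naming the charset explicitly when
converting the stream to a String is the safer choice.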

Thanks for any ideas,



