[jdom-interest] Re: Getting original Encoding and changing the default UTF-8

Jason Hunter jhunter at xquery.com
Fri Sep 10 01:19:22 PDT 2004

If you get a document with mixed Shift_JIS and Latin-1 content you don't 
have many output choices: only the Unicode encodings (UTF-8, UTF-16, 
UCS-2, etc.) can represent both.  Your best bet is usually UTF-8.  
That's why XML parsers assume it by default and why XMLOutputter outputs 
it by default.  If you have a good parser and use UTF-8 encoding, it 
won't bomb out on any of the basic Unicode (pre-surrogate pair) chars.
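
As a rough illustration of why the choices narrow, here is a minimal 
sketch that asks the JDK's charset encoders whether they can represent a 
string mixing Swedish letters and Japanese kana.  It uses only 
java.nio.charset, not JDOM; the class name EncodingChoice and the sample 
text are made up for the example.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
class EncodingChoice {
    // True if every character of text can be represented in the named charset.
    static boolean canEncode(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Swedish letters (Latin-1 range) followed by Japanese katakana.
        String mixed = "\u00e5\u00e4\u00f6 \u30c6\u30b9\u30c8";
        String[] candidates = {"ISO-8859-1", "Shift_JIS", "UTF-8", "UTF-16"};
        for (String name : candidates) {
            // Some runtimes ship without the extended charsets, so check first.
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this runtime");
                continue;
            }
            System.out.println(name + " can encode mixed text: " + canEncode(name, mixed));
        }
    }
}
```

Latin-1 handles the Swedish letters on their own but not the kana, so 
for the mixed string only the Unicode encodings are left.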


Young Matthew wrote:

> Hi,
> Exactly.  We have child documents that get included which should declare a
> certain ISO encoding but don't.  Then the default takes over and the Swedish
> characters bomb the parser.
> The simplest thing is to require our projects to deliver documents with the
> correct encoding.
> I find it odd that when transforming with XSLT (say Xalan) the encoding of
> the style sheet overrides that of all the input XML documents.  Seems like XML
> parsers should apply the same principle with "included" child documents to a
> parent XML.  If the main XML says the encoding should be XYZ then regardless of
> what is stated in the headers of subdocuments the document gets translated
> with XYZ encoding.
> / Matthew
> Jason Hunter  (2004-09-10  09:53):
> Young Matthew wrote:
>>Regarding the default encoding, I was thinking more about the front end than
>>about printing.  In other words, before parsing a document it would be cool
>>if I could shift the encoding to something other than UTF-8 to handle
>>Swedish characters.
> XML files generally have their encoding listed in the declaration if
> they're not UTF-8.  So the parser automatically can determine the proper
> encoding to use.  Getting the data in correctly isn't an issue; the
> issue arises if you want to encode the document the same way on output
> instead of using the universal UTF-8 encoding.  SAX doesn't report what
> the original encoding was, just returns the already-decoded characters.
> Another builder, like an XNI builder, could report the encoding.  The
> Document class doesn't currently have an encoding property but we could
> add one if we had a parser that reported it.  That is, assuming it's a
> document-level notion.  The story's less clear when pulling together
> elements from multiple documents.  If the original Document node was
> Latin-1 but you included an Element from a Shift_JIS document, you can't
> reliably assume Latin-1 for the new document.
> -jh-
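
For what it's worth, both halves of the quoted exchange can be sketched 
with plain JAXP, outside of JDOM.  The sketch below assumes a DOM Level 
3 capable parser (the one bundled with the JDK qualifies), where 
Document.getXmlEncoding() returns the encoding named in the XML 
declaration, and it relies on the SAX rule that a parser given a 
character stream through InputSource does not do its own byte decoding.  
The class and method names (EncodingProbe, declaredEncoding, parseAs) 
are hypothetical, and none of this is a JDOM API.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class EncodingProbe {
    private static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // Returns the encoding named in the XML declaration, or null if none was given.
    static String declaredEncoding(byte[] xmlBytes) {
        try {
            Document doc = newBuilder().parse(new ByteArrayInputStream(xmlBytes));
            return doc.getXmlEncoding();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Decodes the bytes with the caller's charset before the parser sees them,
    // for documents that should have declared an encoding but did not.
    static Document parseAs(byte[] xmlBytes, String charsetName) {
        try {
            InputSource in = new InputSource(
                    new InputStreamReader(new ByteArrayInputStream(xmlBytes), charsetName));
            return newBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] declared = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><ort>G\u00f6teborg</ort>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("declared encoding: " + declaredEncoding(declared));

        // Latin-1 bytes with no declaration: handed to the parser as raw bytes
        // they would be read as UTF-8, so decode them explicitly instead.
        byte[] bare = "<ort>G\u00f6teborg</ort>".getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parseAs(bare, "ISO-8859-1");
        System.out.println("text: " + doc.getDocumentElement().getTextContent());
    }
}
```

Forcing the charset this way only helps when you actually know what the 
undeclared document was written in; it doesn't settle which encoding to 
pick once elements from differently encoded documents are combined.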

