[jdom-interest] Re: Getting original Encoding and changing the default UTF-8
Jason Hunter
jhunter at xquery.com
Fri Sep 10 01:19:22 PDT 2004
If you get a document with mixed Shift_JIS and Latin-1 content, you don't
have many output choices, since only a Unicode encoding can represent both:
UTF-8, UTF-16, UCS-2, etc. Your best bet is usually UTF-8. That's why XML
parsers assume it by default and why XMLOutputter outputs it by default.
If you have a good parser and use UTF-8 encoding, it won't bomb out on any
of the basic Unicode (pre-surrogate pair) chars.
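
A minimal sketch of that point, using only the JDK (no JDOM). The class
name, the helper canRepresent and the sample string are made up for
illustration. It just asks each charset's encoder whether it can represent
text that mixes Japanese and Swedish characters:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCoverage {
    /** Returns true if every character of text can be encoded in the given charset. */
    static boolean canRepresent(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Hypothetical mixed content: Japanese text (as from a Shift_JIS source)
        // plus Swedish letters (as from a Latin-1 source), written with escapes
        // so the source file itself stays plain ASCII.
        String mixed = "\u65E5\u672C\u8A9E och \u00E5\u00E4\u00F6";

        System.out.println("ISO-8859-1 covers it: "
                + canRepresent(mixed, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 covers it:      "
                + canRepresent(mixed, StandardCharsets.UTF_8));
    }
}
```

Latin-1 has no code points for the Japanese characters, so only the UTF-8
check should come back true for the mixed string.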
-jh-
Young Matthew wrote:
> Hi,
>
> Exactly. We have child documents that get included which should have a certain
> ISO encoding but don't. Then the default takes over and the Swedish characters
> bomb the parser.
>
> The simplest thing is to require our projects to deliver documents with the
> correct encoding.
>
> I find it odd that when transforming with XSLT (say Xalan) the encoding of
> the style sheet overrides all of the input XML documents. Seems like XML
> parsers should apply the same principle with "included" child documents to a
> parent XML. If the main XML says the encoding should be XYZ then regardless of
> what is stated in the headers of subdocuments the document gets translated
> with XYZ encoding.
>
>
> / Matthew
> Jason Hunter (2004-09-10 09:53):
> Young Matthew wrote:
>
>
>>Hi,
>>
>>Regarding the default encoding, I'm thinking more of the front end and not of
>>printing. In other words, before parsing a document it would be cool if I could
>>shift the encoding to something other than UTF-8 to handle Swedish characters.
>
>
> XML files generally have their encoding listed in the declaration if
> they're not UTF-8. So the parser automatically can determine the proper
> encoding to use. Getting the data in correctly isn't an issue; the
> issue arises if you want to encode the document the same way on output
> instead of using the universal UTF-8 encoding. SAX doesn't report what
> the original encoding was, just returns the already-decoded characters.
>
> Another builder, like an XNI builder, could report the encoding. The
> Document class doesn't currently have an encoding property but we could
> add one if we had a parser that reported it. That is, assuming it's a
> document-level notion. The story's less clear when pulling together
> elements from multiple documents. If the original Document node was
> Latin-1 but you included an Element from a Shift_JIS document, you can't
> reliably assume Latin-1 for the new document.
>
> -jh-
>
>
>
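
To illustrate the quoted point that the declaration drives the decoding on
input, here is a rough sketch using the JDK's built-in DOM parser rather
than JDOM. The class and method names and the sample document are
hypothetical. It parses Latin-1 bytes once with an encoding declaration and
once without. It also reads the declared encoding back through DOM Level 3's
Document.getXmlEncoding(), which is a DOM feature and not something the
JDOM Document class discussed above provides:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DeclaredEncoding {
    /** Parses raw bytes; the parser chooses the decoder from the XML declaration. */
    static Document parse(byte[] xmlBytes) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
    }

    /** Returns true if the bytes parse without an error. */
    static boolean parsesCleanly(byte[] xmlBytes) {
        try {
            parse(xmlBytes);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String body = "<ort>Malm\u00F6</ort>";

        // Latin-1 bytes WITH a matching declaration: decoded correctly.
        byte[] declared = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body)
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(declared);
        System.out.println(doc.getDocumentElement().getTextContent());
        System.out.println(doc.getXmlEncoding());

        // The same Latin-1 bytes WITHOUT a declaration: the parser assumes
        // UTF-8, and the lone 0xF6 byte is not valid UTF-8.
        byte[] undeclared = body.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("undeclared parses: " + parsesCleanly(undeclared));
    }
}
```

The second case is the situation described earlier in the thread, where
included documents lack the declaration they should have and the Swedish
characters bomb the parser.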