[jdom-interest] CDATA inconsistency

Malachi de AElfweald malachi at tremerechantry.com
Sat Nov 2 21:22:02 PST 2002


OK, so the real issue is that the characters are being added as
raw binary data rather than as decoded Java chars? That would mean
the problem lies in how the data is read from the original source?

So, if the original data were read through a BufferedReader wrapping
an InputStreamReader that uses the correct encoding, would that not
ensure that the data had correctly matched surrogate pairs internally,
since they would all be valid Java characters?
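
To make the question concrete, here is a minimal sketch of what I mean,
assuming the source is UTF-8. The class name DecodeDemo and the helper
readUtf8 are made up for illustration, and it uses StandardCharsets and
the Character surrogate methods from current JDKs:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Hypothetical helper: decodes UTF-8 bytes through a Reader, so the
    // byte-to-char conversion is done by Java's own decoder.
    static String readUtf8(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c); // read() returns one UTF-16 code unit
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D11E (musical symbol G clef) is four bytes in UTF-8.
        byte[] bytes = {(byte) 0xF0, (byte) 0x9D, (byte) 0x84, (byte) 0x9E};
        String s = readUtf8(bytes);
        System.out.println(s.length());                             // 2
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1d11e
    }
}
```

As far as I can tell, the decoder produces both halves of the pair
together, so text read this way should start out well formed. It would
not protect against code that later splits or truncates the string
between the two halves.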

Malachi



11/2/2002 3:33:25 PM, Elliotte Rusty Harold <elharo at metalab.unc.edu> wrote:

>At 11:32 AM -0800 11/2/02, Malachi de AElfweald wrote:
>>"unmatched halves of surrogate pairs".... That would be assuming 
>>UTF-8 specifically,
>>would it not? ISO-8859-1, for example, does not have surrogate pairs.
>>
>
>No, it's assuming Java. A Java char is *not* a Unicode character. It 
>is a UTF-16 code unit. In UTF-16 (UTF-8 does not use surrogate 
>pairs), characters from outside the Basic Multilingual Plane (BMP) 
>are represented as two consecutive surrogate characters, an upper 
>half and a lower half.  (I can never remember which is which.) 
>However, neither Java nor JDOM does any checking to make sure the 
>surrogates match up like they're supposed to.  It just assumes each 
>char is legal.
>-- 
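
Since nothing checks the pairing automatically, a caller who wants that
guarantee has to scan the text itself. A rough sketch of such a scan
follows; the class SurrogateCheck and the method firstUnmatchedSurrogate
are hypothetical names, not part of JDOM, and the Character methods used
need Java 5 or later:

```java
class SurrogateCheck {
    // Returns the index of the first unmatched surrogate in s,
    // or -1 if every high surrogate is followed by a low surrogate.
    static int firstUnmatchedSurrogate(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip the low half
                } else {
                    return i; // high half with no low half after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low half with no high half before it
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        char high = (char) 0xD834; // high half of U+1D11E
        char low = (char) 0xDD1E;  // low half of U+1D11E
        String paired = new String(new char[] {'a', high, low, 'b'});
        String broken = new String(new char[] {'a', high, 'b'});
        System.out.println(firstUnmatchedSurrogate(paired)); // -1
        System.out.println(firstUnmatchedSurrogate(broken)); // 1
    }
}
```

Running a check like this on text before putting it into a CDATA section
or text node would catch the unmatched halves described above.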





