[jdom-interest] Fwd: XML 1.1 -- Please stab me with a dull knife and trample my dead body

Bjorn Roche bjorn at xowave.com
Fri Sep 7 15:27:28 PDT 2012

On Sep 7, 2012, at 4:43 PM, Canadian Wilf wrote:

> I can do this:
> String random = new String(someRandomByte[]) 

Let me address this by pointing out a couple of degenerate cases. Strings in Java are not null-terminated (they carry their own length), so a zero byte in someRandomByte will not cut the string short. In most charsets it will decode to a U+0000 char sitting in the middle of "random", though, which is probably not what you wanted and which XML cannot carry. Another example is if someRandomByte ends partway through a multi-byte character. What happens then? So, yes, you can construct a string from a byte array like you did here, but please don't! RTFM: "The behavior of this constructor when the given bytes are not valid in the default charset is unspecified." Unspecified. As in "it might delete your hard drive, log on to Facebook and unfriend your wife." That's what unspecified means, so those bytes need to be "sanitized" too.
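
If you want that failure to be loud rather than unspecified, here is a minimal sketch. It assumes the bytes are supposed to be UTF-8 (my assumption; the snippet above relies on the platform default charset), and the class and method names are made up for illustration. It uses a CharsetDecoder configured to REPORT, so invalid input throws instead of being silently replaced:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Decodes bytes as UTF-8 and throws on invalid input
    // instead of quietly substituting replacement characters.
    public static String decodeUtf8(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // First two bytes of a three-byte UTF-8 sequence: the input
        // ends partway through a character.
        byte[] truncated = {(byte) 0xE2, (byte) 0x82};
        try {
            System.out.println("decoded: " + decodeUtf8(truncated));
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }
    }
}
```

This only tells you whether the bytes are valid text. It does not make them legal XML content.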

If that's the kind of data you want to put in XML (raw, random-assed binary), use Base64!
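
A rough sketch of what that looks like, assuming Java 8 or later for java.util.Base64 (on older JVMs you would need a third-party encoder). The class and method names are only for illustration:

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64Text {
    // Turns arbitrary bytes into plain ASCII text that is safe as XML content.
    public static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Recovers the original bytes from the Base64 text.
    public static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Bytes that would be trouble as text: zeros, a control char,
        // and a truncated multi-byte sequence.
        byte[] raw = {0, 0, 0x17, (byte) 0xFF, (byte) 0xE2};
        String text = encode(raw);
        System.out.println(text);
        System.out.println(Arrays.equals(raw, decode(text)));
    }
}
```

The encoded string contains only letters, digits, '+', '/' and '=', so it can go straight into setText() and be decoded again on the way out.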

> However, the string cannot be passed to the Text of an XML Element since it may contain illegal characters (<= 0X20 ascii, vertical tab, etc.) This will fail:
> new Element("test").setText(random)
> XOM and JDOM both restrict the access and will throw IllegalDataException if one of the characters (0x0--0xFFFF) is not in XML Unicode specs.

First off, I think maybe you should read this because we are not talking about 0x0 to 0xFFFF: http://www.joelonsoftware.com/articles/Unicode.html

Secondly, yes, there are values that must be escaped in XML: < and &, for example (and > in some contexts), for obvious reasons, but the library does this for you. Then there are values you can't put into XML at all. The NULL character (the one C uses as a string terminator) is one. Yes, that's right, you can't put 0x00 in XML text, not even as a character reference, even though a Java string will happily hold it! OMG! Stop the presses! I also find this annoying, and have been bitten by it (I think it was 0x17 or something), but that's life.

I agree, however, that it would be nice to have some clarity on exactly what's allowed.
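
For what it's worth, the XML 1.0 spec's Char production allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. Here is a sketch of a pre-check based on that; the class and method names are hypothetical, and this is not how JDOM or XOM do it internally:

```java
public class XmlChars {
    // True if the code point matches the XML 1.0 "Char" production.
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first char that XML 1.0 cannot carry,
    // or -1 if the whole string is fine. Walks by code point so that
    // valid surrogate pairs are accepted and lone surrogates are rejected.
    public static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        String clean = "hello\tworld";
        String dirty = "ab" + (char) 0x17 + "c";
        System.out.println(firstIllegalIndex(clean));
        System.out.println(firstIllegalIndex(dirty));
    }
}
```

XML 1.1 is more permissive about control characters written as character references, but 0x00 is out in both versions.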

When in doubt, use Base64!

Or create sub-elements for the weird chars, just like HTML does for, say, newlines: <br />


Bjorn Roche
Audio Collaboration
