[jdom-interest] SAXBuilder enhancement request /2

Dennis Sosnoski dms at sosnoski.com
Fri Mar 29 19:19:51 PST 2002


I've advocated this approach for document models and I'm glad to see it 
present in dom4j, but I absolutely agree that this should not be the 
default!

Most XML usage in Java programs is data-centric, though, with whitespace 
between elements used only for convenient formatting. A compliant XML 
parser reports that whitespace as content (unless you're validating, in 
which case whitespace separating elements may be reported as ignorable 
whitespace), resulting in a lot of extra nodes in the document 
tree. This adds substantial overhead without contributing anything 
useful as far as the application is concerned.
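
To make the overhead concrete, here is a minimal sketch that counts the 
children of a root element using the JDK's own DOM parser (not JDOM or 
dom4j). The class name WhitespaceNodes and the sample document are made 
up for illustration, and the parser is assumed to be the default 
non-validating one.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class WhitespaceNodes {
    /** Returns {element children, whitespace-only text children} of the root. */
    static int[] countRootChildren(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList kids = doc.getDocumentElement().getChildNodes();
        int elements = 0;
        int blanks = 0;
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                elements++;
            } else if (n.getNodeType() == Node.TEXT_NODE
                    && n.getNodeValue().trim().isEmpty()) {
                blanks++; // formatting whitespace reported as content
            }
        }
        return new int[] { elements, blanks };
    }

    public static void main(String[] args) throws Exception {
        // Two data elements, indented purely for readability.
        String xml = "<order>\n  <id>42</id>\n  <qty>3</qty>\n</order>";
        int[] c = countRootChildren(xml);
        System.out.println(c[0] + " elements, " + c[1]
                + " whitespace-only text nodes");
    }
}
```

For this small document the whitespace-only text nodes outnumber the 
elements that actually carry data.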

Many applications also want to ignore leading and trailing whitespace in 
character data content. An example of this type of usage is the web.xml 
file used by servlet applications. Recent versions of the servlet spec 
require implementations to strip all leading and trailing whitespace from 
the content of elements.

I'd personally recommend two options - one to discard character data 
sequences consisting only of whitespace, the other to strip leading and 
trailing whitespace from character data content. It could also be done 
using a filter, as ERH suggests, though this might be a little more 
complicated - for stripping trailing whitespace you'd need to make sure 
you have the entire character data sequence available, rather than just 
a portion.
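
A rough sketch of the filter approach, using only the standard SAX 
classes shipped with the JDK. The class WhitespaceFilter, its two flags 
and the textEvents helper are hypothetical names, not part of JDOM or 
dom4j. The filter buffers each run of character data until the next 
element boundary, so the two options are applied to the whole sequence 
rather than to whatever chunk the parser happens to deliver. It only 
handles characters() events; ignorable whitespace, comments and 
processing instructions are left alone.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter, for illustration only.
class WhitespaceFilter extends XMLFilterImpl {
    private final boolean dropBlank; // discard whitespace-only sequences
    private final boolean trim;      // strip leading/trailing whitespace
    private final StringBuilder buf = new StringBuilder();

    WhitespaceFilter(XMLReader parent, boolean dropBlank, boolean trim) {
        super(parent);
        this.dropBlank = dropBlank;
        this.trim = trim;
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        // A parser may split one sequence across several calls, so collect
        // the pieces and decide only once the sequence is complete.
        buf.append(ch, start, len);
    }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        flush();
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        flush();
        super.endElement(uri, local, qName);
    }

    /** Applies the two options to the buffered text and forwards the rest. */
    private void flush() throws SAXException {
        if (buf.length() == 0) return;
        String text = buf.toString();
        buf.setLength(0);
        String stripped = text.trim();
        if (dropBlank && stripped.isEmpty()) return;
        if (trim) text = stripped;
        // Trimming a blank sequence leaves nothing worth reporting.
        if (!text.isEmpty()) {
            super.characters(text.toCharArray(), 0, text.length());
        }
    }

    /** Demo helper: parses xml and returns the text events that survive. */
    static List<String> textEvents(String xml, boolean dropBlank, boolean trim)
            throws Exception {
        XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        WhitespaceFilter filter = new WhitespaceFilter(parser, dropBlank, trim);
        final List<String> out = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int len) {
                out.add(new String(ch, start, len));
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return out;
    }
}
```

With both flags off the filter passes every sequence through unchanged 
(apart from merging chunks), which matches the keep-everything default 
discussed in this thread. Note that String.trim() treats every character 
up to U+0020 as whitespace, which is slightly broader than the XML 
definition.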

It's worth noting that EXML was silently deleting whitespace between 
elements for a very long time without any of its users complaining, as 
far as I know. I finally started pointing this out to people in my 
performance comparisons because it makes EXML's results look much better 
than the code justifies.

  - Dennis

Elliotte Rusty Harold wrote:

> At 8:41 AM +0100 3/29/02, phil at triloggroup.com wrote:
>
>> After looking at DOM4J, it appears that these guys added this 
>> capability recently ("stripWhitespaceText"). This is
>> effectively very convenient when dealing with data centric document.
>> Can we add it to JDOM?
>>
>
> This makes me very nervous. It's a common misconception that white 
> space is insignificant in XML. It's not.
>
> As long as the default is to keep all space, and throwing it away 
> requires an explicit client choice, I can live with this, but please 
> put big warnings about it in the JavaDoc.
>
> And you'd have to define very carefully what space is kept and what is 
> not and document your choice. For instance, do you want to throw away 
> all white space? All white-space only text nodes? All ignorable white 
> space? These are three different things.
>
> Another thought: maybe what's needed is a more generic builder filter 
> operation that could do this and a lot more? SAX filters could 
> certainly handle it.






More information about the jdom-interest mailing list