[jdom-interest] Internal DTD subset verification

Tue Apr 30 16:24:38 PDT 2002

Ugh, verification is getting more and more bloated. Let me make one more
attempt at this.

Currently we're verifying the data in every JDOM object that gets created.
For documents built from a parser this is obviously wasteful: the parser is
already required to check every character of the document for validity, and
we're checking everything again afterwards. Ugh.

If we ignore that, and assume that we come up with some mechanism to turn
off verification for parsing, then we're left with verification when you're
building JDOM objects programmatically. How often is this verification
useful? First let's think about names. Very commonly, the names of elements
and attributes are fixed - e.g. stored as String constants in a program. If
this program creates 1000 documents, these names will currently be checked
1000 times, even though they never change. Worse, each name may get checked
dozens or hundreds of times in each document, if there are repeating
elements or attributes. (We're talking about the exact same String object,
getting checked over and over again.) The document type you're building may
have an internal DTD subset that you've stored as a constant String. Now
we're saying that in an "ideal" world, this String would actually be parsed
every time? Ugh.

Now let's consider text content, in attribute values and element content.
This content is variable much more frequently - it may even come from user
input. But the only check we do on this content is that it contains legit
XML characters - that it doesn't contain any nulls, or vertical tabs, or
invalid Unicode characters. I can't even imagine how you'd write an app that
accidentally allowed any of these characters to sneak in. The vast majority
of the time, this'll just be a waste.
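
For reference, the character check described above amounts to roughly the
following. This is only a sketch of the Char production from the XML 1.0
spec, not the actual Verifier code; the class and method names
(XmlCharCheck, isXmlChar, firstIllegalChar) are made up for illustration.

```java
// Sketch only: XmlCharCheck and its methods are hypothetical names, not JDOM API.
final class XmlCharCheck {

    // True if the code point matches the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first illegal character, or -1 if the text is clean.
    static int firstIllegalChar(String text) {
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);   // an unpaired surrogate comes back as itself
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);   // step over both halves of a surrogate pair
        }
        return -1;
    }

    public static void main(String[] args) {
        String vt = "bad" + (char) 0x0B + "tab";   // contains a vertical tab
        System.out.println(firstIllegalChar("plain text")); // prints -1
        System.out.println(firstIllegalChar(vt));           // prints 3
    }
}
```

The point is that this scan touches every character of every string, which
is exactly the work the parser will repeat later.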

Presumably this XML that's created will eventually be parsed. So all this
text that we've just verified will soon be re-verified by the parser, which
is "required by law" to fail if there are bad characters. So what's the
benefit of JDOM's verification? The programmer gets an error message as soon
as the bad text is added, rather than later when it's parsed. This is a
benefit - detecting errors sooner rather than later is always good. But the
problem is that this verification will continue to happen, even after you've
fixed the bug. In the common case where the names are fixed, you're adding
runtime code to fix compile-time bugs.

Two of the philosophies of the design of C++ were "you don't pay for stuff
you don't use", and "trust the programmer". These aren't quite as central to
the philosophy of Java, but I think they're still useful to consider, and it
seems like we're almost going out of our way here to break these rules.

An alternative would be to make the verifier optional, so you can run it
on your whole document before you output it, if you want. True, many people
wouldn't run it, but at some point we've got to trust the programmer.
Besides, usually the worst that happens is that the programmer will discover
the error as soon as the document is parsed, which almost always isn't too
much later. It's only the uncommon case where element and attribute names
are not fixed (e.g. they come from user input), that this might actually
catch runtime bugs. In that case, I think we have to trust the programmer to
verify the input themselves, or to run the JDOM verifier on the document
before outputting it. Otherwise you're making everyone pay for something
that will only benefit a minority of programmers.
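
To make the idea concrete, here is a rough sketch of what an optional,
run-it-when-you-want verifier could look like. Everything here is
hypothetical: Node stands in for a JDOM element, OnDemandVerifier is not an
existing class, and the name check is deliberately simplified (ASCII only)
compared to the real XML Name rules.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: Node and OnDemandVerifier are hypothetical stand-ins, not JDOM classes.
final class Node {
    final String name;
    final String text;
    final List<Node> children = new ArrayList<>();

    // No checks at construction time: the caller is trusted.
    Node(String name, String text) {
        this.name = name;
        this.text = text;
    }

    Node add(Node child) {
        children.add(child);
        return this;
    }
}

final class OnDemandVerifier {

    // Walks the whole tree once and collects a message per problem found.
    static List<String> verify(Node root) {
        List<String> problems = new ArrayList<>();
        walk(root, problems);
        return problems;
    }

    private static void walk(Node n, List<String> problems) {
        if (!isPlausibleName(n.name)) {
            problems.add("bad element name: " + n.name);
        }
        if (!isLegalText(n.text)) {
            problems.add("illegal character in text of " + n.name);
        }
        for (Node child : n.children) {
            walk(child, problems);
        }
    }

    // Simplified assumption: ASCII letter or '_' first, then letters, digits, '_', '-', '.'.
    static boolean isPlausibleName(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
            boolean later = (c >= '0' && c <= '9') || c == '-' || c == '.';
            if (!(letter || (i > 0 && later))) {
                return false;
            }
        }
        return true;
    }

    // XML 1.0 Char production, checked per code point.
    static boolean isLegalText(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            boolean ok = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (!ok) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        Node doc = new Node("root", "hello").add(new Node("item", "one"));
        // Verification is a separate, explicit step before output.
        System.out.println(verify(doc).isEmpty() ? "ok" : "problems found");
    }
}
```

Programs with fixed names and trusted content would simply never call
verify(); programs that take names or text from user input would call it
once before outputting the document.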

Alex

> -----Original Message-----
> From: jdom-interest-admin at jdom.org
> [mailto:jdom-interest-admin at jdom.org]On Behalf Of Elliotte
> Rusty Harold
> Sent: Monday, April 29, 2002 5:51 PM
> To: jdom-interest
> Subject: [jdom-interest] Internal DTD subset verification
>
>
> I'm bothered by the failure to check the internal DTD subset for
> well-formedness. I'd like to add a checkInternalDTDSubset() method to
> the Verifier class. Initially this could call checkCharacterData() to
> make sure the internal subset doesn't contain any illegal characters
> like null or vertical tab. Longer term, however, we could add methods to
> actually verify that it's a well-formed collection of declarations. We
> could probably piggy-back off of work done elsewhere for reading these
> things such as Mark Wutka's DTDParser
> <http://www.wutka.com/dtdparser.html> or the equivalent code in Xerces
> or Crimson.
>
> Thoughts? Is this worth doing?
>