[jdom-interest] A suggested performance improvement

Alex Rosen arosen at novell.com
Mon Mar 17 18:46:49 PST 2003


Very interesting. I guess your document has lots of text content in it?

What platform and VM are you running on? It's too bad HotSpot doesn't
inline isXMLCharacter; I guess the method is too big. Making it final
doesn't help, does it?

Anyway, your suggestion seems like a good idea, though a bummer that we
have to do it.

BTW - the last two "if"s in isXMLCharacter are useless, since a Java
char is 16 bits and can never be more than 0xFFFF.
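
To illustrate the point about the unreachable tests, here is a rough
char-only check written from the XML 1.0 Char production. It is a sketch,
not the actual Verifier source; the class name CharOnlyCheck is made up
for the example.

```java
// Hypothetical stand-in, not the JDOM Verifier source.
final class CharOnlyCheck {

    // XML 1.0 Char, restricted to what a 16-bit char can hold:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    static boolean isXMLCharacter(char c) {
        if (c == '\n' || c == '\r' || c == '\t') return true;
        if (c < 0x20) return false;    // remaining control characters
        if (c <= 0xD7FF) return true;  // the common case
        if (c < 0xE000) return false;  // surrogate range
        // A char tops out at Character.MAX_VALUE (0xFFFF), so any test
        // against 0x10000 or 0x10FFFF could never be reached.
        return c <= 0xFFFD;
    }
}
```

Only 0xFFFE and 0xFFFF fall through to the final comparison and fail it.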

Which brings up another point. If I understand things correctly, JDK
1.5 will support Unicode characters larger than FFFF, which will
probably be represented by surrogate pairs, so all these isXML...
methods will need to be completely revamped at that time. (You won't be
able to check for a valid character by checking just one char.) What a
mess.
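
A rough sketch of what a surrogate-aware check might look like, assuming
a code point API along the lines of String.codePointAt and
Character.charCount (the methods available from Java 5 onward). The class
and method names here are hypothetical and not part of JDOM.

```java
// Hypothetical helper, not part of JDOM: validates text one code point
// at a time instead of one char at a time.
final class CodePointCheck {

    // XML 1.0 Char:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXMLCodePoint(int cp) {
        if (cp == 0x9 || cp == 0xA || cp == 0xD) return true;
        if (cp < 0x20) return false;
        if (cp <= 0xD7FF) return true;
        if (cp < 0xE000) return false;   // unpaired surrogate
        if (cp <= 0xFFFD) return true;
        if (cp < 0x10000) return false;  // 0xFFFE and 0xFFFF
        return cp <= 0x10FFFF;
    }

    // Returns null if the text is legal, otherwise a reason string.
    static String check(String text) {
        for (int i = 0; i < text.length(); ) {
            // A valid surrogate pair comes back as one code point;
            // a lone surrogate comes back as itself and is rejected.
            int cp = text.codePointAt(i);
            if (!isXMLCodePoint(cp)) {
                return "0x" + Integer.toHexString(cp)
                        + " is not a legal XML character";
            }
            i += Character.charCount(cp);
        }
        return null;
    }
}
```

The loop advances by one or two chars depending on the code point, which
is why a per-char test is no longer enough.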

Plus, if we ever want to support XML 1.1 when it comes out, we'll need
to figure out what to do with Verifier again - we'll need two different
versions then. If it weren't for Verifier, all we'd need to deal with is
outputting version="1.1".

Yup... I really hate the Verifier.

Alex

>>> Tom Oke <tomo at elluminate.com> 3/16/2003 8:18:44 PM >>>
I have noticed, on large XML files, that the majority of the CPU time
is going into the routines: Verifier.isXMLCharacter and 
Verifier.checkCharacterData.

I had initially modified isXMLCharacter to have it check the most
likely range of data first, to get a short exit, and this took off
about 25% of the CPU used in some large files, for the JDOM read.

However, in the thread doing the JDOM input, 62% of the time
was still in isXMLCharacter and 16% was in checkCharacterData,
which calls isXMLCharacter.

The biggest bang for the buck came from wrapping the if statement
that calls isXMLCharacter in a test for the most likely good
range. This is seen below in the two lines:

            char c = text.charAt(i);
            if (!(c > 0x1F && c < 0xD800)) {

This reduced checkCharacterData to 1.32% of the thread's CPU use,
and isXMLCharacter doesn't really show up at all.

Hopefully this is a reasonable change to submit to JDOM?

What follows is the full code for Verifier.checkCharacterData.



    public static final String checkCharacterData(String text) {
        if (text == null) {
            return "A null is not a legal XML value";
        }

        // do check
        for (int i = 0, len = text.length(); i < len; i++) {
            char c = text.charAt(i);
            // Fast path: 0x20-0xD7FF is always legal, so only characters
            // outside that range need the full isXMLCharacter test.
            if (!(c > 0x1F && c < 0xD800)) {
                if (!isXMLCharacter(c)) {
                    // Likely this character can't be easily displayed
                    // because it's a control, so we use its hexadecimal
                    // representation in the reason.
                    return ("0x" + Integer.toHexString(c)
                            + " is not a legal XML character");
                }
            }
        }

        // If we got here, everything is OK
        return null;
    }
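
The guard is only safe if every char it skips would have passed the full
test anyway. A small self-contained check of that assumption follows. The
isXMLCharacter here is a stand-in written from the XML 1.0 Char
production, not the real Verifier method, and FastPathDemo is a made-up
name.

```java
// Hypothetical demo class, not part of JDOM.
final class FastPathDemo {

    // Stand-in for Verifier.isXMLCharacter (an assumption for this demo).
    static boolean isXMLCharacter(char c) {
        if (c == '\n' || c == '\r' || c == '\t') return true;
        if (c < 0x20) return false;
        if (c <= 0xD7FF) return true;
        if (c < 0xE000) return false;
        return c <= 0xFFFD;
    }

    // True if no char in the fast-path range is rejected by the full test.
    static boolean fastPathIsSafe() {
        for (int v = 0; v <= Character.MAX_VALUE; v++) {
            char c = (char) v;
            boolean skipped = c > 0x1F && c < 0xD800;
            if (skipped && !isXMLCharacter(c)) {
                return false;
            }
        }
        return true;
    }

    // Counts how many of the 65536 char values still reach the full test.
    static int slowPathCount() {
        int n = 0;
        for (int v = 0; v <= Character.MAX_VALUE; v++) {
            char c = (char) v;
            if (!(c > 0x1F && c < 0xD800)) n++;
        }
        return n;
    }
}
```

Under that stand-in, the skipped range is entirely legal, so the guard
changes the cost of the loop but not its result.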

Tom Oke