[jdom-interest] Converting elements from a SAX stream to JDOM elements

Colin Horne colin at cdfh.org.uk
Mon Jul 20 01:32:30 PDT 2009


Hello,

Thanks for the feedback, Rolf.

I'm afraid I've yet to try your suggestion. Just before you replied, I
tried solving the problem using a different approach. I wrote a
dynamic proxy class which implemented DefaultHandler and forwarded
all calls to another DefaultHandler-extending class, which had a
boolean isInterested() method. The proxy checks isInterested() after
entering an element, and if it returns true, it replays all the calls
already received (starting from startElement()) to a new SAXHandler,
and then forwards all subsequent calls to that SAXHandler until it
reaches the end of the element. It then passes the resulting Document
to an interested class.

Unfortunately, this didn't work. I've never implemented a dynamic
proxy in Java before, and it seems that they can only implement
interfaces, not classes. Since DefaultHandler is a class, Java refused
to cast my proxy to it. I tried various workarounds, such as creating
my own DefaultHandlerInterface (an interface which extends the same
interfaces as DefaultHandler) and overriding the appropriate methods
to support it. I'm afraid that in the end, I gave up on this approach.
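
A minimal sketch of the limitation described above, using only the JDK:
java.lang.reflect.Proxy can implement the ContentHandler interface, and
an XMLReader accepts such a proxy, but the proxy is not an instance of
the DefaultHandler class. The names ContentHandlerProxyDemo and
loggingProxy are hypothetical, and the logging is only there to make
the forwarding visible.

```java
import java.io.StringReader;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: proxies the ContentHandler *interface*.
class ContentHandlerProxyDemo {

    /** Returns a proxy that records each callback name, then forwards the call. */
    static ContentHandler loggingProxy(ContentHandler target, List<String> log) {
        InvocationHandler forwarder = (proxy, method, args) -> {
            log.add(method.getName());
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow whatever the target threw
            }
        };
        return (ContentHandler) Proxy.newProxyInstance(
                ContentHandlerProxyDemo.class.getClassLoader(),
                new Class<?>[] { ContentHandler.class },
                forwarder);
    }

    /** Parses xml through the proxy and returns the recorded callback names. */
    static List<String> run(String xml) throws Exception {
        List<String> log = new ArrayList<>();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // setContentHandler() takes the interface, so the proxy is accepted here.
        reader.setContentHandler(loggingProxy(new DefaultHandler(), log));
        reader.parse(new InputSource(new StringReader(xml)));
        return log;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<stream><element>x</element></stream>"));
    }
}
```

The sketch goes through XMLReader.setContentHandler() because APIs that
require a DefaultHandler instance cannot take the proxy.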

If I understand your suggestion correctly, then I have a couple of
worries before implementing it. My understanding is that you suggest
synchronising startElement()/endElement() with the InputStream. That
is, when startElement() is called, I would call another method,
startDuplicatingStream(), on a class which extends InputStream. This
returns a new InputStream, which mirrors the original input stream
until the controlling class sees the appropriate endElement() call.
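
A simplified sketch of such a mirroring stream, under stated
assumptions: the class name MirroringInputStream and the methods
startMirroring()/stopMirroring() are hypothetical, and the mirrored
bytes are collected in memory and handed back as a second stream
afterwards, rather than being served as a live child stream. It only
mirrors bytes that pass through read(), so it is exposed to the
buffering concern raised below.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: copies everything read between start and stop.
class MirroringInputStream extends FilterInputStream {
    private ByteArrayOutputStream mirror; // non-null only while mirroring

    MirroringInputStream(InputStream in) {
        super(in);
    }

    /** Begins copying every byte that is subsequently read. */
    void startMirroring() {
        mirror = new ByteArrayOutputStream();
    }

    /** Stops copying and returns the mirrored bytes as a new stream. */
    InputStream stopMirroring() {
        if (mirror == null) {
            throw new IllegalStateException("not mirroring");
        }
        byte[] copy = mirror.toByteArray();
        mirror = null;
        return new ByteArrayInputStream(copy);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0 && mirror != null) {
            mirror.write(b);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0 && mirror != null) {
            mirror.write(buf, off, n);
        }
        return n;
    }

    /** Reads two bytes, mirrors the next three, and returns the mirrored text. */
    static String demo() throws IOException {
        byte[] data = "abcdef".getBytes(StandardCharsets.US_ASCII);
        MirroringInputStream in = new MirroringInputStream(new ByteArrayInputStream(data));
        in.read();
        in.read();
        in.startMirroring();
        in.read(new byte[3]);
        return new String(in.stopMirroring().readAllBytes(), StandardCharsets.US_ASCII);
    }
}
```

In this sketch the mirror only sees what the consumer has actually
read, so a parser that reads ahead in large blocks would start and stop
the mirror at the wrong byte offsets.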

My concerns are: what if the SAX parser is performing internal
buffering of some sort? In this case, it would not be possible to know
exactly where to start mirroring the InputStream. I noticed that
DefaultHandler is given a Locator via setDocumentLocator(), but it
only reports the position in terms of column and line numbers. Whether
or not the parser buffers today, there is no documented guarantee that
its buffering behaviour will be consistent in future releases, or
across other implementations.
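
For illustration, a small sketch of what the SAX Locator exposes. A
handler receives it through setDocumentLocator(), and it offers
getLineNumber() and getColumnNumber() but no byte offset into the
underlying stream. The class name LocatorDemo is hypothetical, and a
parser is not required to supply a Locator at all, hence the null
check.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records the line on which each start tag is reported.
class LocatorDemo extends DefaultHandler {
    private Locator locator; // stays null if the parser does not supply one
    final List<String> positions = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (locator != null) {
            // Only line/column information is available, not a stream offset.
            positions.add(qName + "@line " + locator.getLineNumber());
        }
    }

    /** Parses xml and returns one "name@line N" entry per start tag. */
    static List<String> run(String xml) throws Exception {
        LocatorDemo handler = new LocatorDemo();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.positions;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<stream>\n<element/>\n</stream>"));
    }
}
```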

My other concern is that by the time startElement() is called, the
parser will have already read the <element ...> tag from the
InputStream, and so the InputStream needs to know how far back to
start mirroring. As far as I can tell, the only way to do that would
be to have another proxy class which marks the InputStream on each
callback, so that the last mark before startElement() was called
should be just before the '<'. For the reasons described above, the
proxy class cannot (I think?) be generated automatically, and would
require me to manually repeat the same code for each method
implemented by DefaultHandler (which isn't significant in terms of
effort, but looks a bit messy :-) ).
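
A sketch of what such a hand-written forwarding class might look like,
using only the JDK. The names SubtreeForwarder and Recorder are
hypothetical. Each matching subtree is replayed to a fresh delegate
ContentHandler, wrapped in startDocument()/endDocument() so that the
delegate sees it as a complete document. Only three callbacks are
forwarded here; a fuller version would repeat the same pattern for
ignorableWhitespace(), startPrefixMapping() and the rest. Whether
JDOM's SAXHandler works as the delegate is an untested assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class: forwards each matching subtree to a fresh delegate.
class SubtreeForwarder extends DefaultHandler {
    private final String targetName;
    private final Supplier<ContentHandler> delegateFactory;
    private final Consumer<ContentHandler> onComplete;
    private ContentHandler delegate; // non-null only inside a matching subtree
    private int depth;               // nesting level within that subtree

    SubtreeForwarder(String targetName, Supplier<ContentHandler> delegateFactory,
                     Consumer<ContentHandler> onComplete) {
        this.targetName = targetName;
        this.delegateFactory = delegateFactory;
        this.onComplete = onComplete;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (delegate == null && qName.equals(targetName)) {
            delegate = delegateFactory.get();
            delegate.startDocument(); // the delegate sees the subtree as a document
            depth = 0;
        }
        if (delegate != null) {
            depth++;
            delegate.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (delegate != null) {
            delegate.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (delegate == null) {
            return;
        }
        delegate.endElement(uri, localName, qName);
        if (--depth == 0) { // the matching element itself has just closed
            delegate.endDocument();
            ContentHandler finished = delegate;
            delegate = null;
            onComplete.accept(finished);
        }
    }
}

class SubtreeForwarderDemo {
    // Hypothetical stand-in for a tree-building handler: records events as text.
    static class Recorder extends DefaultHandler {
        final StringBuilder events = new StringBuilder();
        @Override public void startDocument() { events.append("[doc "); }
        @Override public void startElement(String u, String l, String q, Attributes a) {
            events.append('<').append(q).append('>');
        }
        @Override public void characters(char[] ch, int s, int n) { events.append(ch, s, n); }
        @Override public void endElement(String u, String l, String q) {
            events.append("</").append(q).append('>');
        }
        @Override public void endDocument() { events.append(" doc]"); }
    }

    /** Parses xml and returns one recorded event string per <element> subtree. */
    static List<String> run(String xml) throws Exception {
        List<String> docs = new ArrayList<>();
        SubtreeForwarder forwarder = new SubtreeForwarder("element", Recorder::new,
                h -> docs.add(((Recorder) h).events.toString()));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), forwarder);
        return docs;
    }
}
```

Because the switch happens inside startElement() and that same event is
passed on to the delegate, the delegate receives the root element of
each subtree, and nothing has to be replayed or re-read from the
InputStream.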

Please do tell me if I have the wrong end of the stick. I'm afraid
that for the time being, I'm going to resort to serialising the XML
elements. Should I implement a solution to this problem in the future,
I'll send the code to the list.

I think that this problem would do well to be documented on the FAQ,
since I imagine that it is not uncommon.

Cheers,
  Colin

2009/7/18 jdom <jdom at tuis.net>:
> Colin.
>
> My instinct would be to investigate a different approach...
>
> Perhaps a mechanism similar to how ZipInputStreams work, where the
> stream can be read as separate streams for each 'element'.
>
> Build a 'tee' or 'branched' custom InputStream between the main sax
> parser, and the underlying 'infinite' stream. This intermediate stream
> can be used to feed 'child' streams to the JDOM's sax parser, but use a
> standard sax parser to terminate the child stream using a mechanism
> similar to what you described below.
>
> This way you have just one 'infinite' stream, and you feed the contents
> to one 'global' parser which implements 'break logic' on a separate
> version of the stream which feeds JDOM. When the end of the element is
> encountered in the main stream it causes the JDOM stream to reach 'end
> of file', and the JDOM side of things can then open a new 'child' stream
> for the next 'document'.
>
> No (little) memory overhead. No need to buffer complete documents, etc.
>
> InputStreams are relatively simple to implement ;-)
>
> Rolf
>
>
>
> Colin Horne wrote:
>>
>> Hello,
>>
>> I have a long (infinite) XML stream, which I intend to parse with SAX.
>> Each individual element in the stream is small, and should be parsed
>> with JDOM:
>>
>> <stream>
>> <element>...</element>
>> <element>...</element>
>> </stream>
>>
>> So each <element> (and its children) is parsed with JDOM, but the
>> <stream> as a whole is parsed with SAX. It would be preferable if each
>> <element> does not have to be serialized to an encoded string, and if
>> elements are not processed twice (e.g., using SAX to echo the
>> <element>'s XML to a stream, which is then read by JDOM).
>>
>> I've found several references to this problem from the past, but could
>> not find a complete solution.
>>
>> My initial approach was to use the SAXHandler, like so:
>>
>>        jdomHandler = new SAXHandler() {
>>            @Override
>>            public void endElement(String uri, String localName, String qName)
>>                    throws SAXException {
>>                if (qName.equals("an element which I want JDOM to parse")) {
>>                    // change the SAX handler to myHandler
>>                } else {
>>                    super.endElement(uri, localName, qName);
>>                }
>>            }
>>        };
>>
>>
>>        myHandler = new DefaultHandler() {
>>            @Override
>>            public void startElement(String uri, String localName,
>>                    String qName, Attributes attributes) {
>>                if (qName.equals("an element which I want JDOM to parse")) {
>>                    // change the SAX handler to jdomHandler
>>                }
>>            }
>>        };
>>
>>
>> (Ignoring for now that the endElement() method needs to keep track of
>> its nesting level)
>>
>> However, trivially doing the above does not work, and fails after
>> calling jdomHandler.getDocument():
>>
>> Exception in thread "main" java.lang.IllegalStateException: Root element
>> not set
>>        at org.jdom.Document.getRootElement(Document.java:218)
>>
>>
>> I've looked at the initialization code that JDOM is normally doing in
>> the SAXBuilder.build() method, and am reluctant to copy/modify the
>> code, because I suspect it will break with future releases, and can't
>> help but wonder if it would be over-complicating things.
>>
>> Is there a Right Way(TM) to do this? If so, I might also suggest that
>> it's referenced from the FAQ.
>>
>> Many thanks,
>>  Colin


