[jdom-interest] Couple of issues I need help on, (memory, sax vs dom, etc)

Thu Jul 18 12:13:50 PDT 2002

Hi all,

I posted a while ago, but I haven't seen the message show up. Anyway, I
am right at the end of a big project and am having a couple of issues
with JDOM that I am hoping someone can shed light on. First, is there a
recent book on the JDOM API that also covers SAX 2.0? I would love to
figure out a few things about it.

Ok, so below I'll describe each issue and hopefully someone can respond
to one (or more) of them. I really appreciate any help on these areas.

1) XML DTD/Schema validation:
   For some reason, if the XML file being read in has a DOCTYPE
pointing to a URL to validate the XML file against, and the network
connection is not available, the XML file cannot be read in and parsed
at all. I tried creating the builder with the false option for
validation, but it seems to have no effect. I also found the following
option, which I think applies only to the latest Xerces 2 parser:

SAXBuilder builder = new SAXBuilder(false);
builder.setFeature(
    "http://apache.org/xml/features/nonvalidating/load-external-dtd",
    false);

I read that using SAXBuilder over DOMBuilder is better, since all parser
implementations currently use SAX to generate a DOM tree anyway, and SAX
is faster and far less memory intensive. At any rate, the option above
does seem to work, but only with Xerces (which makes sense, since it
references the apache.org URL). I don't have a problem with using
Xerces, except that this particular application is a downloadable one
and is very small apart from the Xerces jar files, which take up over
1MB. If anyone has info on how I can replace Xerces with a very small
non-validating parser that lets me bypass the use of a DTD or Schema, I
would love to have a URL and info on it!

So how I validate is simple. I turn off the feature (as the above
snippet shows), then use JDOM to read in a few xml tags that will always
be in a given order for the specific XML format I am reading. If they
are present, the XML is "valid" so to speak. I have tested this, and it
works well enough for our application and ensures the user running the
app does not need a network connection to access the URL in the
DOCTYPE. So, besides hopefully pointing me to an alternative small SAX 2
compliant XML parser that avoids DTD/Schema validation, is there
anything I can do to specify a "local" DTD/Schema that I could ship with
the product, instead of having to use a URL-based one? Ideally I
wouldn't mind using the parser validation if I could use a local DTD or
schema. The main point is, the app needs to be able to run and parse XML
without a network connection.
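
For reference, a minimal sketch of the "local DTD" idea using only the
standard SAX API. It assumes an EntityResolver is allowed to answer the
DOCTYPE lookup with a copy shipped in the application; the class name,
the DTD content, and the element names are hypothetical. JDOM's
SAXBuilder appears to accept the same kind of resolver through
setEntityResolver, but that should be checked against the JDOM version
in use.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class LocalDtdExample {
    // Stand-in for a DTD file shipped with the application (hypothetical content).
    static final String LOCAL_DTD =
        "<!ELEMENT report (header)>\n<!ELEMENT header EMPTY>\n";

    // Answers every external entity request with the local copy,
    // so the URL in the DOCTYPE is never fetched.
    static final EntityResolver LOCAL_RESOLVER = new EntityResolver() {
        public InputSource resolveEntity(String publicId, String systemId) {
            return new InputSource(new StringReader(LOCAL_DTD));
        }
    };

    // Validates xml against the local DTD and returns the root element name.
    static String validate(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver(LOCAL_RESOLVER);

        final String[] root = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (root[0] == null) {
                    root[0] = qName; // remember only the first element seen
                }
            }
            public void error(SAXParseException e) throws SAXException {
                throw e; // treat validity errors as failures
            }
        };
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return root[0];
    }
}
```

In a real application the resolver would more likely return an
InputSource opened on a file or classpath resource, and could return an
empty document instead when no validation is wanted at all.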

2) java.lang.OutOfMemoryError:
   I just found this one last night, and it scared me quite a bit. The
app needs to allow multiple XML files to be selected. Some XML files may
be quite large, > 10MB in size, even up to 50MB or more. This will
rarely happen, but it is a potential scenario the app must be able to
handle. When I start the JVM, I am not specifying any memory parameters.
When I select an 8MB XML file, the "validation" method I use (described
above) throws the out of memory error. For an 8MB file??!! That does not
make sense to me. JVMs start up with 64MB RAM usage. How does an 8MB XML
file translate to out of memory?

The process loops through all selected files. On each iteration, I
create a new SAXBuilder object and a new Document object. I would assume
that since I am done with the objects at the end of each iteration, they
get GC'd at some point. The next step of the app, which parses the XML
for "header" data, also creates a new SAXBuilder and Document object and
discards them at the end, and so on. The third and final step is to
parse the XML again, getting the "body" of the XML data.

The error is occurring during the first step. If I select a single XML
file of 8MB or larger, or multiple XML files that total 8MB or more, I
get the out of memory error. When I select seven 1MB files, it's fine.
As soon as I approach 8MB in total size of all selected files, I am out
of memory. It would seem to me that the problem is in my code, but since
I loop through and create then discard the JDOM objects, I am at a loss
as to why this is happening. It would also seem that a single file may
end up using way more memory than is available, but again, this is not
the case because selecting several smaller files that add up to over 8MB
ends up doing the same thing!

Lastly, is there a reason why the JVM does NOT use virtual memory when
it runs out of its allocated memory? The main reason I ask is that the
client machines this app will be installed on may have only 32MB of
physical memory, and we cannot have them upgraded. So, while I am sure
the OS will "swap" memory with the JVM since the JVM starts up using
64MB RAM, why can the JVM itself not swap memory out to allow more than
its startup allocation? Or should I just start it up specifying 1GB of
memory max, 64MB min, and leave it at that? Still, that would NOT
resolve the issue. I don't want to just throw more memory at the
problem; I want to know why it is not working and fix it.
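
To narrow this down, a small sketch that reports the heap figures the
running JVM exposes, so they can be logged before and after each file is
parsed. It only assumes the standard java.lang.Runtime methods; the
class and method names are hypothetical, and the numbers it prints
depend on the JVM and on any -Xms/-Xmx options given at startup.

```java
final class HeapReport {
    private static final long MB = 1024L * 1024L;

    // Returns a one-line summary of current heap usage in megabytes.
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long committed = rt.totalMemory();      // heap currently reserved by the JVM
        long used = committed - rt.freeMemory(); // part of it holding objects
        long max = rt.maxMemory();               // ceiling the heap may grow to
        return "used=" + (used / MB) + "MB, committed=" + (committed / MB)
            + "MB, max=" + (max / MB) + "MB";
    }

    public static void main(String[] args) {
        System.out.println(HeapReport.describe());
    }
}
```

Logging this line once per loop iteration would show whether usage keeps
climbing across files (suggesting references are being retained) or
spikes within a single parse.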

3) SAX vs DOM:
    This is less of a problem and more of a clarification for me. I read
that most parser implementations use SAX to generate the DOM object. I
also read that SAX is event based, reads/parses the XML faster, and uses
far less memory. This leads me to believe using SAXBuilder is the way to
go. What I am confused about is the steps involved. Usually, with SAX
you have to register event handlers. Since JDOM does this for me, does
that mean the entire XML document is read into memory even when using
SAX? Or, as I iterate through the nodes using JDOM, does it use SAX in
the same manner? If JDOM just uses SAX to create a fully loaded XML
Document, then what is the difference (other than speed) when using SAX?
Both seem to result in the entire XML file being read into memory.

In my case, I don't want that. I want to use as little memory as
possible, so that as I "validate" the XML, or read just the header data,
I am not wasting memory or time reading in the entire document. I
thought JDOM would basically be a more "Java" oriented way of doing the
same thing that the current XML APIs do, but would still conserve memory
and keep all the benefits of using SAX. So could someone clarify how
JDOM uses SAX, and whether there is some way to conserve memory, read in
only a part of the XML, etc.? For example, in my XML docs, the "header"
info is always in a node below the root. Is there a way I could specify
that the only data I need at a given point is this node, so that JDOM
will read in only that part of the file, and not the whole thing each
time?
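
To make the question concrete, a sketch of what reading only the header
looks like with plain SAX (no JDOM involved): the handler collects the
text of one element and then aborts the parse by throwing an exception,
so the rest of the file is never processed. The class name, the marker
exception, and the element name passed in are hypothetical, and nested
markup inside the header is not handled.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class HeaderOnlyReader {
    // Marker exception used to stop the parser once the header is complete.
    static final class StopParsing extends SAXException {
        StopParsing() {
            super("header complete");
        }
    }

    // Returns the character data of the first element named headerName.
    static String readHeaderText(InputSource in, final String headerName)
            throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inHeader;

            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals(headerName)) {
                    inHeader = true;
                }
            }

            public void characters(char[] ch, int start, int length) {
                if (inHeader) {
                    text.append(ch, start, length);
                }
            }

            public void endElement(String uri, String localName, String qName)
                    throws SAXException {
                if (qName.equals(headerName)) {
                    throw new StopParsing(); // nothing after this point is needed
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (SAXException e) {
            if (!(e instanceof StopParsing)) {
                throw e; // a real parse error, not the deliberate early exit
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><header>abc</header><body>ignored</body></doc>";
        System.out.println(
            readHeaderText(new InputSource(new StringReader(xml)), "header"));
    }
}
```

Memory use here is bounded by the size of the header text rather than by
the size of the file, which is the trade-off against building a full
Document.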

Thank you very much for any help and the time spent to read my issues. I
appreciate it very much.