[jdom-interest] Couple of issues I need help on, (memory, sax vs dom, etc)

Thu Jul 18 12:57:37 PDT 2002

Did you consider using an EntityResolver to capture the DTD reference
and replace it?  You can redirect the DTD that the builder is trying
to read to something you control yourself.
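
A minimal sketch of that idea, using only the SAX classes that ship with
the JDK so it runs without any extra jars. The class name, the inline DTD
text and the parseRootName helper are made up for illustration; in a real
application the resolver would more likely return a stream for a DTD file
shipped with the product. As far as I know, JDOM's SAXBuilder accepts the
same resolver through setEntityResolver.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Redirects any external DTD reference to a copy the application controls. */
final class OfflineDtdResolver implements EntityResolver {

    // Hypothetical stand-in for a DTD shipped with the application.
    private static final String LOCAL_DTD =
        "<!ELEMENT report (header, body)>\n"
        + "<!ELEMENT header (#PCDATA)>\n"
        + "<!ELEMENT body (#PCDATA)>\n";

    /** System IDs that were redirected, kept only so the effect is visible. */
    final List<String> redirected = new ArrayList<String>();

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId != null && systemId.endsWith(".dtd")) {
            redirected.add(systemId);
            // Hand the parser the local copy instead of letting it open the URL.
            InputSource local = new InputSource(new StringReader(LOCAL_DTD));
            local.setSystemId(systemId);
            return local;
        }
        return null; // anything else: fall back to the parser's default lookup
    }

    /** Parses the given XML with the resolver installed; returns the root element name. */
    static String parseRootName(String xml, EntityResolver resolver) throws Exception {
        final String[] root = new String[1];
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver);
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (root[0] == null) {
                    root[0] = qName; // remember only the first element seen
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return root[0];
    }
}
```

With this in place, a document whose DOCTYPE points at an unreachable URL
should still parse, because the parser asks the resolver before it tries
the network.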

Stephan

On Thu, Jul 18, 2002 at 12:13:50PM -0700, Duffey, Kevin wrote:
> Hi all,
> 
> I posted a while ago, but I haven't seen it show up? Anyway, I am right
> about at the end of a big project and am having a couple of issues with
> JDOM that I am hoping someone can shed light on. First, is there any
> recent book on the JDOM API that covers SAX 2.0 and so on? I would sure
> love to figure out a few things about it.
> 
> Ok, so below I'll describe each issue and hopefully someone can respond
> to one (or more) of them. I really appreciate any help on these areas.
> 
> 1) XML DTD/Schema validation:
>    For some reason, if the XML file being read in has a DOCTYPE
> indicating a URL to validate the XML file against, and the network
> connection is not available, the XML file cannot be read in and
> parsed at all. I tried creating the Builder with the false option for
> validation, but it seems to have no effect. I also found, I think for
> the latest Xerces 2 parser only, the option:
> 
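> import org.jdom.input.SAXBuilder;
> 
> SAXBuilder builder = new SAXBuilder(false);  // false = no validation
> // Xerces-specific feature: do not fetch the external DTD at all
> builder.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);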
> 
> I read that using SAXBuilder over DOMBuilder is better since all parser
> implementations currently use Sax to generate a DOM tree anyway, and Sax
> is faster and far less memory intensive. At any rate, the option above
> does seem to work, but only with xerces (which makes sense since it
> references the apache.org url). I don't have a problem with using
> Xerces, except that this particular application is a downloadable one
> and is very small except for the inclusion of the Xerces jar files,
> which take up over 1MB. If anyone has info on how I can replace xerces
> with a non-validating parser that is very small and I can bypass the use
> of a DTD or Schema, I would love to have a URL and info on it!
> 
> So how I validate is simple. I turn off the feature (as the above
> snippet shows), then use JDOM to read in a few xml tags that will always
> be in a given order for the specific XML format I am reading. If they
> are present, the XML is "valid" so to speak. I have tested this, and it
> works well enough for our application and ensures the user running the
> app does not have to have a network connection to access the URL in the
> DocType. So, besides hopefully listing an alternative SAX 2 compliant
> small XML parser that avoids dtd/schema validation, is there anything I
> can do to perhaps specify a "local" DTD/Schema that I could ship with
> the product, instead of having to use a URL based one? Ideally I
> wouldn't mind using the parser validation if I could use a local DTD or
> schema. The main point is, the app needs to be able to run and parse XML
> without a network connection.
> 
> 
> 2) java.lang.OutOfMemoryError:
>    This was one I just found last night. Scared me quite a bit. The app
> needs to allow multiple XML selections. Some XML files may be quite
> large, > 10MB in size, even up to 50MB or more. Now, for the most part
> this will rarely happen, but it is a potential scenario the app must be
> able to handle. When I start the JVM, I am not specifying any memory
> parameters. When I select an 8MB xml file, during the "validation"
> method I use (which I described above), it throws the out of memory
> exception. For an 8MB file??!! That does not make sense to me. JVMs
> start up with 64MB RAM usage. How does an 8MB XML file translate to out
> of memory? Now, the process I follow is to loop through all selected files. On
> each iteration, I create a new SAXBuilder object, and a new Document
> object. I would assume since at the end of each iteration I am done with
> the objects, they get GC'd at some point. So the next step of the app,
> which then parses the xml for "header" data, also creates a new
> SAXBuilder and Document object and discards it at the end, and so on.
> The final and 3rd step is to parse the XML again, getting the "body" of
> the xml data. The error is occurring during the first step. If I select
> a single XML 8MB or larger, or multiple XML that equal 8MB or more, I
> get the out of memory error. When I select seven 1MB files, it's fine. As
> soon as I approach the 8MB in total size of all selected files, I am out
> of memory. It would seem to me that it is in my code, but since I loop
> through and create then discard the JDOM objects, I am at a loss as to
> why this is happening. It would also seem that a single file may end up
> using way more memory than is available, but again, this is not the case
> because selecting several smaller files that add up to over 8MB ends up
> doing the same thing! Lastly, is there a reason why the JVM does NOT use
> virtual memory when it runs out of its allocated memory? The main reason
> I ask this is that our client machines that this app will be installed
> on may only have 32MB of physical memory, and we can not have them
> upgrade. So, while I am sure the OS will "swap" memory with the JVM
> since the JVM starts up using 64MB RAM, why can the JVM itself not swap
> memory out to allow more than its startup? Or should I just start it up
> specifying 1GB of memory max, 64MB min, and leave it at that? Still,
> this is NOT resolving the issue. I don't want to just throw more memory
> at the problem, I want to know why it is not working and fix it.
> 
> 3)  Sax vs Dom:
>     This is less of a problem but more of a clarification for me. I read
> that DOM uses SAX to generate the DOM object in most parser
> implementations. I also read that SAX is event based, reads/parses the
> XML faster and uses far less memory. This leads me to believe using
> SAXBuilder is the way to go. But, what I am confused about is the steps
> involved. Usually, using SAX you have to register events. Since JDOM
> does this for me, does that mean the entire XML document is read into
> memory even using SAX? Or as I iterate through the nodes using JDOM, it
> uses SAX in the same manner? If JDOM just uses SAX to create a fully
> loaded XML Document, then what is the difference (other than speed) when
> using SAX? Both seem like they result in the entire XML file being read
> into memory? In my case, I don't want that. I want to use as little
> memory as possible, so that as I "validate" the XML, or read just the
> header data, I am not wasting memory or time in reading in the entire
> document. I thought JDOM would be basically a more "java" oriented way
> of doing the same thing that the current XML APIs do, but would still
> conserve memory and keep all the benefits of using SAX. So could someone
> clarify how JDOM uses SAX, and whether there is some way to conserve
> memory, read in only a part of the XML, etc.? For example, in my XML
> docs, the "header" info is always in a node below the root. Is there a
> way I could specify that the only data I need at a given point is only
> this node, so that JDOM will only read in that part of the file, and not
> the whole thing each time?
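
One way to avoid building the whole tree for a header-only pass is to
skip JDOM for that step and use a plain SAX handler that aborts the parse
as soon as the header element closes. The sketch below is only an
illustration of that technique, not something JDOM does for you: the
element name "header" and the class and method names are assumptions,
and it uses the JDK's bundled SAX parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Collects the text of the first "header" element, then stops the parse. */
final class HeaderOnlyReader extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    private boolean inHeader = false;
    private boolean done = false;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if ("header".equals(qName)) {
            inHeader = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inHeader) {
            text.append(ch, start, length); // keep only text inside the header
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if ("header".equals(qName)) {
            done = true;
            // Throwing from a callback is the usual way to abort a SAX parse.
            throw new SAXException("header read; stopping early");
        }
    }

    /** Returns the header text, or null if the document has no header element. */
    static String readHeader(InputSource source) throws Exception {
        HeaderOnlyReader handler = new HeaderOnlyReader();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (SAXException e) {
            if (!handler.done) {
                throw e; // a genuine parse error, not our deliberate abort
            }
        }
        return handler.done ? handler.text.toString() : null;
    }

    /** Convenience overload for in-memory XML. */
    static String readHeader(String xml) throws Exception {
        return readHeader(new InputSource(new StringReader(xml)));
    }
}
```

Memory use here stays roughly constant regardless of file size, since
nothing after the header is scanned and no document tree is built. The
same SAX handler approach could serve for the "validation" pass that only
checks the first few tags.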
> 
> Thank you very much for any help and the time spent to read my issues. I
> appreciate it very much.
> 
> 

-- 
[------------ Stephan Trebels <stephan at ncube.de>, Consultant -----------]
company: nCUBE Deutschland GmbH, Hanauer Str. 56, 80992 Munich, Germany
phone: cell:+49 172 8433111  office:+49 89 149893 0  fax:+49 89 149893 50