[jdom-interest] Re: Manipulating a very large XML file

Jason Hunter jhunter at xquery.com
Wed Mar 16 22:22:42 PST 2005


Jason Robbins wrote:

> Oh, I agree.  As a computer scientist, I know that 8*N and N/2 are
> both O(N), so from that point of view, it really doesn't matter.
> In the long run, as N continues to grow, people absolutely need
> to switch to a database approach.

Yes, exactly.  I'm sure you'd also agree that if you can change either 
the coefficient or the exponent, you change the exponent first: turn the 
O(N) into O(1), then worry about the performance of the O(1).  That's 
why I would encourage people interested in supporting larger XML data 
sets to devote their efforts to the database approach.

However, if that doesn't interest you, then yes, there is improvement 
to be had by changing the coefficient, and that does have practical 
benefit.

>>Or if you want a commercial grade solution, look at Mark Logic.  You can 
>>get a 30 day trial that supports data sets up to 1G 
>>(http://xqzone.marklogic.com).  The official product goes four orders of 
>>magnitude larger than that.  It's really fun.
> 
> Cool.  Hardcore!
> 
> If a dataset contains gigabytes, doesn't that make it more likely
> that the results of a given query could be tens of megabytes?

In my experience, result size generally isn't proportional to the size 
of the content set.

O'Reilly, for example, is loading all of their book and article content 
into Mark Logic.  It's a fair bit of content, but typical results are 
scoped to a size appropriate for human consumption (multiples of 
chapters and sections).

But yes, as you predict, some queries may return multi-megabyte 
answers.  Custom book printing is a good example, where the output is a 
large XSL-FO document.  From what I see, result size depends more on 
the nature of the query than on the size of the content set.

> In a relational database, the RDBMS can return a large rowset
> as a stream, and the application goes through it row-by-row.
> If an XML query results in a big nodelist, that could certainly
> be streamed.  But, if it results in a big sub-tree, doesn't
> that need to be represented in RAM in an efficient way?

Yes, absolutely, that's a case where a more memory-efficient 
implementation would come in handy.  Normally XQuery returns a sequence 
of XML nodes that you handle much as you handle rows in a relational 
model (pull based).  The less memory it takes to handle each node, the 
more efficient your processing, and if any single node is large, that's 
exactly where a compact in-memory representation pays off.
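To make the row-like handling concrete, here's a rough sketch using the 
standard StAX (JSR-173) pull API.  The "result" element name and the 
file source are just placeholders for whatever your query actually 
returns, and it assumes each result is a text-only element:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class PullResults {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                factory.createXMLStreamReader(new FileInputStream(args[0]));

            // Pull one event at a time; only the node under the cursor
            // has to live in memory, much like stepping through rows.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "result".equals(reader.getLocalName())) {
                    // handle one result node, then move on
                    System.out.println(reader.getElementText());
                }
            }
            reader.close();
        }
    }

Only the node currently under the cursor has to be held in memory, 
which is what keeps the handling cheap no matter how large the result 
sequence gets.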

In designing JDOM there's always been a tradeoff between features and 
memory footprint, and we've tried to strike a middle ground.  It sounds 
like you're thinking of preserving the features but changing the 
representation, probably trading time for memory?
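In practice, the middle ground often looks something like this sketch: 
build the full JDOM tree so you keep all the features, then unlink each 
child once you've handled it so the garbage collector can reclaim its 
subtree.  The "record" and "title" names are just placeholders:

    import java.io.File;
    import java.util.Iterator;
    import org.jdom.Document;
    import org.jdom.Element;
    import org.jdom.input.SAXBuilder;

    public class DetachAsYouGo {
        public static void main(String[] args) throws Exception {
            SAXBuilder builder = new SAXBuilder();
            Document doc = builder.build(new File(args[0]));
            Element root = doc.getRootElement();

            // Handle the children one at a time, then unlink each one
            // so its subtree becomes garbage: a little extra time in
            // exchange for a smaller live set.
            Iterator itr = root.getChildren("record").iterator();
            while (itr.hasNext()) {
                Element record = (Element) itr.next();
                System.out.println(record.getChildText("title"));
                itr.remove();  // removes the element from its parent
            }
        }
    }

It trades a bit of time (and code) for memory, which is about the only 
lever you have once you've decided to keep the full tree API.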

>>Here's a screencast I did with Jon Udell showing off XQuery to 
>>manipulate some O'Reilly books in docbook format:
>>   http://weblog.infoworld.com/udell/2005/02/15.html#a1177
> 
> Very cool.  I definitely need to learn more about xquery.

I think you'll enjoy it.  Let me know if you have questions.  There's a 
vendor-neutral mailing list at http://xquery.com and a Mark Logic 
specific one at http://xqzone.marklogic.com.

-jh-


