[jdom-interest] XML Schema classification help

Michael Kay mike at saxonica.com
Wed Jan 4 15:05:12 PST 2012


On 04/01/2012 19:11, cliff palmer wrote:
> I need to examine XML documents contained in multiple columns in a 
> database table with over a million rows and identify each of the 
> different structures used for the XML data, producing a count of the
> number of instances that use each structure.
>
> I thought of using the SAXParser then creating a list of the XML 
> headers in the order used and storing each unique list and 
> accumulating a count based on matching an already encountered list 
> object, but I am hoping there is a less cumbersome approach.
>
> I would appreciate any and all suggestions.
>
You've chosen an odd place to ask the question, since there's nothing 
specific in JDOM that will help you.

The key thing you need to do is to define the rules for your
taxonomy. Presumably it's something more complex than categorizing 
documents by the name of their root element, or the namespaces they use. 
But presumably a document with four paragraphs and two images and one 
with five paragraphs and no images go in the same bucket. So what are 
the rules?
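
To make the question concrete, here is a minimal sketch of one possible
rule, not a recommendation: two documents fall in the same bucket when
they contain the same set of distinct element paths, ignoring how often
each element repeats. The class name StructureCounter and its methods
are made up for illustration; only the standard JAXP SAX API is used,
and reading the XML out of the database columns is left out.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: counts documents per structural "fingerprint".
class StructureCounter {
    // fingerprint -> number of documents seen with that fingerprint
    private final Map<String, Integer> counts = new TreeMap<>();

    // Assumed rule: the fingerprint is the sorted set of distinct element
    // paths (namespace URI plus local name at each step), one per line.
    static String fingerprint(String xml) {
        final TreeSet<String> paths = new TreeSet<>();
        final Deque<String> stack = new ArrayDeque<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    // Extend the parent's path with this element's name.
                    String parent = stack.isEmpty() ? "" : stack.peek();
                    String path = parent + "/{" + uri + "}" + localName;
                    stack.push(path);
                    paths.add(path); // repeats collapse into one entry
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    stack.pop();
                }
            });
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return String.join("\n", paths);
    }

    // Record one document under its fingerprint.
    void add(String xml) {
        counts.merge(fingerprint(xml), 1, Integer::sum);
    }

    Map<String, Integer> counts() {
        return counts;
    }
}
```

Under this particular rule a document with four paragraphs and two
images matches one with one paragraph and one image, but a document
with five paragraphs and no images lands in a different bucket, because
the image path is missing from its set. If those two should count as
the same structure, the rule has to say which elements are optional,
which is exactly the decision that has to be made first.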

Michael Kay
Saxonica

