[jdom-interest] RE: Using JDOM to manipulate HTML

Robertson, Jacob Jacob.Robertson at argushealth.com
Wed Feb 13 07:27:25 PST 2002


To get a JDOM document from an HTML file, I created a class for
my company that uses the org.w3c.tidy package (JTidy).  I've included a
dumbed-down cut-and-paste version of the class (I can't really post
proprietary code from my company), but this code shows the basic idea.
What I've excluded is the handling of some limitations of Tidy, which
doesn't correctly handle things like some comments, and

Anyway -- here's the code...

import org.jdom.*;
import org.jdom.input.*;
import org.jdom.output.*;

import org.w3c.dom.*;
import org.w3c.tidy.*;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// etc ...

public class DocBuilder {

public static org.jdom.Document buildDocument(File file) throws
IOException, JDOMException {
    org.w3c.dom.Document domHtmlDoc = getHTMLDocument(file);
    org.jdom.Document jdomDoc = getJDOMDoc(domHtmlDoc);
    return jdomDoc;
}
private static org.w3c.dom.Document getHTMLDocument(File file) throws
IOException {
    FileInputStream fin = new FileInputStream(file);
    try {
        Tidy tidy = new Tidy();
        tidy.setMakeClean(false);
        //tidy.setQuiet(true);
        tidy.setShowWarnings(true); //tidy.setShowWarnings(false);
        tidy.setTidyMark(false);      // don't add the Tidy generator meta tag
        tidy.setNumEntities(true);    // numeric rather than named entities
        tidy.setQuoteAmpersand(true);
        tidy.setQuoteMarks(true);
        tidy.setQuoteNbsp(false);
        tidy.setHideEndTags(false);
        tidy.setDropEmptyParas(false);
        // A null output stream means the tidied markup is not written out.
        return tidy.parseDOM(fin, null);
    } finally {
        // Close the stream even if parsing throws.
        fin.close();
    }
}
private static org.jdom.Document getJDOMDoc(org.w3c.dom.Document doc) {
    DOMBuilder db = new DOMBuilder();
    org.jdom.Document jdomDoc = db.build(doc);
    return jdomDoc;
}
}
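
The handling of Tidy's trouble with some comments is not shown above.
As a rough sketch only, and assuming the comments are not needed in the
resulting document, one possible workaround is to strip them from the
HTML text before it reaches Tidy.  The class and method names below are
hypothetical and are not part of the original class; the stream returned
by toStream() could be passed to tidy.parseDOM() in place of the
FileInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper (an assumption, not the author's code): removes
// HTML comments from the markup before it is handed to Tidy.
public class HtmlCommentStripper {

    // Matches <!-- ... --> non-greedily; DOTALL lets a comment span lines.
    // Caveat: this also removes comments used to hide script or style
    // bodies, so it only suits pages where those can be discarded.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Returns the markup with every comment removed.
    public static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    // Wraps the comment-free markup in a stream, encoded as UTF-8
    // (the encoding is an assumption; adjust it to match the input).
    public static InputStream toStream(String html) {
        return new ByteArrayInputStream(
                stripComments(html).getBytes(StandardCharsets.UTF_8));
    }
}
```

A regular expression is a blunt tool for this; it is only meant to show
the idea of cleaning the input up front rather than fixing the DOM
afterwards.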


