[jdom-interest] Feature Request

Chris B. chris at tech.com.au
Thu Feb 19 19:51:19 PST 2004


Thanks for that! It gives me another one to try.

For what I'm doing I need the best HTML parser I can lay my hands on, 
and the more cruddy HTML it can grok without blowing up, the happier I 
will be. Do you think it is better than NekoHTML and JTidy?

I'll also take whatever patches you've got as well for testing.

Dennis Sosnoski wrote:

> I'd suggest instead using TagSoup 
> (http://www.ccil.org/~cowan/XML/tagsoup). It implements its own SAX2 
> parser for HTML, so doesn't interfere with anything else in your 
> system. The only downside I've noticed is that the handling it uses to 
> turn HTML into XHTML can go berserk in some cases of real-world HTML, 
> such as <script> and <style> elements within the <body> (it properly 
> tries to force them into a <head> element, so you end up with multiple 
> <head>s and <body>s). I've figured out how to easily patch it to get 
> around some of these issues, so let me know if you run into problems.
>
>  - Dennis
>
> Chris B. wrote:
>
>> Jeremy.Prellwitz at siras.com wrote:
>>
>>> It is not NekoHTML that I'm worried about.
>>
>> I'm worried about it because I suspect I will have to do some major 
>> work on either NekoHTML or JTidy for a project I'm working on, and I 
>> want to understand the situation as clearly as possible, because if 
>> that happens I *may* have an opportunity to fix Neko properly.
>>
>>> It is parsing regular XML documents in the same webapp.
>>
>> According to the Neko web site....
>> " The Xerces2 implementation dynamically instantiates the default 
>> parser configuration to construct parser objects via the Jar service 
>> facility. The Jar file |nekohtmlXni.jar| contains a 
>> |META-INF/services| file that is read by Xerces2 implementation for 
>> this purpose."
>>
>> If I understand this correctly, if you don't use nekohtmlXni.jar, 
>> then you won't have the problem?
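One way to check that guess is to list which jars on the classpath carry such a services file. The sketch below is only a diagnostic, and it assumes the entry is named after the Xerces2 `XMLParserConfiguration` interface; the class and method names (`ServiceFileProbe`, `findConfigurations`) are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

public class ServiceFileProbe {

    // Assumed name of the services entry that Xerces2 consults.
    static final String SERVICE =
        "META-INF/services/org.apache.xerces.xni.parser.XMLParserConfiguration";

    /** Returns one "location -> class name" entry per non-comment line found. */
    static List<String> findConfigurations(ClassLoader loader, String resource)
            throws IOException {
        List<String> found = new ArrayList<>();
        Enumeration<URL> urls = loader.getResources(resource);
        while (urls.hasMoreElements()) {
            URL url = urls.nextElement();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    // Skip blank lines and comments in the services file.
                    if (!line.isEmpty() && !line.startsWith("#")) {
                        found.add(url + " -> " + line);
                    }
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        List<String> found = findConfigurations(loader, SERVICE);
        if (found.isEmpty()) {
            System.out.println("No parser configuration registered via " + SERVICE);
        }
        for (String entry : found) {
            System.out.println(entry);
        }
    }
}
```

If a jar such as nekohtmlXni.jar shows up in the output of a webapp that also parses plain XML, that would be consistent with the behaviour Jeremy describes; an empty result would suggest the cause lies elsewhere.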
>>
>>
>>> Basically, NekoHTML interferes with the creation of Xerces parsers.
>>> When I create a SAXBuilder object, it creates a parser that uses the
>>> HTML configuration set up by NekoHTML. If I could create my own Xerces
>>> parser, instantiate it with the specific standard configuration class
>>> that it needs, and then pass it into the constructor of the SAXBuilder
>>> object, then I wouldn't have to worry about the SAXBuilder object
>>> creating a parser on its own that uses the HTML configuration set up
>>> by NekoHTML.
>>>
>>>
>>> -jeremy
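The shape of what Jeremy is asking for can be sketched without JDOM at all: the caller constructs the XMLReader and hands it to the consumer, so the consumer never goes looking for a parser on its own. In the sketch below, `ElementLister` is a hypothetical stand-in for a builder and is not JDOM's SAXBuilder; the reader comes from the JDK's JAXP factory purely so the example is self-contained, whereas in Jeremy's case it would presumably be a Xerces parser built with an explicit configuration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class InjectedReaderDemo {

    /** Hypothetical builder stand-in: it uses only the reader it is given. */
    static class ElementLister {
        private final XMLReader reader;

        ElementLister(XMLReader reader) {
            this.reader = reader;
        }

        /** Parses the XML text and returns element names in document order. */
        List<String> list(String xml) throws Exception {
            final List<String> names = new ArrayList<>();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    names.add(qName);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
            return names;
        }
    }

    /** The caller, not the builder, decides which parser gets created. */
    static XMLReader newPlainXmlReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        ElementLister lister = new ElementLister(newPlainXmlReader());
        System.out.println(lister.list("<a><b/><c/></a>")); // prints [a, b, c]
    }
}
```

Because the reader arrives through the constructor, swapping in a differently configured parser touches only the call site and needs no subclass.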
>>>
>>>
>>> From: "Chris B." <chris at tech.com.au>
>>> Date: 02/19/2004 05:55 PM
>>> To: Jeremy.Prellwitz at siras.com
>>> cc: jdom-interest at jdom.org
>>> Subject: Re: [jdom-interest] Feature Request
>>>
>>> As much as I think it's a good idea, how would it help you directly,
>>> since NekoHTML doesn't seem to conform to XMLReader? (Which seems to be
>>> its problem).
>>>
>>>
>>> Jeremy.Prellwitz at siras.com wrote:
>>>
>>>> This is what I was trying to describe, just without mentioning it as
>>>> specifically/concisely as you just did. I wouldn't have brought up my
>>>> own little issue if I didn't think that passing in your own XMLReader
>>>> instance could be useful to others. It seems like a simple enough
>>>> change to the SAXBuilder.java class, and coincidentally, it would
>>>> smooth out my code a little bit. :-)
>>>>
>>>> -jeremy
>>>>
>>>>> It seems to me that supplying your own XMLReader is a sensible enough
>>>>> activity that it deserves a proper method or constructor in
>>>>> SAXBuilder to pass it in.
>>>>
>>>> From: "Chris B." <chris at tech.com.au>
>>>> Date: 02/19/2004 05:00 PM
>>>> To: Jason Hunter <jhunter at xquery.com>
>>>> cc: Jeremy.Prellwitz at siras.com, jdom-interest at jdom.org
>>>> Subject: Re: [jdom-interest] Feature Request
>>>>
>>>> Jason Hunter wrote:
>>>>
>>>>> Sounds like nekohtml is being a Bad Citizen, but I think you can do
>>>>> exactly what you want by subclassing SAXBuilder and overriding
>>>>> createParser().
>>>>
>>>> It seems to me that supplying your own XMLReader is a sensible enough
>>>> activity that it deserves a proper method or constructor in SAXBuilder
>>>> to pass it in.
>>>>
>>>> _______________________________________________
>>>> To control your jdom-interest membership:
>>>> http://lists.denveronline.net/mailman/options/jdom-interest/youraddr@yourhost.com 



