[jdom-interest] Simple xhtml/entity resolver?

Paul Libbrecht paul at hoplahup.net
Thu Mar 29 14:24:26 PDT 2012


Oliver,

I'm curious, did you ever actually get an EntityRef?
In my experience, no SAXBuilder gives you one... Also, the parser turns any numeric reference into the corresponding character.
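
A minimal sketch of that last point, assuming the JDK's built-in DOM parser as a stand-in for the SAX parser underneath SAXBuilder (the class and method names are made up for illustration). It shows that numeric references and the predefined &amp; never reach the application as references:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of JDOM.
class NumericRefDemo {

    // Parses an XML fragment and returns the text content of its root element.
    static String textOf(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        // The hex reference and &amp; are already plain characters in the result.
        String text = textOf("<p>&#x00a9; 2012 by Dewey, Cheetham &amp; Howe</p>");
        System.out.println(text.equals("\u00a9 2012 by Dewey, Cheetham & Howe"));
    }
}
```

So only named entities that need a DTD (such as &copy;) can be a problem at all.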

Still, I tried to answer your question and could not.
Looking at XMLOutputter, I saw that it simply outputs the entity reference itself (that is: the ampersand, the name, a semicolon), and indeed the EntityRef object carries no information that would let you "resolve" it.

The last step, entity resolution, is actually the job of the DTD.
The entity declarations for XHTML are among the reasons for the XHTML DTD's enormous weight. If I remember correctly, MathML has an entity definition table that may be easier to process (it is also available as XML, in case that helps).

Also, beware if you want to parse XHTML:
- with a DTD, and without some "public/private catalog" pointing at a local copy, the DTD gets loaded from W3C very slowly (and W3C starts denying requests after a while)
- without the DTD, all entity references are broken.
... maybe you don't parse it yourself?
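
One way to avoid the slow W3C download is a SAX EntityResolver that answers from a local copy. The following is only a sketch under assumptions: LocalDtdResolver, register and parseText are hypothetical names, the DTD text is held in memory rather than in a real catalog, and the JDK's DOM parser stands in for whatever parser you use (SAXBuilder also accepts an EntityResolver via setEntityResolver).

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical helper: serves DTDs from memory, keyed by public ID,
// so the parser never has to go to the network.
class LocalDtdResolver implements EntityResolver {

    private final Map<String, String> dtdByPublicId = new HashMap<>();

    void register(String publicId, String dtdText) {
        dtdByPublicId.put(publicId, dtdText);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String dtdText = dtdByPublicId.get(publicId);
        // Unknown DTD: hand back an empty one instead of fetching it remotely.
        // Entities declared only in that DTD will then be undefined.
        InputSource source =
                new InputSource(new StringReader(dtdText != null ? dtdText : ""));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    // Parses a document with this resolver; returns the root element's text content.
    String parseText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setEntityResolver(this);
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }
}
```

In a real setup you would register the full text of the XHTML DTD and of the entity files it pulls in, each under its own public ID; a one-line stand-in DTD such as `<!ENTITY copy "&#169;">` is enough to try the idea out.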

All in all, may I conjecture that the EntityRef objects are actually created programmatically? If so, you need to expand them in your own code using the table mentioned above (could be a nice contribution).
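
A rough sketch of such a table-driven expansion, deliberately tiny: EntityTable and expand are hypothetical names, and only the five predefined XML entities plus a handful of common XHTML ones are listed. A complete version would load the full XHTML or MathML entity table instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup table from entity name to replacement text.
class EntityTable {

    private static final Map<String, String> ENTITIES = new HashMap<>();

    static {
        // The five entities predefined by XML itself.
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("quot", "\"");
        ENTITIES.put("apos", "'");
        // A few frequently used XHTML entities (illustrative subset only).
        ENTITIES.put("nbsp", "\u00a0");
        ENTITIES.put("copy", "\u00a9");
        ENTITIES.put("reg", "\u00ae");
        ENTITIES.put("euro", "\u20ac");
        ENTITIES.put("hellip", "\u2026");
    }

    // Returns the replacement text for a named entity, or the reference
    // as written (&name;) when the name is not in the table.
    static String expand(String name) {
        String value = ENTITIES.get(name);
        return value != null ? value : "&" + name + ";";
    }
}
```

In the converter quoted below, the EntityRef branch could then append EntityTable.expand(ref.getName()) instead of the EntityRef object itself.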

hope it helps.

paul


On 29 March 2012 at 18:54, Oliver Ruebenacker wrote:

>     Hello,
> 
>  Thanks for all the advice, but it seems I did not make myself
> sufficiently clear.
> 
>  My situation is this: someone else already parsed XHTML and gave me
> the JDOM element that represents a fragment of it.
> 
>  Let us say the original fragment looks something like this:
> 
>  "<p><b>&copy; 2012</b> by <em>Dewey, Cheetham &amp; Howe</em></p>"
>  "<p><b>&#169; 2012</b> by <em>Dewey, Cheetham &#38; Howe</em></p>"
>  "<p><b>&#x00a9; 2012</b> by <em>Dewey, Cheetham &#x0026; Howe</em></p>"
> 
>  I never get to see that fragment, but instead an object of type
> Element. What I want to get is a String that looks roughly like this:
> 
>  "© 2012 by Dewey, Cheetham & Howe"
> 
>  A simple lightweight solution that is roughly acceptable in most
> simple cases is fine for my purpose.
> 
>  So I am trying a recursive method that iterates over
> Element.getContent() and then I am wondering what to do if the content
> happens to be EntityRef?
> 
> package cbit.vcell.model.summaries;
> 
> import org.jdom.Comment;
> import org.jdom.DocType;
> import org.jdom.Element;
> import org.jdom.EntityRef;
> import org.jdom.ProcessingInstruction;
> import org.jdom.Text;
> 
> public class XHTMLToPlainTextConverter {
> 
> 	public static String convert(Element element) {
> 		String text = "";
> 		for(Object content : element.getContent()) {
> 			if(content instanceof Comment) {
> 				// ignore
> 			} else if(content instanceof DocType) {
> 				// ignore
> 			} else if(content instanceof Element) {
> 				Element childElement = (Element) content;
> 				text = text + convert(childElement);
> 			} else if(content instanceof EntityRef) {
> 				EntityRef ref = (EntityRef) content;
> 				text = text + ref; // ???
> 			} else if(content instanceof ProcessingInstruction) {
> 				// ignore
> 			} else if(content instanceof Text) {
> 				Text childText = (Text) content;
> 				text = text + childText.getText();
> 			} else {
> 				// ignore, should not happen
> 			}
> 		}
> 		return text;
> 	}
> 	
> }
> 
>  Thanks!
> 
>     Take care
>     Oliver
> 
> On Thu, Mar 29, 2012 at 12:19 PM, Chris Pratt <thechrispratt at gmail.com> wrote:
>> Another option I've used in the past is changing the underlying SAX parser
>> that JDOM uses to TagSoup ( http://ccil.org/~cowan/XML/tagsoup/).  That
>> parser is tuned for parsing HTML that is not fully XML-compliant.
>> 
>>   (*Chris*)
>> 
>> On Thu, Mar 29, 2012 at 8:47 AM, Olivier Jaquemet
>> <olivier.jaquemet at jalios.com> wrote:
>>> 
>>> Hi Oliver,
>>> 
>>> JDom is a great tool for parsing XML...
>>> 
>>> ... but for an XHTML fragment (which may not be completely XHTML compliant
>>> ... ?)
>>> and especially for text extraction, I would strongly suggest JSoup
>>> http://jsoup.org/
>>> 
>>>  String text = org.jsoup.Jsoup.parse(html).text();
>>> 
>>> Whatever your HTML is, it will work like a charm (even if it is an ugly
>>> WYSIWYG copy-paste from Word or an ugly HTML export from whatever website)
>>> 
>>> Olivier
>>> 
>>> 
>>> On 29/03/2012 15:23, Oliver Ruebenacker wrote:
>>>> 
>>>>      Hello,
>>>> 
>>>>   I need a simple way to convert some XHTML fragments, provided as a
>>>> JDOM Element, into plain text. I am willing to ignore most HTML tags
>>>> and consider only the most commonly used predefined entities.
>>>> 
>>>>   In JDOM, an entity reference has a name, a public id and a system
>>>> id. I think I know what the name means, for named entities. But what
>>>> about numeric entities, how do I get the code point? And what are
>>>> public id and system id?
>>>> 
>>>>   Thanks!
>>>> 
>>>>      Take care
>>>>      Oliver
>>>> 
>>> 
>>> --
>>> Olivier Jaquemet<olivier.jaquemet at jalios.com>
>>> Ingénieur R&D Jalios S.A. - http://www.jalios.com/
>>> @OlivierJaquemet +33970461480
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> To control your jdom-interest membership:
>>> http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com
>> 
> 
> 
> 
> -- 
> Oliver Ruebenacker, Computational Cell Biologist
> Virtual Cell (http://vcell.org)
> SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org)
> http://www.oliver.curiousworld.org



