[jdom-interest] Fwd: Re: Simple xhtml/entity resolver?

Rolf Lear jdom at tuis.net
Thu Mar 29 08:25:54 PDT 2012


and I replied to Oliver only, too... hmmm

Rolf

-------- Original Message --------
Subject: Re: [jdom-interest] Simple xhtml/entity resolver?
Date: Thu, 29 Mar 2012 11:15:26 -0400
From: Rolf Lear <jdom at tuis.net>
To: Oliver Ruebenacker <curoli at gmail.com>

Ahh,

In order to discuss the 'entity' processing, you need to be careful about
how you specify the 'location' of the data...

For example, there are three basic 'locations' for content when we
consider JDOM: the 'unparsed XML', the 'JDOM Document', and the 'output'.

Also, when you say &169; do you mean &169; or do you actually mean &#169;?
There is a *big* difference....

When you parse 'unparsed XML' the parser will always translate character
escapes to the actual character, for example, &#169; will become ©. JDOM
will never see the '&#169;'. If, for example, in the 'unparsed XML' file,
you had <root att="&#169;" />, then, when parsed and given to JDOM, you
will always have the single char © as root.getAttributeValue("att").
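
To see the parser-side translation in isolation, here is a minimal sketch
that uses the JDK's built-in DOM parser instead of JDOM (an assumption made
only so that it runs without extra jars; SAXBuilder sits on top of the same
kind of parser). The class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class CharRefDemo {
    // Parse the XML text and return the named attribute of the root element.
    static String parseAttribute(String xml, String name) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        return root.getAttribute(name);
    }

    public static void main(String[] args) throws Exception {
        // Decimal and hexadecimal character references for the same code point.
        String xml = "<root dec='&#169;' hex='&#xA9;' />";
        String dec = parseAttribute(xml, "dec");
        String hex = parseAttribute(xml, "hex");
        // Both arrive as the single character U+00A9; the reference text is gone.
        System.out.println("length of dec value: " + dec.length());
        System.out.println("dec is U+00A9: " + dec.equals("\u00A9"));
        System.out.println("dec equals hex: " + dec.equals(hex));
    }
}
```

Application code never gets a chance to see the '&#169;' form, which is why
there is no JDOM object for it.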

When you output that value from JDOM, JDOM will use the 'charset' of the
output destination to determine whether the © char needs to be escaped.
For example, the following 'program':

		SAXBuilder builder = new SAXBuilder();
		// The parser resolves the character reference before JDOM sees it.
		Document doc = builder.build(new StringReader("<root att='&#169;' />"));
		System.out.println(doc.getRootElement().getAttributeValue("att"));
		XMLOutputter xout = new XMLOutputter();
		xout.output(doc, System.out);

outputs:

©
<?xml version="1.0" encoding="UTF-8"?>
<root att="©" />
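
The effect of the output charset can be sketched as follows. This uses the
JDK's identity Transformer as a stand-in for XMLOutputter (an assumption, so
that it runs on the JDK alone); the exact escape form JDOM emits may differ,
but the principle is the same: a character the target encoding cannot
represent gets written as a character reference. The class and method names
are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class CharsetEscapeDemo {
    // Re-serialize the XML in the given encoding and return the bytes as text.
    static String serialize(String xml, String encoding) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        identity.transform(new StreamSource(new StringReader(xml)),
                           new StreamResult(out));
        return new String(out.toByteArray(), Charset.forName(encoding));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root att='&#169;' />";
        // UTF-8 can represent U+00A9, so the character is written as-is.
        System.out.println(serialize(xml, "UTF-8"));
        // US-ASCII cannot, so the serializer falls back to a character reference.
        System.out.println(serialize(xml, "US-ASCII"));
    }
}
```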


Having said that, you must understand that JDOM *expects* to be given
'un-escaped' data. If you tell JDOM to set the value for attribute 'attb'
to the String '&#169;' then JDOM will do that, and, when you output the
value, it will escape the '&' for you so that the value '&#169;' is
preserved.... for example, if we add the following lines to the above
program:

		doc.getRootElement().setAttribute("attb", "&#169;");
		xout.output(doc, System.out);

the output is now:

©
<?xml version="1.0" encoding="UTF-8"?>
<root att="©" />
<?xml version="1.0" encoding="UTF-8"?>
<root att="©" attb="&amp;#169;" />


So, making sure that we have a good understanding of the concept of
character escapes, you must realize that they are *not* EntityReferences...
you should never see any JDOM object representing a character escape.

On the other hand, if you had the entity reference '&copy;' in your
'unparsed XML', the parser (by default) should have replaced it with the
appropriate character(s) when the document was parsed. Again, JDOM will
see the character © and not the reference '&copy;'. A 'default' parser will
fail to parse a document if it has references that cannot be resolved. If
you change the default parse behaviour (to remove the entity-resolve
process), then instead of the © character, you will have a JDOM EntityRef
with the name 'copy'.

In other words, you have to go out of your way to create EntityRef
instances. If you want to ignore the processes the parser uses to resolve
entities, then you will need to scan the JDOM tree, look for EntityRefs,
and manually replace them with the appropriate Text.... using whatever
strategy you want to use.
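
One possible replacement strategy is a plain lookup table from entity name
to replacement text. The sketch below shows only that lookup step and leaves
the JDOM tree walk out, so it runs on the JDK alone; the class name, the
handful of entries, and the pass-through for unknown names are all
illustrative choices, not anything JDOM provides.

```java
import java.util.HashMap;
import java.util.Map;

class EntityNameTable {
    // A few XHTML entity names mapped to the characters they stand for.
    private static final Map<String, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put("copy", "\u00A9");  // copyright sign
        REPLACEMENTS.put("reg", "\u00AE");   // registered sign
        REPLACEMENTS.put("nbsp", "\u00A0");  // no-break space
        REPLACEMENTS.put("euro", "\u20AC");  // euro sign
        REPLACEMENTS.put("amp", "&");
        REPLACEMENTS.put("lt", "<");
        REPLACEMENTS.put("gt", ">");
        REPLACEMENTS.put("quot", "\"");
    }

    // Return the replacement text for an entity name. Unknown names are
    // handed back in '&name;' form so nothing is silently dropped.
    static String resolve(String name) {
        String text = REPLACEMENTS.get(name);
        return text != null ? text : "&" + name + ";";
    }

    public static void main(String[] args) {
        System.out.println(resolve("copy"));
        System.out.println(resolve("notdefined"));
    }
}
```

While scanning the tree, the name of each EntityRef would be passed to
resolve() and the result wrapped in a Text that takes the EntityRef's place.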



In a more general answer to your original question 'how do I basically
replace a browser', though, what you really want to be doing is a Transform
on your JDOM document, to create an appropriate output for your needs. The
transform you use will depend on what results you want. Have a look at the
XSLTransformer class in JDOM, as well as the various resources on the net
for XSL Transformations.
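
As a rough idea of what such a transform can look like, here is a minimal
sketch that flattens markup to plain text with a stylesheet containing
nothing but an xsl:output method='text' declaration, so XSLT's built-in
rules copy every text node. It uses the JDK's javax.xml.transform API
directly rather than JDOM's wrapper (an assumption, to keep it free of extra
jars), and the class name, method name, and sample input are made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XhtmlToTextDemo {
    // No templates of our own: the built-in rules emit all text content.
    private static final String TEXT_STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "</xsl:stylesheet>";

    // Run the text-only stylesheet over the markup and return the result.
    static String toText(String xhtml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(TEXT_STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The character reference is resolved by the parser before the
        // stylesheet ever runs, so the text output carries the real character.
        System.out.println(toText("<p>Copyright &#169; 2012 <b>JDOM</b> project</p>"));
    }
}
```

A real 'primitive browser' stylesheet would add templates for line breaks,
paragraphs, lists and so on; the empty one only shows the mechanism.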


Rolf



On Thu, 29 Mar 2012 10:28:26 -0400, Oliver Ruebenacker <curoli at gmail.com>
wrote:
> Hello Rolf,
> 
>   I think there is a misunderstanding. I don't want to output as XML.
> I want to render the XHTML as text like a very primitive browser would
> display it.
> 
>   I'm building a String by traversing the tree by calling
> Element.getContent(). For example, a © can be encoded in XML as
> "&copy;". Presumably, the Element tree would contain an EntityRef with
> name "copy". But what if an XML document contains "&169;" or
> "&x00A9;"? What would the EntityRef object look like?
> 
>   Thanks!
> 
>      Take care
>      Oliver
> 
> On Thu, Mar 29, 2012 at 9:46 AM, Rolf Lear <jdom at tuis.net> wrote:
>>
>> Hi Oliver.
>>
>> If you already have the XHTML content as JDOM Elements, then you should
>> be able to (just) do:
>>
>> XMLOutputter xout = new XMLOutputter();
>> String fragment = xout.outputString(element);
>>
>> If you want to change the format of the output (indenting, etc.), you
>> can add a 'Format' to the XMLOutputter with:
>>
>> XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat());
>> String fragment = xout.outputString(element);
>>
>>
>> I think you may be chasing a red herring with the Entity References.
>>
>> The EntityRef code is a 'CYA' implementation, but, in reality, the
>> SystemID and PublicID are never going to be needed in regular usage.
>>
>> The only place I know of where you have entity references is if you
>> specify your input parser should ignore entity-reference lookups when
>> parsing, and in JDOM you will end up with an EntityRef instead of its
>> 'underlying' text.
>>
>> Rolf
>>
>>

