[jdom-interest] Fwd: XML 1.1 -- Please stab me with a dull knife and trample my dead body

Rolf Lear jdom at tuis.net
Fri Sep 7 11:48:01 PDT 2012


Hi Wilf.

You are getting your wires crossed. In your mail you referenced parsed
and unparsed entities. These have nothing to do with PCDATA (parsed
character data: regular XML text) or CDATA (unparsed character data:
<![CDATA[ ... ]]> sections).

Michael was answering your question based on the 'entities', whereas you
want the details on the 'PCDATA' and the 'CDATA'.

So, forget about the 'entity' references, and focus on the valid character
data for XML.

The only difference between CDATA (character blocks between <![CDATA[  and
]]> ) and PCDATA (element 'text'), is that the XML Parser will look for
'<' and '&' characters in PCDATA, but not in CDATA.

With the correct escaping, all CDATA content can be expressed as PCDATA
content.
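
A minimal sketch of that escaping in plain Java, for illustration only.
PcdataEscaper and escape are hypothetical names, not JDOM API (an XML
outputter normally does this for you). It handles markup characters only
and does nothing about characters that are not allowed in XML at all.

```java
// Hypothetical helper, not part of JDOM: escapes a string so it can be
// written as ordinary element text (PCDATA) instead of a CDATA section.
final class PcdataEscaper {
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '&': out.append("&amp;"); break;
                // '>' is escaped as well, so the sequence "]]>" can
                // never appear literally in the escaped output.
                case '>': out.append("&gt;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The same content, first as a CDATA section, then as PCDATA.
        String content = "if (a < b && c > d)";
        System.out.println("<foo><![CDATA[" + content + "]]></foo>");
        System.out.println("<foo>" + escape(content) + "</foo>");
    }
}
```

A parser should report the same text for both forms of <foo>.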

This does not help you though, because not all Java 'char' values are
valid XML characters, and thus not all chars are valid as either CDATA
or PCDATA.

In XML 1.0 this distinction was clear.
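
For reference, the XML 1.0 rule is the Char production: #x9 | #xA | #xD |
[#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. A sketch of that
check in plain Java follows. Xml10Chars and its methods are hypothetical
names for illustration and are not JDOM's own verifier.

```java
// Hypothetical helper: checks text against the XML 1.0 Char production.
final class Xml10Chars {
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first char that begins an invalid code
    // point, or -1 if every code point in the text is allowed.
    static int firstInvalidIndex(String text) {
        int i = 0;
        while (i < text.length()) {
            // An unpaired surrogate comes back as its own value
            // (0xD800-0xDFFF), which the production rejects.
            int cp = text.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

Note that most ASCII control characters (everything below 0x20 except
tab, line feed and carriage return) fail this check, which is why an
"any ASCII" processor runs into trouble.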

In XML 1.1 I am not certain how to interpret the difference between
'Chars' and 'RestrictedChars': http://www.w3.org/TR/xml11/#charsets
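
One possible reading of those two productions, offered as a sketch and
not as a ruling: in XML 1.1, Char is [#x1-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF], and RestrictedChar is [#x1-#x8] | [#xB-#xC] |
[#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]. A RestrictedChar is still a Char,
but it may not appear literally in the document; it has to be written as
a character reference such as &#x1;. The class and enum names below are
hypothetical.

```java
// Hypothetical helper: classifies a code point under the XML 1.1
// Char and RestrictedChar productions.
final class Xml11Chars {
    enum Kind { LITERAL_OK, REFERENCE_ONLY, FORBIDDEN }

    // Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isChar(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F]
    //                  | [#x7F-#x84] | [#x86-#x9F]
    static boolean isRestrictedChar(int cp) {
        return (cp >= 0x1 && cp <= 0x8)
            || (cp >= 0xB && cp <= 0xC)
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }

    static Kind classify(int cp) {
        if (!isChar(cp)) {
            return Kind.FORBIDDEN;       // e.g. NUL, lone surrogates
        }
        return isRestrictedChar(cp)
            ? Kind.REFERENCE_ONLY        // only as &#x...; reference
            : Kind.LITERAL_OK;           // may appear as-is
    }
}
```

Under this reading, XML 1.1 admits more characters than 1.0, but NUL
stays forbidden and the control characters still cannot be written raw.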

JDOM takes an XML 1.0 perspective on characters, which may be a problem,
but supporting 1.1 characters would not solve your issue either.
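
Whichever character set JDOM enforces, the caller can clean the text
before handing it to setText. A minimal sketch using the XML 1.0 rules
follows. XmlTextSanitizer and sanitize are hypothetical names, not JDOM
API, and substituting U+FFFD is only one policy; dropping the character
or rejecting the whole input are equally reasonable.

```java
// Hypothetical helper: replaces every code point that XML 1.0 does not
// allow with U+FFFD (the Unicode replacement character).
final class XmlTextSanitizer {
    private static final int REPLACEMENT = 0xFFFD;

    // XML 1.0 Char production.
    private static boolean isAllowed(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            out.appendCodePoint(isAllowed(cp) ? cp : REPLACEMENT);
            // Advance by one char, or two for a valid surrogate pair.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With something like this, e.setText(XmlTextSanitizer.sanitize(any_text))
sits between the two options in the question quoted below. The price is
that the original control characters are lost, so data that must
round-trip exactly needs a different encoding (Base64, for instance).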

Rolf




On Fri, 7 Sep 2012 08:45:33 -0700, Canadian Wilf <canwilf at gmail.com>
wrote:
> Then what is the proper mode:
> 
> Element e = new Element("foo");
> 
> Should I do this:
> 
> e.setText(string_of_sanitized_data_with_illegal_characters_escaped);
> 
> or
> 
> e.setText(any_text);
> 
> 
> Wilf
> 
> 
> On Fri, Sep 7, 2012 at 6:05 AM, Michael Kay <mike at saxonica.com> wrote:
> 
>>  No, that's all wrong. The contents of an unparsed entity are always an
>> external resource; they are never part of a text or attribute node.
>> Parsed entities do become part of the content, but they must always
>> use the XML character set.
>>
>> Michael Kay
>> Saxonica
>>
>> On 07/09/2012 13:10, Canadian Wilf wrote:
>>
>> According to the xml 1.1 spec:
>>
>>  4 Physical Structures ...
>>> [Definition: An *unparsed entity* is a resource whose contents may or
>>> may not be text <http://www.w3.org/TR/xml11/#dt-text>, and if text,
>>> may be other than XML. Each unparsed entity has an associated
>>> notation <http://www.w3.org/TR/xml11/#dt-notation>, identified by
>>> name. Beyond a requirement that an XML processor make the
>>> identifiers for the entity and notation available to the application,
>>> XML places no constraints on the contents of unparsed entities.]
>>
>>
>>
>>  AND
>>
>>> Entities may be either parsed or unparsed. [Definition: The contents
>>> of a *parsed entity* are referred to as its replacement
>>> text <http://www.w3.org/TR/xml11/#dt-repltext>; this text
>>> <http://www.w3.org/TR/xml11/#dt-text> is considered an integral part
>>> of the document.]
>>
>>> [Definition: An *unparsed entity* is a resource whose contents may or
>>> may not be text <http://www.w3.org/TR/xml11/#dt-text>, and if text,
>>> may be other than XML. Each unparsed entity has an associated
>>> notation <http://www.w3.org/TR/xml11/#dt-notation>, identified by
>>> name. Beyond a requirement that an XML processor make the
>>> identifiers for the entity and notation available to the application,
>>> XML places no constraints on the contents of unparsed entities.]
>>> Parsed entities are invoked by name using entity references; unparsed
>>> entities by name, given in the value of *ENTITY* or *ENTITIES*
>>> attributes.
>>
>>
>>
>>  In the current JDOM version, the Element methods setText(String) and
>> addContent(CDATA) refuse text that contains illegal characters. JDOM is
>> treating the data provided as 'parsed' when, by the spec, it should be
>> treating it as free content.
>>
>>  I understand:
>>
>>  1) The XML 1.1 spec defines a parsed entity as its 'replacement
>> text'.
>>
>>  2) 'Replacement text' would refer to the actual textual makeup of a
>> serialized Element, not the data an Element holds in a Text content
>> element.
>>
>>
>>  Then, if the above is true, the current implementation is actually
>> wrong to verify data.
>>
>>  I propose that JDOM stop verifying data set as Element text and CDATA
>> and leave it to Xerces (or whatever parser is in use) to make sure the
>> document is proper 1.1.
>>
>>  Am I understanding everything correctly?
>>
>>  Thoughts?
>>
>>  ---------- Forwarded message ----------
>> From: Canadian Wilf <canwilf at gmail.com>
>> Date: Thu, Sep 6, 2012 at 9:52 PM
>> Subject: XML 1.1 -- Please stab me with a dull knife and trample my
>> dead body
>> To: jdom-interest at jdom.org
>>
>>
>> Hi All,
>>
>>  I just learned that in order to safely use JDOM2, I will need to
>> sanitize my Element .setText(string) so that the parsed data does not
>> contain verboten characters under the XML 1.1 spec.
>>
>>  I have an ascii processor and it needs to be able to use xml as a
>> document format. Unfortunately, not all ascii is allowed in an Element
>> text.
>>
>>  Stab me with a dull knife and trample my dead body. But ..... please
>> please please don't make me sanitize all my data before putting it into
>> XML Elements.
>>
>>  1) It makes my programming task much more cumbersome because I must
>> ensure not to feed any of the new verboten and doomed ascii/UTF-8
>> characters to store as xml text.
>>
>> 2) No one uses xml 1.1, do they?
>>
>>  3) It slows down the parsing (a very small amount) with all the
>> element text checking.
>>
>>  Now that JDOM2 is XML 1.1 compatible, is there any turning back? Can
>> this be undone?
>>
>>  Does everyone understand that their software will bust if data
>> provided as text is not adhering to the new standard?
>>
>>  What about you? How do you deal with it when using the libraries?
>>
>>  Wilf
>>
>>
>>
>> _______________________________________________
>> To control your jdom-interest membership:
>> http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com

