[jdom-interest] Fwd: XML 1.1 -- Please stab me with a dull knife and trample my dead body

Fri Sep 7 16:29:15 PDT 2012

So, I have been studying up on the Char and RestrictedChar productions 
in the XML 1.1 spec.

My personal feeling is that the RestrictedChars mechanism for specifying 
the document format is somewhat complicated, but I now believe I have 
'grokked' it. It all boils down to these four constraints:

1. There are two sets of Characters defined for XML:

Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | 
[#x86-#x9F]

RestrictedChar is a subset of Char

2. A well-formed *unparsed* XML 1.1 document is defined as:

document ::= ( prolog element Misc* ) - ( Char* RestrictedChar Char* )

3. prolog, element, and Misc are all (indirectly) constrained to 'Char' 
based characters.

4. Character and entity references must resolve to data from the 'Char' 
set... http://www.w3.org/TR/xml11/#sec-references

Based on the four statements above, a well-formed document consists of 
a prolog (which may be empty), a root element (which must exist), and 
then optional comments, PIs, and whitespace. Further, no restricted 
chars are allowed anywhere in the *unparsed* document.
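
The two character classes from constraint 1 can be written down as plain code point checks. This is only a minimal sketch: the class name Xml11Chars and its method names are made up for illustration and are not part of JDOM or any parser API.

```java
// Hypothetical helper, not part of JDOM: the XML 1.1 character classes
// expressed as checks on Unicode code points.
public final class Xml11Chars {
    private Xml11Chars() {}

    // Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isChar(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F]
    //                  | [#x7F-#x84] | [#x86-#x9F]
    public static boolean isRestrictedChar(int cp) {
        return (cp >= 0x1 && cp <= 0x8)
            || (cp >= 0xB && cp <= 0xC)
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }

    // A code point may appear literally in the unparsed document only if
    // it is a Char and not a RestrictedChar.
    public static boolean isAllowedRaw(int cp) {
        return isChar(cp) && !isRestrictedChar(cp);
    }
}
```

With these checks, Xml11Chars.isAllowedRaw(0x9) is true (tab), while Xml11Chars.isAllowedRaw(0x1) is false even though Xml11Chars.isChar(0x1) is true.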

But a big difference between XML 1.0 and 1.1 is that the Char set for 
1.1 is larger than the one for 1.0 (it includes [#x1-#xD7FF] instead of 
'just' #x9 | #xA | #xD | [#x20-#xD7FF]; the higher ranges are the same).

So, XML 1.1 includes all the low-value control characters except #x0, 
but it *restricts* them (apart from #x9, #xA and #xD) from appearing 
*raw* in the unparsed document. It goes even further and also restricts 
the following chars, which XML 1.0 allowed raw, in the *unparsed* 
document: [#x7F-#x84] | [#x86-#x9F].

In XML 1.1, though, you can use a character reference such as &#x1; to 
include these restricted chars in the document.
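
As a rough illustration of that rule, the sketch below rewrites every RestrictedChar in a string as a hexadecimal character reference. The class and method names are hypothetical, and it deliberately does nothing else: it does not escape '&' or '<', and it does not reject characters outside the Char set.

```java
// Hypothetical helper, not part of JDOM: replaces XML 1.1 RestrictedChar
// code points with character references such as &#x1;
public final class RestrictedCharEscaper {
    private RestrictedCharEscaper() {}

    // RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F]
    //                  | [#x7F-#x84] | [#x86-#x9F]
    static boolean isRestrictedChar(int cp) {
        return (cp >= 0x1 && cp <= 0x8)
            || (cp >= 0xB && cp <= 0xC)
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }

    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isRestrictedChar(cp)) {
                // Emit a hex character reference instead of the raw char.
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                // Everything else is copied through unchanged.
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

For example, RestrictedCharEscaper.escape("a\u0001b") gives the string a&#x1;b, while tabs, line feeds and carriage returns pass through untouched.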

Unfortunately for you, Wilf, XML 1.1 still makes the following Java char 
values illegal as XML characters: 0x0000, unpaired surrogates in 
0xD800-0xDFFF, 0xFFFE, and 0xFFFF.
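
To find such values in a Java String before handing it to an XML library, something along these lines might do. It is a sketch only, with made-up class and method names; it checks membership in the XML 1.1 Char set and nothing more, so restricted chars still count as legal here because they can be written as character references.

```java
// Hypothetical helper, not part of JDOM: locates the first Java char that
// has no representation at all in XML 1.1.
public final class IllegalXmlCharFinder {
    private IllegalXmlCharFinder() {}

    /** Returns the index of the first illegal char, or -1 if there is none. */
    public static int firstIllegalIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            // For an unpaired surrogate, codePointAt returns the surrogate
            // value itself, which falls outside the Char ranges below.
            int cp = text.codePointAt(i);
            boolean legal = (cp >= 0x1 && cp <= 0xD7FF)
                         || (cp >= 0xE000 && cp <= 0xFFFD)
                         || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (!legal) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

A properly paired surrogate (a supplementary character) is accepted, because codePointAt combines the pair into a single code point at or above 0x10000.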

JDOM 2.x follows JDOM 1.x and allows the set of characters defined for 
XML 1.0.

This is likely a problem. Unfortunately, there is no easy way for 
JDOM to 'infer' whether it is working with an XML 1.0 or 1.1 document.

Perhaps this needs some thought.

Rolf

On 07/09/2012 2:48 PM, Rolf Lear wrote:
>
> Hi Wilf.
>
> You are getting your wires crossed..... In your mail you referenced parsed
> and external entities. These have nothing to do with PCDATA (parsed
> character data - regular XML text), and CDATA (unparsed character data -
> <![CDATA[ ... ]]> )
>
> Michael was answering your question based on the 'entities', whereas you
> want the details on the 'PCDATA' and the 'CDATA'.
>
> So, forget about the 'entity' references, and focus on the valid character
> data for XML.
>
> The only difference between CDATA (character blocks between <![CDATA[  and
> ]]> ) and PCDATA (element 'text'), is that the XML Parser will look for
> '<' and '&' characters in PCDATA, but not in CDATA.
>
> With the correct escaping, all CDATA content can be expressed as PCDATA
> content.
>
> This does not help you though, because not all Java 'char' characters are
> valid Unicode characters, and thus not all chars are valid as either CDATA
> or PCDATA.
>
> In XML 1.0 this distinction was clear.
>
> In XML 1.1 I am not certain how to interpret the difference between
> 'Chars' and 'RestrictedChars': http://www.w3.org/TR/xml11/#charsets
>
> JDOM takes a 1.0 perspective on Characters... which may be a problem, but
> it is not going to solve your issues even if it supports 1.1 chars.
>
> Rolf
>
>
>
>
> On Fri, 7 Sep 2012 08:45:33 -0700, Canadian Wilf <canwilf at gmail.com>
> wrote:
>> Then what is the proper mode:
>>
>> Element e = new Element("foo");
>>
>> Should I do this:
>>
>> e.setText(string_of_sanitized_data_with_illegal_characters_escaped);
>>
>> or
>>
>> e.setText(any_text);
>>
>>
>> Wilf
>>
>>
>> On Fri, Sep 7, 2012 at 6:05 AM, Michael Kay <mike at saxonica.com> wrote:
>>
>>>   No, that's all wrong. The contents of an unparsed entity are always an
>>> external resource; they are never part of a text or attribute node.
>>> Parsed entities do become part of the content, but they must always use
>>> the XML character set.
>>>
>>> Michael Kay
>>> Saxonica
>>>
>>> On 07/09/2012 13:10, Canadian Wilf wrote:
>>>
>>> According to the xml 1.1 spec:
>>>
>>>   4 Physical Structures ...
>>>> [Definition: An *unparsed entity* is a resource whose contents may or
>>>> may not be text <http://www.w3.org/TR/xml11/#dt-text>, and if text, may
>>>> be other than XML. Each unparsed entity has an associated
>>>> notation <http://www.w3.org/TR/xml11/#dt-notation>,
>>>> identified by name. Beyond a requirement that an XML processor make the
>>>> identifiers for the entity and notation available to the application,
>>>> XML places no constraints on the contents of unparsed entities.]
>>>
>>>
>>>
>>>   AND
>>>
>>>>   Entities may be either parsed or unparsed. [Definition: The contents of
>>>> a *parsed entity* are referred to as its replacement
>>>> text <http://www.w3.org/TR/xml11/#dt-repltext>;
>>>> this text <http://www.w3.org/TR/xml11/#dt-text> is considered an
>>>> integral part of the document.]
>>>>
>>>> [Definition: An *unparsed entity* is a resource whose contents may or
>>>> may not be text <http://www.w3.org/TR/xml11/#dt-text>, and if text, may
>>>> be other than XML. Each unparsed entity has an associated
>>>> notation <http://www.w3.org/TR/xml11/#dt-notation>,
>>>> identified by name. Beyond a requirement that an XML processor make the
>>>> identifiers for the entity and notation available to the application,
>>>> XML places no constraints on the contents of unparsed entities.]
>>>> Parsed entities are invoked by name using entity references; unparsed
>>>> entities by name, given in the value of *ENTITY* or *ENTITIES*
>>>> attributes.
>>>
>>>
>>>
>>>   In the current JDOM version, the Element methods setText(string) and
>>> addContent(CDATA) refuse text that contains illegal characters. They are
>>> treating the data provided as 'parsed' when by the spec they should be
>>> treating it as free content.
>>>
>>>   I understand:
>>>
>>>   1) The xml 1.1 spec defines a parsed entity as its 'replacement text'.
>>>
>>>   2) 'Replacement text' would refer to the actual textual makeup of a
>>> serialized Element, not the data an Element holds in a Text content
>>> element.
>>>
>>>
>>>   Then, if the above is true, the current implementation is actually
>>> wrong to verify data.
>>>
>>>   I propose that JDOM stop verifying data set as Element text and CDATA
>>> and leave it to Xerces (or whatever parser is in use) to make sure the
>>> document is proper 1.1.
>>>
>>>   Am I understanding everything correctly?
>>>
>>>   Thoughts?
>>>
>>>   ---------- Forwarded message ----------
>>> From: Canadian Wilf <canwilf at gmail.com>
>>> Date: Thu, Sep 6, 2012 at 9:52 PM
>>> Subject: XML 1.1 -- Please stab me with a dull knife and trample my
>>> dead body
>>> To: jdom-interest at jdom.org
>>>
>>>
>>> Hi All,
>>>
>>>   I just learned that in order to safely use JDOM2, I will need to
>>> sanitize the strings I pass to Element.setText(string) so that the
>>> parsed data does not contain verboten characters under the XML 1.1 spec.
>>>
>>>   I have an ASCII processor and it needs to be able to use xml as a
>>> document format. Unfortunately, not all ASCII is allowed in an Element
>>> text.
>>>
>>>   Stab me with a dull knife and trample my dead body. But ..... please
>>> please please don't make me sanitize all my data before putting it into
>>> XML Elements.
>>>
>>>   1) It makes my programming task much more cumbersome because I must
>>> ensure not to feed any of the new verboten and doomed ascii/UTF-8
>>> characters to store as xml text.
>>>
>>> 2) No one uses xml 1.1, do they?
>>>
>>>   3) It slows down the parsing (a very small amount) with all the
>>> element text checking.
>>>
>>>   Now that JDOM2 is xml 1.1 compatible, is there any turning back? Can
>>> this be undone?
>>>
>>>   Does everyone understand that their software will bust if data
>>> provided as text is not adhering to the new standard?
>>>
>>>   What about you? How do you deal with it when using the libraries?
>>>
>>>   Wilf
>>>
>>>
>>>
>>> _______________________________________________
>>> To control your jdom-interest membership:
>>> http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com
>>>