[jdom-interest] Fwd: XML 1.1 -- Please stab me with a dull knife and trample my dead body

Fri Sep 7 13:22:29 PDT 2012

     Hello,

On Fri, Sep 7, 2012 at 3:17 PM, Canadian Wilf <canwilf at gmail.com> wrote:
> Let's focus on valid character data for xml. How to do this:
>
> String s = someRandomBytesNowAsString();

  Java Strings are not actually random bytes. A String is a sequence of
UTF-16 code units (Java chars), if I remember correctly.
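
  To illustrate, here is a minimal sketch using only the JDK (the class
name, the helper and the sample bytes are invented for this mail). Bytes
only become a String once they are decoded with some charset, and the
result is a sequence of 16-bit chars. Two of the four chars below (U+0000
and U+0007) are not allowed by the XML 1.0 Char production.

```java
import java.nio.charset.StandardCharsets;

public class BytesToString {
    // Hypothetical helper: decode raw bytes one-to-one into chars.
    // ISO-8859-1 maps every byte value n to the char U+00nn.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = { 0x41, 0x00, 0x07, (byte) 0xE9 };
        String s = decode(raw);
        // Print each UTF-16 code unit held by the String.
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("char %d = U+%04X%n", i, (int) s.charAt(i));
        }
    }
}
```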

> Element e = new Element("random");
> e.setText(s); // or: e.addContent(new CDATA(s));
>
> Currently this will fail.

  Sorry, you lost me here. How will this fail? Will it throw an
exception? Or will it otherwise do something undesired?

  Maybe I'm missing something, but it sounds to me as if you are
referring to specs that apply to XML character streams and not to JDOM
objects.
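
  For reference, the character rule in question is the Char production
of XML 1.0. Below is a rough stand-alone sketch of that check. It is not
JDOM's code, and the class and method names are made up here; it only
shows which code points the production admits.

```java
public class XmlChars {
    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Index of the first char that starts a disallowed code point, or -1 if none.
    // An unpaired surrogate is returned by codePointAt as-is and is rejected.
    static int firstInvalidIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidIndex("plain text"));     // prints -1
        System.out.println(firstInvalidIndex("bell\u0007here")); // prints 4
    }
}
```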

     Take care
     Oliver

> ... Which seems wrong because I should be able to
> send whatever data I want as text in xml content.
>
> What use is xml (1.0 or 1.1) if I cannot represent various data? Is the
> solution to make a custom escaper for my data?
>
> e.setText(encodeSpecial(s)) and decodeSpecial(e.getText())
>
> Crazy!
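
  One possible shape for such an escaper, as a sketch only: the method
names encodeSpecial and decodeSpecial are taken from the mail above, but
the Base64 approach and the class name are assumptions, not something
JDOM provides. Each char is packed into two bytes and the bytes are
Base64-encoded, so the element text contains only characters that XML
allows.

```java
import java.util.Base64;

public class SpecialCodec {
    // Pack every 16-bit char into two bytes (big-endian), then Base64 the bytes.
    static String encodeSpecial(String s) {
        byte[] bytes = new byte[s.length() * 2];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            bytes[2 * i] = (byte) (c >> 8);
            bytes[2 * i + 1] = (byte) c;
        }
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Reverse of encodeSpecial: Base64-decode, then rebuild each char from two bytes.
    static String decodeSpecial(String encoded) {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        char[] chars = new char[bytes.length / 2];
        for (int i = 0; i < chars.length; i++) {
            chars[i] = (char) (((bytes[2 * i] & 0xFF) << 8) | (bytes[2 * i + 1] & 0xFF));
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String raw = "a\u0000b\u0007c";
        String safe = encodeSpecial(raw);
        System.out.println(safe);
        System.out.println(decodeSpecial(safe).equals(raw)); // prints true
    }
}
```

  The price is that the stored text is larger and no longer readable by
a human or by other tools that expect plain text.
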
>
> Wilf
>
>
> On Fri, Sep 7, 2012 at 11:48 AM, Rolf Lear <jdom at tuis.net> wrote:
>>
>>
>> Hi Wilf.
>>
>> You are getting your wires crossed... In your mail you referenced parsed
>> and external entities. These have nothing to do with PCDATA (parsed
>> character data - regular XML text), and CDATA (unparsed character data -
>> <![CDATA[ ... ]]> ).
>>
>> Michael was answering your question based on the 'entities', whereas you
>> want the details on the 'PCDATA' and the 'CDATA'.
>>
>> So, forget about the 'entity' references, and focus on the valid character
>> data for XML.
>>
>> The only difference between CDATA (character blocks between <![CDATA[  and
>> ]]> ) and PCDATA (element 'text'), is that the XML Parser will look for
>> '<' and '&' characters in PCDATA, but not in CDATA.
>>
>> With the correct escaping, all CDATA content can be expressed as PCDATA
>> content.
>>
>> This does not help you though, because not all Java 'char' values are
>> valid XML characters (most control characters and unpaired surrogates
>> are excluded), and thus not all chars are valid as either CDATA or
>> PCDATA.
>>
>> In XML 1.0 this distinction was clear.
>>
>> In XML 1.1 I am not certain how to interpret the difference between
>> 'Chars' and 'RestrictedChars': http://www.w3.org/TR/xml11/#charsets
>>
>> JDOM takes a 1.0 perspective on Characters... which may be a problem, but
>> it is not going to solve your issues even if it supports 1.1 chars.
>>
>> Rolf
>>
>>
>>
>>
>> On Fri, 7 Sep 2012 08:45:33 -0700, Canadian Wilf <canwilf at gmail.com>
>> wrote:
>> > Then what is the proper mode:
>> >
>> > Element e = new Element("foo");
>> >
>> > Should I do this:
>> >
>> > e.setText(string_of_sanitized_data_with_illegal_characters_escaped);
>> >
>> > or
>> >
>> > e.setText(any_text);
>> >
>> >
>> > Wilf
>> >
>> >
>> > On Fri, Sep 7, 2012 at 6:05 AM, Michael Kay <mike at saxonica.com> wrote:
>> >
>> >>  No, that's all wrong. The contents of an unparsed entity are always an
>> >> external resource; they are never part of a text or attribute node.
>> >> Parsed entities do become part of the content, but they must always use
>> >> the XML character set.
>> >>
>> >> Michael Kay
>> >> Saxonica
>> >>
>> >> On 07/09/2012 13:10, Canadian Wilf wrote:
>> >>
>> >> According to the xml 1.1 spec:
>> >>
>> >>  4 Physical Structures ...
>> >>> [Definition: An *unparsed entity* is a resource whose contents may or
>> >>> may not be text <http://www.w3.org/TR/xml11/#dt-text>, and if text,
>> >>> may be other than XML. Each unparsed entity has an associated
>> >>> notation <http://www.w3.org/TR/xml11/#dt-notation>, identified by
>> >>> name. Beyond a requirement that an XML processor make the
>> >>> identifiers for the entity and notation available to the application,
>> >>> XML places no constraints on the contents of unparsed entities.]
>> >>
>> >>
>> >>
>> >>  AND
>> >>
>> >>  Entities may be either parsed or unparsed. [Definition: The contents
>> >>> of a *parsed entity* are referred to as its replacement text
>> >>> <http://www.w3.org/TR/xml11/#dt-repltext>; this text
>> >>> <http://www.w3.org/TR/xml11/#dt-text> is considered an integral part
>> >>> of the document.]
>> >>> Parsed entities are invoked by name using entity references; unparsed
>> >>> entities by name, given in the value of *ENTITY* or *ENTITIES*
>> >>>  attributes.
>> >>
>> >>
>> >>
>> >>  In the current JDOM version, the Element methods setText(string) and
>> >> addContent(CDATA) refuse text that contains illegal characters. JDOM is
>> >> treating the data provided as 'parsed' when, by the spec, it should be
>> >> treating it as free content.
>> >>
>> >>  I understand:
>> >>
>> >>  1) The xml 1.1 spec defines a parsed entity as its 'replacement text'.
>> >>
>> >>  2) 'Replacement text' would refer to the actual textual makeup of a
>> >> serialized Element, not the data an Element holds in a Text content
>> >> element.
>> >>
>> >>
>> >>  Then, if the above is true, the current implementation is actually
>> >> wrong to verify data.
>> >>
>> >>  I propose that JDOM stop verifying data set as Element text and CDATA
>> >> and leave it to xerces (or whatever parser is used) to make sure the
>> >> document is proper 1.1.
>> >>
>> >>  Am I understanding everything correctly?
>> >>
>> >>  Thoughts?
>> >>
>> >>  ---------- Forwarded message ----------
>> >> From: Canadian Wilf <canwilf at gmail.com>
>> >> Date: Thu, Sep 6, 2012 at 9:52 PM
>> >> Subject: XML 1.1 -- Please stab me with a dull knife and trample my
>> >> dead body
>> >> To: jdom-interest at jdom.org
>> >>
>> >>
>> >> Hi All,
>> >>
>> >>  I just learned that in order to safely use JDOM2, I will need to
>> >> sanitize my Element .setText(string) so that the parsed data does not
>> >> contain verboten characters under the XML 1.1 spec.
>> >>
>> >>  I have an ascii processor and it needs to be able to use xml as a
>> >> document format. Unfortunately, not all ascii is allowed in an Element
>> >> text.
>> >>
>> >>  Stab me with a dull knife and trample my dead body. But ... please
>> >> please please don't make me sanitize all my data before putting it into
>> >> XML Elements.
>> >>
>> >>  1) It makes my programming task much more cumbersome because I must
>> >> ensure not to feed any of the new verboten and doomed ascii/UTF-8
>> >> characters to store as xml text.
>> >>
>> >> 2) No one uses xml 1.1, do they?
>> >>
>> >>  3) It slows down the parsing (a very small amount) with all the
>> >> element text checking.
>> >>
>> >>  Now that JDOM2 is xml 1.1 compatible, is there any turning back? Can
>> >> this be undone?
>> >>
>> >>  Does everyone understand that their software will bust if data
>> >> provided as text is not adhering to the new standard?
>> >>
>> >>  What about you? How do you deal with it when using the libraries?
>> >>
>> >>  Wilf
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> To control your jdom-interest membership:
>> >> http://www.jdom.org/mailman/options/jdom-interest/youraddr@yourhost.com
>> >>
>> >>
>> >>
>
>
>

-- 
Scientific Developer at PanGenX (http://www.pangenx.com)

"Stagnation and the search for truth are always opposites." - Nadezhda
Tolokonnikova