[jdom-interest] BUG: XMLOutputter inserts extra empty lines

Bradley S. Huffman hip at a.cs.okstate.edu
Tue Nov 27 18:00:23 PST 2001


> This just goes to prove the adage that all whitespace handling in XML is a
> pain.

Yes it is!

This post goes to the question: what does newlines really imply?  It
sounds easy at first.  We just use newlines, normalize, and indent to take:

  <payroll>
  <employee><firstname>     Brad</firstname><lastname>Huffman</lastname>
  </employee>
  <employee><firstname>John     </firstname>
  <lastname>Doe</lastname> </employee>
  </payroll>

To make it aesthetically pleasing as in:

  <payroll>
      <employee>
          <firstname>Brad</firstname>
          <lastname>Huffman</lastname>
      </employee>
      <employee>
          <firstname>John</firstname>
          <lastname>Doe</lastname>
      </employee>
  </payroll>

But there are 4 situations where we can have text content: between
<start><start>, <start></end>, </end><start>, and </end></end> tags.
For the most common case of short text content between a start and end tag,
a single line is what we want because it looks best.

          <firstname>Brad</firstname>
          <lastname>Huffman</lastname>

But then in cases like:

  </employee>
                    Some       randomly spaced  text    <employee>

With newlines ON, how should this be printed?  As is? With
leading/trailing whitespace trimmed? With leading/trailing whitespace
trimmed and text aligned with </employee> or <employee>? Something else?
It all depends on how newlines is defined.
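
To make the candidates concrete, here is a minimal sketch in plain Java (no
JDOM involved) of three treatments of that text: as is, trimmed, and
normalized. The class and method names are made up for illustration, and I
am assuming "whitespace" means the four XML 1.0 whitespace characters.

```java
class WhitespaceChoices {
    // XML 1.0 whitespace characters: space, tab, carriage return, line feed.
    // Assumption: this is the set the outputter should treat as whitespace.
    private static final String WS = "[ \\t\\r\\n]+";

    /** Removes leading and trailing XML whitespace; interior runs are kept. */
    static String trim(String text) {
        return text.replaceAll("\\A" + WS + "|" + WS + "\\z", "");
    }

    /** Trims, then collapses each interior whitespace run to one space. */
    static String normalize(String text) {
        return trim(text).replaceAll(WS, " ");
    }

    public static void main(String[] args) {
        // The text found between </employee> and <employee> in the example.
        String text = "\n                    Some       randomly spaced  text    ";
        System.out.println("as is:      [" + text + "]");
        System.out.println("trimmed:    [" + trim(text) + "]");
        System.out.println("normalized: [" + normalize(text) + "]");
    }
}
```

None of the three says anything about where the text lands relative to the
surrounding tags, which is the part that depends on how newlines is defined.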

With newlines ON and normalize ON, leading/trailing whitespace is
currently semantically insignificant and we are free to add/remove it
to produce the desired alignment (hmmm, that's not quite true).  But
what if normalization is OFF?  Should leading/trailing whitespace be
insignificant in all cases, or only in some cases? It gets confusing quickly!

Right now it seems text content between tags is insignificant
ONLY if it is empty or all whitespace (when newlines is ON).  Which
means the example above will be printed "as is" with newlines ON and
normalization OFF, which is kind of ugly for a pretty-print mode.  Even
with normalization ON the last tag <employee> will be printed on the same
line as the text while its corresponding end tag (assuming non-empty
content) is aligned with the previous end tag, again ugly IMHO.

Hmmm, the above paragraph isn't really true either. Try setting newlines
to true, indent to "xxxx" (so you can see where indentation is added), and
normalize to false. You'll get a line separator and indentation after text
that is empty, and before and after text that is all whitespace. Very weird
behavior.

After careful thought, I propose undeprecating textTrim and defining the
following modes for XMLOutputter, basically using the premise that turning
newlines ON means we care more about how it looks than the semantic meaning
of whitespace.  For the most part everything stays the same as what we have
now (or would expect to have) except for the cases with text between
<tag><tag>, </tag><tag>, or </tag></tag> when newlines and
trimming/normalizing are ON.

     Default: 
          No content is added to or removed from an element's content.

     textTrim:
          Leading/trailing whitespace is insignificant and can be removed.
          With newlines ON, whitespace might be added back to fit alignment
          needs.

     textNormalize:
          Same as textTrim, but interior whitespace is compressed to
          a single space.

     newlines (textTrim and textNormalize OFF):
          Empty content or "whitespace ONLY" content between tags is
          insignificant. Text content that contains one or more
          non-whitespace characters is left untouched and no
          leading/trailing whitespace is added/removed.

     newlines (textTrim or textNormalize ON):
         Case of <tag>text</tag>:
              Start tag, text, and end tag are printed on single line with
              trimming/normalization of text.

         Case of <tag>text<tag>, </tag>text<tag>, </tag>text</tag>:
              Preceding tag, text, and following tag are aligned. Text is
              trimmed/normalized before alignment.
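
As a rough way to check that these rules don't contradict each other, here
is a sketch in plain Java of the decision they describe. It is only my
reading of the list above, not XMLOutputter code: the enum names and
layoutFor() are hypothetical, and dropping whitespace-only text when
trimming/normalizing is ON is an assumption (it trims down to nothing).

```java
class ProposedModesSketch {
    /** Which tags surround the text; names are hypothetical. */
    enum Position { START_END, START_START, END_START, END_END }

    /** Where the text would be placed; names are hypothetical. */
    enum Layout {
        INLINE_UNTOUCHED,   // printed exactly as stored
        INLINE_PROCESSED,   // trimmed/normalized, no line breaks added
        DROPPED,            // treated as insignificant
        SAME_LINE_AS_TAGS,  // <tag>text</tag> on one line
        OWN_ALIGNED_LINE    // text on its own line, aligned with the tags
    }

    /** True for empty text or text made only of XML whitespace. */
    static boolean isAllWhitespace(String text) {
        return text.matches("[ \\t\\r\\n]*");
    }

    static Layout layoutFor(String text, Position pos,
                            boolean newlines, boolean trimOrNormalize) {
        if (!newlines) {
            // Default, textTrim, textNormalize: nothing is added for layout.
            return trimOrNormalize ? Layout.INLINE_PROCESSED
                                   : Layout.INLINE_UNTOUCHED;
        }
        if (isAllWhitespace(text)) {
            // Stated for newlines alone; assumed when trimming is also ON.
            return Layout.DROPPED;
        }
        if (!trimOrNormalize) {
            // Text with non-whitespace characters is left untouched.
            return Layout.INLINE_UNTOUCHED;
        }
        // newlines plus textTrim or textNormalize.
        return pos == Position.START_END ? Layout.SAME_LINE_AS_TAGS
                                         : Layout.OWN_ALIGNED_LINE;
    }

    public static void main(String[] args) {
        System.out.println(layoutFor("Brad", Position.START_END, true, true));
        System.out.println(layoutFor("  Some   text ", Position.END_START, true, true));
    }
}
```

The only case that changes relative to today is the last return: text that
is not between a start tag and its own end tag gets a line to itself.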

Some other possible modes might be:

    canonical:
         See http://www.w3.org/TR/xml-c14n.  Even though I think it would
         be better to have a converter to transform the Document itself,
         XMLOutputter is already so close to outputting canonical form
         that it might be worth it to have both.

    line wrap or text wrap?
         Wrap a line after so many chars, or maybe just wrap text.
         Might help with some HTML/XHTML, or might this functionality be
         better left to something like HTML Tidy?

    alignText:
        Treat all text content like tags and align it. For example,
        <name>Bradley S. Huffman</name> could become:

             <name>
                 Bradley S. Huffman
             </name>
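
A quick sketch of what alignText might do for a single text-only element,
in plain Java. The method name, its parameters, and the choice to trim the
text first are all assumptions, and it skips escaping of characters such
as < and & that a real outputter would have to handle.

```java
class AlignTextSketch {
    /** Repeats the indent string n times. */
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    /**
     * Hypothetical alignText: puts the start tag, the text, and the end tag
     * on separate lines, with the text one indent level deeper than the tags.
     */
    static String alignText(String name, String text, String indent, int level) {
        String outer = repeat(indent, level);
        String inner = outer + indent;
        return outer + "<" + name + ">\n"
             + inner + text.trim() + "\n"   // assumption: text is trimmed first
             + outer + "</" + name + ">";
    }

    public static void main(String[] args) {
        System.out.println(alignText("name", "Bradley S. Huffman", "    ", 1));
    }
}
```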
         
And the possibilities go on and on. Feedback?

Brad


