[jdom-interest] Cleaning up EOS marker

Szegedi, Attila szegedi at scriptum.hu
Tue Jun 26 06:14:13 PDT 2001


Excuse me for being grumpy about this topic on the list, but here goes:

Given that Java and XML *actually* support Unicode, filtering out all
characters > 126 and calling it a "cleanup" is a BAD practice that will
make your application unusable as soon as a document containing a non-ASCII
character comes along. This kind of Unicode ignorance rears its head quite
often in environments where people believe that the English alphabet is the
only alphabet. I remember reading several months ago, in a column of a
technical magazine that shall remain unnamed (although you could easily
guess it's published in the US), that "You can handle UTF-8 as if it were
ASCII, just skip the first three bytes".
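
To make the damage concrete, here is a minimal sketch that pushes UTF-8
bytes through the same rule as the quoted filter (keep TAB, LF, CR and
32..126, blank everything else). The class name AsciiFilterDemo and the
helper roundTrip are made up for this illustration, and it assumes Java 7
or later for StandardCharsets.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of JDOM or of the quoted post.
class AsciiFilterDemo {
    // Same rule as the quoted filter: keep TAB, LF, CR, EOF and 32..126,
    // replace every other byte value with a space.
    static int filter(int b) {
        if (b == 9 || b == 10 || b == 13 || b == -1) {
            return b;
        }
        return (b < 32 || b > 126) ? ' ' : b;
    }

    // Encode the text as UTF-8, run each byte through the filter, decode again.
    static String roundTrip(String text) throws IOException {
        InputStream in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int b;
        while ((b = filter(in.read())) != -1) {
            out.write(b);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Pure ASCII survives unchanged.
        System.out.println(roundTrip("<name>Szegedi Attila</name>"));
        // U+0151 is two bytes in UTF-8 (0xC5 0x91), so it becomes two spaces.
        System.out.println(roundTrip("<city>Gy\u0151r</city>"));
    }
}
```

The second line comes out as "<city>Gy  r</city>": the document still
parses, but the data in it has been silently changed.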

The XML spec and conforming applications take care to handle international
characters (to the point that the spec treats an XML document with no
encoding declaration and no byte order mark as UTF-8 encoded), so I felt
these few lines of Unicode advocacy were in order here.
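
If the goal is only to get rid of stray control characters, one
Unicode-friendly alternative is to filter decoded characters rather than raw
bytes, and to replace only what the XML 1.0 Char production forbids. What
follows is a rough sketch under stated assumptions, not a drop-in fix: the
class name XmlCharCleanupReader and the helper clean are hypothetical, the
caller is assumed to know the document's encoding when creating the
underlying Reader, and surrogate code units are passed through on the
assumption that they arrive in valid pairs.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical class for illustration; not part of JDOM.
class XmlCharCleanupReader extends FilterReader {
    XmlCharCleanupReader(Reader in) {
        super(in);
    }

    // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
    // Supplementary characters appear here as surrogate pairs, so surrogate
    // code units are let through (assumption: they are correctly paired).
    static boolean isAllowed(int ch) {
        return ch == 0x9 || ch == 0xA || ch == 0xD
            || (ch >= 0x20 && ch <= 0xD7FF)
            || (ch >= 0xD800 && ch <= 0xDFFF)
            || (ch >= 0xE000 && ch <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int ch = in.read();
        return (ch == -1 || isAllowed(ch)) ? ch : ' ';
    }

    @Override
    public int read(char[] buf, int offset, int length) throws IOException {
        int count = in.read(buf, offset, length);
        // count is -1 at end of stream, in which case the loop does not run.
        for (int i = offset; i < offset + count; i++) {
            if (!isAllowed(buf[i])) {
                buf[i] = ' ';
            }
        }
        return count;
    }

    // Small helper for trying the filter on a string.
    static String clean(String text) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new XmlCharCleanupReader(new StringReader(text))) {
            int ch;
            while ((ch = r.read()) != -1) {
                sb.append((char) ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // The NUL character is replaced by a space; U+0151 is left alone.
        System.out.println(clean("<city>Gy\u0151r\u0000</city>"));
    }
}
```

Note that handing the parser a Reader bypasses its own encoding detection,
so this only makes sense when the encoding is known up front.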

Attila.


> -----Original Message-----
> From: jdom-interest-admin at jdom.org
> [mailto:jdom-interest-admin at jdom.org]On Behalf Of Paul Philion
> Sent: Tuesday, June 26, 2001 2:41 PM
> To: jdom-interest at jdom.org
> Subject: [jdom-interest] Cleaning up EOS marker
>
>
> Tim -
>
> Here's my "CleanUpInputStream". Nothing fancy, but it works for me. I
> recommend wrapping a BufferedInputStream around the input
> before this...
>
> InputStream in = new CleanUpInputStream(new BufferedInputStream(new
> FileInputStream("fileName")));
>
> ----
>
> import java.io.FilterInputStream;
> import java.io.IOException;
> import java.io.InputStream;
>
> public class CleanUpInputStream extends FilterInputStream {
>  public CleanUpInputStream(InputStream in) {
>   super(in);
>  }
>
>  public int read() throws IOException {
>   int ch = in.read();
>   if (ch == 10 || ch == 13 || ch == 9 || ch == -1) {
>    return ch;
>   }
>   else if (ch < 32 || ch > 126) {
>    return ' ';
>   }
>   return ch;
>  }
>
>  public int read(byte[] data, int offset, int length) throws IOException {
>   int result = in.read(data, offset, length);
>   // Only touch the bytes actually read; result is -1 at end of stream.
>   for (int i = offset; i < offset + result; i++) {
>    int ch = data[i] & 0xFF; // bytes are signed, so mask to 0..255
>    if (ch == 10 || ch == 13 || ch == 9) {
>     // nothing
>    }
>    else if (ch < 32 || ch > 126) {
>     data[i] = (byte)' ';
>    }
>   }
>   return result;
>  }
> }
>
>



