From jdom at tuis.net Thu Mar 1 07:50:16 2012
From: jdom at tuis.net (Rolf Lear)
Date: Thu, 01 Mar 2012 10:50:16 -0500
Subject: [jdom-interest] JDOM2 BETA 2 Released - 0.0.2 - Leap Day - and future version numbering
In-Reply-To: <4F4EF92D.3050205@tuis.net>
References: <4F4EF92D.3050205@tuis.net>
Message-ID: <6f35c7ffd2a0d68cca643f8292331648@tuis.net>

And there's the maven artifact:
http://search.maven.org/#artifactdetails|org.jdom|jdom2|0.0.2-BETA|jar

I would appreciate it if people could think about the version numbers of JDOM2. The situation is a little more complicated when you consider maven, but, in essence, I think I would like to keep JDOM and JDOM2 as different artifacts in maven. I don't want people with 'liberal' specifications for maven artifacts to suddenly download JDOM2 into their project if I add the new JDOM2 version to the same jdom artifact.

So, the logical thing to do is create the jdom2 artifact, and put JDOM2 content in there. This is what I did with this BETA release.

But, should I call the upcoming JDOM2 release 'JDOM2 version 1.0.0' or should I call it 'JDOM 2.0.0'?

I think I will continue to 'isolate' the JDOM2 maven deployment as the 'jdom2' artifact, but even that is debatable. Funny how such a small detail can be so complicated ... ;-)

I think I am 'tending' to want to release the JDOM2 release as jdom-2.0.0, but I will do the maven deploy to the artifact 'jdom2'.

Does anyone have any suggestions or questions about this?

Rolf

On Wed, 29 Feb 2012 23:21:01 -0500, Rolf Lear wrote:
> Happy "Leap Day"
>
> JDOM2's second Beta release is available.
>
> Like previous releases, this is available for download from github.
> https://github.com/hunterhacker/jdom/downloads
>
> Specifically, the second beta release is:
> https://github.com/downloads/hunterhacker/jdom/jdom2-0.0.2-BETA.zip
>
> The Github Javadoc, JUnit, and Coverage pages have all been updated to
> match this new BETA release:
> https://github.com/hunterhacker/jdom/wiki/JDOM-2.0#wiki-links
>
> A few things are different this time:
> - I have changed the 'naming' convention - now jdom2-0.0.2-BETA. This is
> to satisfy maven-central.
> - speaking of which, I am releasing this to the 'jdom2' artifact on
> maven-central. Expect the jdom2 artifact to arrive in a few hours

From jdom at tuis.net Thu Mar 1 16:36:39 2012
From: jdom at tuis.net (Rolf Lear)
Date: Thu, 01 Mar 2012 19:36:39 -0500
Subject: [jdom-interest] JDOM2 performance
Message-ID: <4F501617.9020101@tuis.net>

Hi all.

As an exercise in a few things, I have run a number of comparative performance tests, comparing JDOM2 (Beta 2) against JDOM 1.1.3.

I put together a performance 'test harness' to track the changes in the performance metrics for JDOM2, and I have taken that harness, stripped out those things which are not available in JDOM 1.1.3, and then run the system on both code bases.

I ran the tests in Java5, Java6, and Java7.

I took the performance harness, cloned it for JDOM 1.1.3, and 'backported' it. I did the same types of changes to the JDOM2 harness to make them equivalent.

I compiled both harnesses with Java6 using a Java5 class target. For the JDOM 1.1.3 test I linked in the JDOM 1.1.3 jar (which is compiled with Java5, and targets Java 1.2). For the JDOM2 test I linked in the Beta2 code (compiled with Java6 targeting Java 5).

I then took those code bases, and ran them using the Java5, 6, and 7 JREs.

You can see the results here:
http://hunterhacker.github.com/jdom/jdom2/performanceJDKBeta2.html

From that page you can see a number of interesting things: firstly, the XPath expression '//.' is much, much, much faster than '//node()' (in jaxen).
The second item is the jump in memory footprint from Java5 to Java6.

You can see that most JDOM2 operations are slightly slower than JDOM 1.1.3, except XPath processing, which is much, much faster. At face value it would seem the slower performance is all related to a slower JDOM class initialization.... I will look into that.

I figured the results were interesting, though, and there may be some benefit for others.

Rolf

From jdom at tuis.net Sun Mar 11 14:17:21 2012
From: jdom at tuis.net (Rolf Lear)
Date: Sun, 11 Mar 2012 17:17:21 -0400
Subject: [jdom-interest] Resolver announcement
Message-ID: <4F5D1661.7030608@tuis.net>

Hi all.

way-back-when... about July last year, I ran into a problem resolving documents against W3C resources. Essentially the problem is described here:

http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/

I thought JDOM was a good location for building a solution to this problem. I even created an 'issue' for it...
https://github.com/hunterhacker/jdom/issues/26

I later decided that JDOM was not necessarily the correct place to solve that problem, so I 'rejected' that issue.

But I have still been perplexed by this problem for a while now, and I have taken some time in the past few weeks to tackle the problem, and perhaps come up with a solution.

Thus, I invite anyone interested to have a look at:
https://github.com/rolfl/Resolver

This project has the 'simple' purpose of behaving very much like a caching proxy server for HTTP documents and exposing the cache as an EntityResolver useful for SAX and other parsing.

I decided to tackle the hard parts first - how do you build a file-based cache in a multithreaded system, with the added complexity that it needs to be accessible from multiple JVMs, not just threads within one JVM.

I figure that the code is too 'immature' to call 'stable', and it is not a great fit for JDOM (since the solution has no code shared with anything in JDOM, and it does not even process any XML...).
So, releasing it as part of JDOM2 is not appropriate, but its usefulness is significant.

So, if anyone is interested, I am eager to get some input on it...

I think an attempt to make an 'easy to use' system for entity resolving would be a benefit for the entire Java community... A system that allows you to plug in a combination of in-memory cached entities, combined with on-disk 'catalog' systems (perhaps leveraging the xerces 'Resolver' project), then this 'Resolver' for caching non-catalog resources, and finally a fall through to more traditional URL-based resolvers would be ideal.

Thanks

Rolf

From mike at saxonica.com Sun Mar 11 15:32:20 2012
From: mike at saxonica.com (Michael Kay)
Date: Sun, 11 Mar 2012 22:32:20 +0000
Subject: [jdom-interest] Resolver announcement
In-Reply-To: <4F5D1661.7030608@tuis.net>
References: <4F5D1661.7030608@tuis.net>
Message-ID: <4F5D27F4.9050602@saxonica.com>

In Saxon 9.4 I have addressed this problem by including a copy of the most common resources within the Saxon JAR file, and ensuring that when Saxon itself allocates the XMLReader, it uses an EntityResolver that grabs these local copies of resources when available. But Saxon isn't architecturally the right place for the solution, any more than JDOM is.

I like the idea of a caching resolver: except that surely, the best way to offer this to the world is as an implementation of XMLReader that wraps an underlying XMLReader with a caching entity resolver. Then anyone who picks up this XMLReader implementation will automatically get the caching behaviour - even if they implement their own EntityResolver on top.

But I think a variant of the caching resolver that only uses a pre-initialized cache containing the common W3C files, and doesn't attempt any dynamic caching, might be even more useful, because it would avoid needing access to writable filestore, and the synchronization and permissions issues that this introduces.
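
[A minimal Java sketch of the pre-initialized, read-only variant described above, assuming the common resources are already held in memory as byte arrays. The class name PreloadedEntityResolver, its constructor and the isCached helper are hypothetical and are not taken from Saxon, JDOM or the Resolver project; only the org.xml.sax interfaces are real.]

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical sketch: serves known system IDs from a preloaded, read-only map. */
final class PreloadedEntityResolver implements EntityResolver {

    private final Map<String, byte[]> cache;
    private final EntityResolver fallback; // may be null

    PreloadedEntityResolver(Map<String, byte[]> preloaded, EntityResolver fallback) {
        // Defensive copy: the map is never written to after construction,
        // so no writable filestore or cross-thread synchronization is needed.
        this.cache = new HashMap<String, byte[]>(preloaded);
        this.fallback = fallback;
    }

    /** True if the given system ID can be answered without network access. */
    boolean isCached(String systemId) {
        return systemId != null && cache.containsKey(systemId);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        byte[] data = systemId == null ? null : cache.get(systemId);
        if (data != null) {
            InputSource source = new InputSource(new ByteArrayInputStream(data));
            source.setPublicId(publicId);
            source.setSystemId(systemId);
            return source;
        }
        // Returning null asks the parser to open the system ID itself.
        return fallback == null ? null : fallback.resolveEntity(publicId, systemId);
    }
}
```

[Such a resolver could be handed to XMLReader.setEntityResolver(...), or chained in front of a dynamic cache by passing that cache as the fallback.]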
Such a beast could easily be carved out of the existing Saxon code and turned into a freestanding component.

Michael Kay
Saxonica

On 11/03/2012 21:17, Rolf Lear wrote:
> Hi all.
>
> way-back-when... about July last year, I ran in to a problem resolving
> documents against W3C resources. Essentially the problem is described
> here:
>
> http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/
>
> I thought JDOM was a good location for building a solution to this
> problem. I even created an 'issue' for it...
> https://github.com/hunterhacker/jdom/issues/26
>
> I later decided that JDOM was not necessarily the correct place to
> solve that problem, so I 'rejected' that issue.
>
> But I have still been perplexed by this problem for a while now, and I
> have taken some time in the past few weeks to tackle the problem, and
> perhaps come up with a solution.
>
> Thus, I invite anyone interested to have a look at:
> https://github.com/rolfl/Resolver
>
> This project has the 'simple' purpose of behaving very much like a
> caching proxy server for HTTP documents and exposing the cache as an
> EntityResolver useful for SAX and other parsing.
>
> I decided to tackle the hard parts first - how do you build a
> file-based cache in a multithreaded system, with the added complexity
> that it needs to be accessible from multiple JVM's, not just threads
> within one JVM.
>
> I figure that the code is too 'immature' to call 'stable', and it is
> not a great fit for JDOM (since the solution has no code shared with
> anything in JDOM, and it does not even process any XML...). So,
> releasing it as part of JDOM2 is not appropriate, but its usefulness
> is significant.
>
> So, if anyone is interested, I am eager to get some input on it...
>
> I think an attempt to make an 'easy to use' system for entity
> resolving would be a benefit for the entire Java community...
A system
> that allows you to plug in a combination of in-memory cached entities,
> combined with on-disk 'catalog' systems (perhaps leveraging the xerces
> 'Resolver' project, then this 'Resolver' for caching non-catalog
> resources, finally a fall through to more traditional URL-based
> resolvers would be ideal.
>
> Thanks
>
> Rolf
> _______________________________________________
> To control your jdom-interest membership:
> http://www.jdom.org/mailman/options/jdom-interest/youraddr at yourhost.com
>

From jdom at tuis.net Sun Mar 11 16:23:30 2012
From: jdom at tuis.net (Rolf Lear)
Date: Sun, 11 Mar 2012 19:23:30 -0400
Subject: [jdom-interest] Resolver announcement
In-Reply-To: <4F5D27F4.9050602@saxonica.com>
References: <4F5D1661.7030608@tuis.net> <4F5D27F4.9050602@saxonica.com>
Message-ID: <4F5D33F2.7010603@tuis.net>

On 11/03/2012 6:32 PM, Michael Kay wrote:
> In Saxon 9.4 I have addressed this problem by including a copy of the
> most common resources within the Saxon JAR file, and ensuring that when
> Saxon itself allocates the XMLReader, it uses an EntityResolver that
> grabs these local copies of resources when available. But Saxon isn't
> architecturally the right place for the solution, any more than JDOM is.
>
> I like the idea of a caching resolver: except that surely, the best way
> to offer this to the world is as an implementation of XMLReader that
> wraps an underlying XMLReader with a caching entity resolver. Then
> anyone who picks up this XMLReader implementation will automatically get
> the caching behaviour - even if they implement their own EntityResolver
> on top.
>
> But I think a variant of the caching resolver that only uses a
> pre-initialized cache containing the common W3C files, and doesn't
> attempt any dynamic caching, might be even more useful, because it would
> avoid needing access to writable filestore, and the synchronization and
> permissions issues that this introduces.
>
> Such a beast could easily be carved out of the existing Saxon code and
> turned into a freestanding component.
>
> Michael Kay
> Saxonica
>

I had considered that having a 'commonly used' repository of files would be an option, but that is very, very close to being a 'catalog', and that is solved. The problem I have run into (often) is the availability of the resource data... when you need it.... and I have found that more now with JDOM maintenance.

I think the 'opportunity' for improvement is not in making a new web catalog, but in making an updatable 'catalog'. Existing catalog systems are not thread-safe for update, and that is the exact functionality I think is needed.

As for how the 'cache' is presented, I think that it will be a case of 'wrapping' it in any number of ways to be useful... but, at the lowest level, it is just an EntityResolver, so I intended to start with that.

One 'novel' idea I have is that the cache is fully 'zippable', and it would be relatively trivial to zip up the cache, and unzip it in another location, and 'seed' a system.... and, perhaps I can make a 'zip' of the entire w3c.org resources.... which is what would be very useful... I have asked w3c for a complete 'catalog' of their resources, and there is none available....

Similarly, it would be relatively trivial to convert the cache into a 'catalog' that disregards the 'expires' time for these resources.

Specifically for the XMLReader suggestion, I think the proposed solution I have is specific enough to be out-of-shape with anything out there... for a start, it only resolves http(s) resources.... It needs to be a 'small' part of a bigger system.

I think the idea I have at the moment is to see how this functionality could fit in with other systems.... From what I can tell, there is no available system that does a local 'dynamic' cache of web-based resources. I think that is the 'gap' that needs to be solved....

Obviously, I could be completely wrong about that ...
;-)

Rolf

From jdom at tuis.net Mon Mar 12 17:52:28 2012
From: jdom at tuis.net (Rolf Lear)
Date: Mon, 12 Mar 2012 20:52:28 -0400
Subject: [jdom-interest] JDOM BETA 3
Message-ID: <4F5E9A4C.60403@tuis.net>

Happy "March 12"

JDOM2's third, and hopefully final Beta release is available.

Like previous releases, this is available for download from github.
https://github.com/hunterhacker/jdom/downloads

Specifically, the third beta release is:
https://github.com/downloads/hunterhacker/jdom/jdom2-0.0.3-BETA.zip

The Github Javadoc, JUnit, and Coverage pages have all been updated to match this new BETA release:
https://github.com/hunterhacker/jdom/wiki/JDOM-2.0#wiki-links

A few things are different this time:
1. Added 'Specified' flag to attribute:
https://github.com/hunterhacker/jdom/wiki/JDOM2-Feature:-Attribute-Specified
2. Moved the TextHelper code to core.
3. AbstractFilter is now public (not package private).

Please take this Beta for a spin, and share your experiences... good and bad.

I believe that this will be the final BETA release. The final JDOM2 release will be over the Easter period.

Have fun!

Rolf

From jdom at tuis.net Thu Mar 15 16:03:31 2012
From: jdom at tuis.net (Rolf Lear)
Date: Thu, 15 Mar 2012 19:03:31 -0400
Subject: [jdom-interest] Possible bug in JDom2
In-Reply-To: <4F626651.1010705@tuis.net>
References: <4F626651.1010705@tuis.net>
Message-ID: <4F627543.2010700@tuis.net>

Hi Craig.

I think my previous reply was a bit hurried... thinking about it more, I get the following...

- It would be nice to return 'Iterable', because that would make the 'enhanced' for possible.... but how important is the enhanced for?
- it would require a new method (actually two) in org.jdom2.Parent
- it would also be nice to have a method getSelfAndDescendants()
- but that would be complicated on Document, because 'self' is not Content (it's Document).

So, I think the idea of getSelfAndDescendants is nice, but not viable.
I think the idea of a new method returning an Iterable is nice, and it is viable, but is it necessary? I am not convinced.... yet.

The "enhanced for" (for-each) loop is just a convenience. It technically does not add any new functionality (other than a 'simpler' line of code). Is it really that much harder to do:

    for (Iterator<Element> it = element.getDescendants(filter); it.hasNext(); ) {
        Element e = it.next();
        // do something.
    }

than

    for (Element e : element.getDescendantIterable(filter)) {
        // do something
    }

Actually, looking at the above, it is a significant difference, I guess.

Ok, I am more convinced than before... maybe it would be useful, but then it will be messy too to have the 'old' getDescendants() methods too.

I will think about it some more... if there was a 'logical' way to express these new methods (good, meaningful, non-confusing names) it would make it an easier decision... What would be a good name?

While I have your code example 'in mind' I thought I would point out a couple of other things.... Have you seen there is the org.jdom2.Filters class? It makes some other lines simpler too. Instead of:

    ElementFilter tableFilter = new ElementFilter("Table");

you can do:

    Filter<Element> tableFilter = Filters.element("Table");

Well, that's not exactly better... the exact same number of characters.... oh, it does make a difference if you do not 'keep' the tableFilter instance...

    Iterator<Element> it = root.getDescendants(Filters.element("Table"));

Another thing, in your example you could possibly consider an XPath...

    XPathExpression<Object> xp = XPathFactory.instance().compile("//Table");
    int tablesize = xp.evaluate(root).size();

If you want the results as a list of Element:

    XPathExpression<Element> xp = XPathFactory.instance().compile(
        ".//Table", Filters.element());
    int tablesize = xp.evaluate(root).size();

Anyway, there is some food for thought in all of this.

Rolf

On 15/03/2012 5:59 PM, Rolf Lear wrote:
> Hi Craig.
>
> getDescendants returns an Iterator not an Iterable
>
> Now that I think about it, it is a mess, but, that's because JDOM 1.x
> returned an iterator.
>
> Technically your code should be:
>
> for (Iterator<Element> it = root.getDescendants(tableFilter);
> it.hasNext(); ) {
> it.next(); // advance the iterator, otherwise the loop never ends
> tableCount++;
> }
>
> I wonder whether I can make an 'Iterable' return value too.... it makes
> sense to, but I can't change the current return value for getDescendants
> without breaking compatibility...
>
>
> suggestions?
>
> Rolf
>
>
>
> On 15/03/2012 3:59 PM, Craig Noah wrote:
>> I've downloaded the latest JDom2 beta today and am working to
>> incorporate it into some new code. I am developing against Java6, so I
>> would expect iterators to work. However, the following code fails to
>> compile (with JDom2 includes):
>>
>> SAXBuilder sax = new SAXBuilder();
>> Document xml = sax.build (source); // source is a File object
>> Element root = xml.getRootElement();
>> ElementFilter tableFilter = new ElementFilter ("Table");
>> int tableCount = 0;
>> for (Element table : root.getDescendants(
>> tableFilter)) {
>> tableCount++;
>> }
>>
>> The compile-time error that I get states, "Can only iterate over an
>> array or an instance of java.lang.Iterable". Since
>> Element.getDescendants (Filter) returns a java.util.Iterator, I
>> would expect my code to compile and work. What am I missing?
>>
>> Sincerely,
>> Craig

From thomas.scheffler at uni-jena.de Thu Mar 15 23:38:02 2012
From: thomas.scheffler at uni-jena.de (Thomas Scheffler)
Date: Fri, 16 Mar 2012 07:38:02 +0100
Subject: [jdom-interest] Possible bug in JDom2
In-Reply-To: <4F627543.2010700@tuis.net>
References: <4F626651.1010705@tuis.net> <4F627543.2010700@tuis.net>
Message-ID: <4F62DFCA.3020703@uni-jena.de>

On 16.03.2012 00:03, Rolf Lear wrote:
> Hi Craig.
>
> I think my previous reply was a bit hurried... thinking about it more, I
> get the following...
>
> - It would be nice to return 'Iterable', because that would make the
> 'enhanced' for possible.... but how important is the enhanced for?
> - it would require a new method (actually two) in org.jdom2.Parent
> - it would also be nice to have a method getSelfAndDescendants()
> - but that would be complicated on Document, because 'self' is not
> Content (it's Document).

Hi Rolf,

my suggestion:

create an Interface IterableIterator that implements both interfaces (Iterable, Iterator) and make this the new return type. This will keep the compatibility and allows those fancy for loops.

Regards

Thomas

From jdom at tuis.net Fri Mar 16 02:47:29 2012
From: jdom at tuis.net (Rolf Lear)
Date: Fri, 16 Mar 2012 05:47:29 -0400
Subject: [jdom-interest] Possible bug in JDom2
In-Reply-To: <4F62DFCA.3020703@uni-jena.de>
References: <4F626651.1010705@tuis.net> <4F627543.2010700@tuis.net> <4F62DFCA.3020703@uni-jena.de>
Message-ID: <4F630C31.3040305@tuis.net>

That seems almost too easy....

I can see it being very easy to do in this particular use case.... I will have a good look.
It will be the documentation that's hardest to do, which is fine....

Thanks

Rolf

On 16/03/2012 2:38 AM, Thomas Scheffler wrote:
> On 16.03.2012 00:03, Rolf Lear wrote:
>> Hi Craig.
>>
>> I think my previous reply was a bit hurried... thinking about it more, I
>> get the following...
>>
>> - It would be nice to return 'Iterable', because that would make the
>> 'enhanced' for possible.... but how important is the enhanced for?
>> - it would require a new method (actually two) in org.jdom2.Parent
>> - it would also be nice to have a method getSelfAndDescendants()
>> - but that would be complicated on Document, because 'self' is not
>> Content (it's Document).
>
> Hi Rolf,
>
> my suggestion:
>
> create an Interface IterableIterator that implements both interfaces
> (Iterable, Iterator) and make this the new return type. This will keep
> the compatibility and allows those fancy for loops.
>
> Regards
>
> Thomas

From jdom at tuis.net Fri Mar 16 04:57:31 2012
From: jdom at tuis.net (Rolf Lear)
Date: Fri, 16 Mar 2012 07:57:31 -0400
Subject: [jdom-interest] Possible bug in JDom2
In-Reply-To: <4F630C31.3040305@tuis.net>
References: <4F626651.1010705@tuis.net> <4F627543.2010700@tuis.net> <4F62DFCA.3020703@uni-jena.de> <4F630C31.3040305@tuis.net>
Message-ID:

So, it is done... and committed. It ended up being even easier than I thought.

Thanks for the suggestion. And it does look better 'in practice'.

I have a few documentation type changes pending, as well as a tidy-up of the 'AndFilter' code. Together with this change I think there is enough reason to push out a new Beta build later today (when I get home).

Rolf

On Fri, 16 Mar 2012 05:47:29 -0400, Rolf Lear wrote:
> That seems almost too easy....
>
> I can see it being very easy to do in this particular use case.... I
> will have a good look.
It will be the documentation that's hardest to
> do, which is fine....
>
> Thanks
>
> Rolf
>

From jdom at tuis.net Fri Mar 16 20:44:01 2012
From: jdom at tuis.net (Rolf Lear)
Date: Fri, 16 Mar 2012 23:44:01 -0400
Subject: [jdom-interest] JDOM2 Beta 4 Released
Message-ID: <4F640881.8030207@tuis.net>

Hi again everyone.

Given the two structural changes I have made recently it seems appropriate to release another Beta.

JDOM2 Beta 4 is now available from the usual places:
https://github.com/hunterhacker/jdom/downloads
and maven central under group org.jdom artifact jdom2 and version 0.0.4 (give it an hour or so....).

The JavaDoc, Code Coverage, JUnit Test Results, and Performance details are all available at their regular locations:
http://hunterhacker.github.com/jdom/jdom2/apidocs/index.html
http://hunterhacker.github.com/jdom/jdom2/coverage/index.html
http://hunterhacker.github.com/jdom/jdom2/junit.report/index.html
http://hunterhacker.github.com/jdom/jdom2/performance.html

There are a few significant changes in this release (API changes, in fact, which I was hoping to avoid in the beta cycles):
- Filter.and(Filter) now returns a typed Filter, instead of Filter
- Parent.getDescendants() now returns an instance which can be treated as either an Iterator (fully compatible with the JDOM 1.x behaviour) or an Iterable (useful for enhanced-for loops).
- I have cleaned up the org.jdom2.input package and moved some StAX-specific helper classes to org.jdom2.input.stax. I have also changed the way that the StAX classes access some StAX constants by doing a static import instead of an 'implements'. I do not believe the StAX changes will be a 'big deal' for anyone.
- I have done a lot of tidying up on the JavaDoc

It appears that there are a number of people playing with JDOM2 based on the downloads from GitHub. By pushing out additional Beta releases it makes it easier for these people to test the latest code changes.
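
[The Iterator-or-Iterable return type mentioned in the change list above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the JDOM2 source: the interface name IterableIterator is borrowed from the suggestion earlier in the thread, and SimpleIterableIterator and IterableIteratorDemo are hypothetical helper names.]

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Name borrowed from the thread: one type usable as Iterator and as Iterable. */
interface IterableIterator<T> extends Iterable<T>, Iterator<T> {
}

/** Illustrative adapter around any Iterator; not JDOM's actual implementation. */
final class SimpleIterableIterator<T> implements IterableIterator<T> {

    private final Iterator<T> source;

    SimpleIterableIterator(Iterator<T> source) {
        this.source = source;
    }

    public boolean hasNext() {
        return source.hasNext();
    }

    public T next() {
        return source.next();
    }

    public void remove() {
        source.remove();
    }

    // Returns this same single-pass iterator, so a second for-each loop
    // over the same instance would find nothing left to visit.
    public Iterator<T> iterator() {
        return this;
    }
}

final class IterableIteratorDemo {

    /** Counts items with the enhanced-for loop, which only needs Iterable. */
    static int count(IterableIterator<String> items) {
        int n = 0;
        for (String ignored : items) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Table", "Row", "Table");
        // Old style: explicit Iterator calls keep working unchanged.
        IterableIterator<String> it = new SimpleIterableIterator<String>(names.iterator());
        System.out.println(it.next());
        // New style: the remaining items are consumed by a for-each loop.
        System.out.println(count(it));
    }
}
```

[Because the adapter hands back itself from iterator(), code written against the JDOM 1.x style Iterator contract and code using for-each can share one return value, at the cost of the object being usable for a single pass only.]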
Please continue to get a feel for JDOM2, and do not hesitate to ask questions, express concerns, and offer suggestions. In fact, if there is any way I can make it easier for you to get this latest code to try it out, please speak up!

Have a great St. Patrick's Day

Rolf

From jdom at tuis.net Sat Mar 17 06:01:11 2012
From: jdom at tuis.net (Rolf Lear)
Date: Sat, 17 Mar 2012 09:01:11 -0400
Subject: [jdom-interest] Why are version numbers so complicated?
Message-ID: <4F648B17.4010502@tuis.net>

Hi all

I am getting (trying to get) things all nicely tidied up, organized, and automated for the JDOM2 releases. But, one thing I have not yet sorted out is the version number(s) for the 'final' JDOM2 release.

Here are the factors that influence the decision:
- the www.jdom.org site is the official release site for all things JDOM.
- The www.jdom.org site is going to need to have both versions (1.x and 2.x) available simultaneously.
- maven has version number requirements
- maven has some automated processing for dependency management
- Technically the Java package is org.jdom2, not org.jdom
- 'in my head' I have JDOM 1.x and JDOM2
- there is an established tradition for JDOM 1.x
- there is already some sort of 'consistency' for JDOM2
- I anticipate there to be relatively routine releases for JDOM2 as new features are added and existing bugs fixed. I want it to be easy for new versions to be pushed out, and I want to be able to tell people 'just get the latest version' if there is a problem.

About the maven requirements: maven has a hierarchy of resources. At the top of the hierarchy is a 'group'. We are the 'org.jdom' group. Each group releases 'things', which in maven speak is an 'artifact'. Each artifact has versions. For example, there is the group 'org.jdom', with the artifacts 'jdom', and 'jdom-contrib', and it so happens that there are the jdom versions 1.1.2 and 1.1.3, as well as the jdom-contrib version 1.1.3.
There is no special 'maven' reason for the version number jdom 1.1.3 to match jdom-contrib 1.1.3.

I have also recently added the jdom2 artifact id for the last couple of JDOM2 beta releases. The reason I added jdom2 is because maven dependencies can be automated, where maven users can say 'I want to use the latest version of the org.jdom artifact jdom'. I don't want people who expect to use JDOM 1.x to suddenly start getting JDOM2. But, I also needed to test whether I can do the releases to maven, and to make sure that the releases work. Finally, maven has an ordering for versions. It is logical, and systematic, but it means that, if we want to use the two artifacts (jdom and jdom2) that I have already created, we need to keep in mind that the first available versions for them are 1.1.4 and 0.0.5 respectively.

So, thinking ahead to the first full JDOM2 release, should it be:

JDOM version 2
JDOM version 2.0.0
JDOM2 version 1
JDOM2 version 1.0.0
JDOM2 version 2
JDOM2 version 2.0.0

Additionally, should I push the JDOM2 release out to the 'jdom' artifact on maven-central, or should I push it to the 'jdom2' artifact? (it affects the options for version numbers).

Anyway, the point is that I am wholly uncertain as to what the 'right' answer to this is. Does anyone have any suggestions, notice anything I have missed, etc?

Thanks

Rolf

From paul at hoplahup.net Sat Mar 17 06:22:06 2012
From: paul at hoplahup.net (Paul Libbrecht)
Date: Sat, 17 Mar 2012 14:22:06 +0100
Subject: [jdom-interest] Why are version numbers so complicated?
In-Reply-To: <4F648B17.4010502@tuis.net>
References: <4F648B17.4010502@tuis.net>
Message-ID: <32F0C90E-5B84-45FC-B8E7-AB9B3CE6877F@hoplahup.net>

Rolf,

Your approach is almost perfect I think. The jdom2 artifactId is useful and correct. The rest is all a matter of ordering right? isn't maven saying that 2 < 2.1 ? So the first version of 2 should just be 2? That would be my guess.

paul

On 17 March 2012 at 14:01, Rolf Lear wrote:

> Hi all
>
> I am getting (trying to get) things all nicely tidied up, organized, and automated for the JDOM2 releases. But, one thing I have not yet sorted out is the version number(s) for the 'final' JDOM2 release.
>
> Here are the factors that influence the decision:
> - the www.jdom.org site is the official release site for all things JDOM.
> - The www.jdom.org site is going to need to have both versions (1.x and 2.x) available simultaneously.
> - maven has version number requirements
> - maven has some automated processing for dependency management
> - Technically the Java package is org.jdom2, not org.jdom
> - 'in my head' I have JDOM 1.x and JDOM2
> - there is an established tradition for JDOM 1.x
> - there is already some sort of 'consistency' for JDOM2
> - I anticipate there to be relatively routine releases for JDOM2 as new features are added and existing bugs fixed. I want it to be easy for new versions to be pushed out, and I want to be able to tell people 'just get the latest version' if there is a problem.
>
> About the maven requirements: maven has a hierarchy of resources. At the top of the hierarchy is a 'group'. We are the 'org.jdom' group. Each group releases 'things', which in maven speak is an 'artifact'. Each artifact has versions. For example, there is the group 'org.jdom', with the artifacts 'jdom', and 'jdom-contrib', and it so happens that there are the jdom versions 1.1.2 and 1.1.3, as well as the jdom-contrib version 1.1.3. There is no special 'maven' reason for the version number jdom 1.1.3 to match jdom-contrib 1.1.3.
>
> I have also recently added the jdom2 artifact id for the last couple of JDOM2 beta releases. The reason I added jdom2 is because maven dependencies can be automated, where maven users can say 'I want to use the latest version of org.jdom artifact 'jdom'. I don't want people who expect to use JDOM 1.x to suddenly start getting JDOM2.
But, I also needed to test whether I can do the releases to maven, and to make sure that the releases work. Finally, maven has an ordering for versions. It is logical, and systematic, but it means that, if we want to use the two artifacts (jdom and jdom2) that I have already created, we need to keep in mind that the first available versions for them are 1.1.4 and 0.0.5 respectively.
>
> So, thinking ahead to the first full JDOM2 release, should it be:
>
> JDOM version 2
> JDOM version 2.0.0
> JDOM2 version 1
> JDOM2 version 1.0.0
> JDOM2 version 2
> JDOM2 version 2.0.0
>
> Additionally, should I push the JDOM2 release out to the 'jdom' artifact on maven-central, or should I push it to the 'jdom2' artifact? (it affects the options for version numbers).
>
> Anyway, the point is that I am wholly uncertain as to what the 'right' answer to this is. Does anyone have any suggestions, notice anything I have missed, etc?
>
> Thanks
>
> Rolf

From Thomas.Dehaene at howest.be Sat Mar 17 08:14:42 2012
From: Thomas.Dehaene at howest.be (Dehaene Thomas)
Date: Sat, 17 Mar 2012 15:14:42 +0000
Subject: [jdom-interest] Passing Element parameter in Jax-ws web method
Message-ID:

Hello,

I have a problem when accepting a web parameter of the type Element. My code is as follows:

    @WebMethod(operationName = "hello")
    public void hello(@WebParam(name = "xmlElement") Element element){
        String name = new String();
        name = element.getChild("FirstName").getValue();
    }

This generates an error when I try to deploy the web service, namely that the Element class does not have a no-arg constructor.

If this were a normal method, I would simply make it a 'public static void' method, but since WebMethods in Jax-ws can't be of the 'final' or 'static' type, I can't seem to accept an Element parameter.

Any suggestions how to solve this?
Many thanks and greetings

Thomas Dehaene

From jhunter at servlets.com Sat Mar 17 08:45:13 2012
From: jhunter at servlets.com (Jason Hunter)
Date: Sat, 17 Mar 2012 10:45:13 -0500
Subject: [jdom-interest] Why are version numbers so complicated?
In-Reply-To: <4F648B17.4010502@tuis.net>
References: <4F648B17.4010502@tuis.net>
Message-ID:

> I am getting (trying to get) things all nicely tidied up, organized, and automated for the JDOM2 releases. But, one thing I have not yet sorted out is the version number(s) for the 'final' JDOM2 release.

My vote would be JDOM version 2.0.0. I think people will casually refer to it as JDOM 2, which is fine.

In Maven I'm OK either way. If it's under the artifact name jdom2 we avoid accidental breakage, but we will have people for years initiating new projects against jdom instead of jdom2 just by accident.

-jh-

From jdom at tuis.net Sat Mar 17 08:47:56 2012
From: jdom at tuis.net (Rolf Lear)
Date: Sat, 17 Mar 2012 11:47:56 -0400
Subject: [jdom-interest] Passing Element parameter in Jax-ws web method
In-Reply-To:
References:
Message-ID: <4F64B22C.4070600@tuis.net>

Hi Dehaene

Technically, the Element class does have a no-arg constructor, only it is 'protected', not public.

For you to solve your problem in the short term, you may want to do:

    // Subclass that exposes Element's protected no-arg constructor as public
    class MyElement extends Element {
        public MyElement() {
            super();
        }
    }

Then you can make the method:

    public void hello(@WebParam(name = "xmlElement") MyElement element){

I am not familiar enough with jax-ws to know whether this will help enough.... It will make the implementation a little messier, but, by default, JDOM does not let you create invalid Elements (which an Element without a name would be...).

It is not likely possible to make the no-arg constructor public.

Another option is to create a serialized Element? Element serialization (at least in JDOM2) is reliable....
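To illustrate the constructor-visibility point, here is a minimal sketch using plain stand-in classes rather than JDOM itself. The names `NamedNode`, `PublicNamedNode` and `NoArgConstructorSketch` are hypothetical, and the sketch only assumes that a framework looks up a public no-arg constructor by reflection; whether jax-ws is then satisfied with such a subclass is not confirmed here.

```java
import java.lang.reflect.Constructor;

// Stand-in for a library class whose no-arg constructor is protected
// (hypothetical; this is not the JDOM Element class).
class NamedNode {
    protected String name;

    protected NamedNode() {
        // intentionally leaves 'name' unset, like an element without a name
    }

    public NamedNode(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }
}

// Subclass that re-exposes the protected no-arg constructor as public.
class PublicNamedNode extends NamedNode {
    public PublicNamedNode() {
        super();
    }
}

final class NoArgConstructorSketch {

    /** Returns true if 'type' declares a public no-arg constructor. */
    static boolean hasPublicNoArgConstructor(Class<?> type) {
        try {
            // Class.getConstructor only finds public constructors.
            Constructor<?> ctor = type.getConstructor();
            return ctor != null;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("NamedNode:       " + hasPublicNoArgConstructor(NamedNode.class));
        System.out.println("PublicNamedNode: " + hasPublicNoArgConstructor(PublicNamedNode.class));
    }
}
```

The trade-off is the one described above: an instance created through the public no-arg constructor has no name until one is set, so the subclass gives up the guarantee that every instance is valid.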
Rolf On 17/03/2012 11:14 AM, Dehaene Thomas wrote: > Hello, > > I have a problem when accepting a web parameter of the type Element. My code is as follows: > > @WebMethod(operationName = "hello") > public void hello(@WebParam(name = "xmlElement") Element element){ > > String name = new String(); > name = element.getChild("FirstName").getValue(); > > } > > Which generates an error when I try to deploy the web service. Namely that the Element class > does not have a no-arg constructor. > > If this where a normal method, I would simply make it a 'public static void' method, > but since WebMethods in Jax-ws can't be of the 'final' or 'static' type, I can't seem to > accept an Element parameter. > > Any suggestions how to solve this? > > > > > Many thanks and greetings > > Thomas Dehaene > _______________________________________________ > To control your jdom-interest membership: > http://www.jdom.org/mailman/options/jdom-interest/youraddr at yourhost.com > From jdom at tuis.net Sat Mar 17 10:09:48 2012 From: jdom at tuis.net (Rolf Lear) Date: Sat, 17 Mar 2012 13:09:48 -0400 Subject: [jdom-interest] Why are version numbers so complicated? In-Reply-To: References: <4F648B17.4010502@tuis.net> Message-ID: <4F64C55C.1040604@tuis.net> On 17/03/2012 11:45 AM, Jason Hunter wrote: >> I am getting (trying to get) things all nicely tidied up, organized, and automated for the JDOM2 releases. But, one thing I have not yet sorted out is the version number(s) for the 'final' JDOM2 release. > > My vote would be JDOM version 2.0.0. I think people will casually refer to it as JDOM 2, which is fine. > > In Maven I'm OK either way. If it's under the artifact name jdom2 we avoid accidental breakage, but we will have people for years initiating new projects against jdom instead of jdom2 just by accident. > > -jh- > Good point about the jdom maven artifact.... which is contrary to Paul's opinion, but I see more merit in making the 'path' more obvious. 
Which sort of forces the version numbering.... it must be artifact 'jdom' with a version starting with 2 or 2.0 or 2.0.0.

People who have 'wild-card' dependencies in maven can just set an exclusive upper-bound if they want to link to JDOM 1.x, like: [1.0,2.0) will include the latest 1.x version, but not anything 2.0.0 or later.

I think I also prefer the full 3-digit version number.... If there is going to be a 2.1.1 then I think it should start with 2.0.0. In other words, make it consistent. I think that JDOM 1.x was odd (and just me calling it 1.x illustrates that) in the sense that it was JDOM 1.0, 1.1, 1.1.1, 1.1.2, 1.1.3

It is an interesting question whether it should be two or three digits. Is it going to be 2.0.0, 2.0.1, etc. Or 2.0, 2.1, etc. I think it is a question of release frequency and confidence.... and I think I have to be prepared for some relatively quick releases to start with, so I think I will do 2.0.0

I also think I will put a limit on the availability of some deprecated functions... I think if I do something like: version 2.1.0 will remove the deprecated XPath, Attribute-Type, and SAXBuilder classes/methods, and how about targeting 2.1 for about X-Mas 2012.

So, I think that clears up a few things....

- 3-digit version.
- the maven artifact will be jdom (and the jdom2 artifact can 'rot' - good for testing only).
- The 'official' name for JDOM2 will be JDOM 2.0.0 and so on.
- the release jar will be jdom-2.0.0.jar
- maven users requiring JDOM 1.x may have issues unless they set an upper bound on their dependencies.... this may be painful for them for a bit, but it is easy to fix, either by upgrading, or by limiting the upper-bound.

I expect most maven users actually require a specific version, so I don't think it will be a big issue.

The www.jdom.org site does not even need to have both releases 'active' concurrently, it can just go from 1.1.3 to 2.0.0. The 1.1.3 will be in the 'archive' area.
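As a rough illustration of what an exclusive upper bound such as [1.0,2.0) means, here is a minimal sketch for purely numeric versions of up to three parts. This is not Maven's own version-comparison code (which also handles qualifiers such as -BETA); the class and method names are made up for the example.

```java
final class VersionRangeSketch {

    /** Parses "a", "a.b" or "a.b.c" into three numbers; missing parts count as 0. */
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        int[] numbers = new int[3];
        for (int i = 0; i < parts.length && i < 3; i++) {
            // Throws NumberFormatException for qualifiers like "0.0.2-BETA";
            // this sketch deliberately only covers plain numeric versions.
            numbers[i] = Integer.parseInt(parts[i]);
        }
        return numbers;
    }

    /** Compares two versions part by part: negative, zero or positive. */
    static int compare(String left, String right) {
        int[] a = parse(left);
        int[] b = parse(right);
        for (int i = 0; i < 3; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return 0;
    }

    /** True if lowerInclusive <= version < upperExclusive, i.e. the range [lower,upper). */
    static boolean inRange(String version, String lowerInclusive, String upperExclusive) {
        return compare(version, lowerInclusive) >= 0
                && compare(version, upperExclusive) < 0;
    }

    public static void main(String[] args) {
        String[] candidates = {"1.0", "1.1.3", "2.0.0", "2.1.0"};
        for (String candidate : candidates) {
            System.out.println(candidate + " in [1.0,2.0): "
                    + inRange(candidate, "1.0", "2.0"));
        }
    }
}
```

With this reading, every 1.x release stays inside the range while 2.0.0 and anything later falls outside it, which is the behaviour a JDOM 1.x user would want from the upper bound.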
Still, there is more time before this needs to be decided, so further input is welcome.

Rolf

From jdom at tuis.net Sat Mar 17 11:56:12 2012
From: jdom at tuis.net (Rolf Lear)
Date: Sat, 17 Mar 2012 14:56:12 -0400
Subject: [jdom-interest] Why are version numbers so complicated?
In-Reply-To: <750FD915-30DE-45D5-A59B-B0780E48A331@hoplahup.net>
References: <4F648B17.4010502@tuis.net> <750FD915-30DE-45D5-A59B-B0780E48A331@hoplahup.net>
Message-ID: <4F64DE4C.1090401@tuis.net>

There is a practical issue in that (apart from the increased Jar size). I build the maven bundle using an ant task, and the ant task only has the current version of the source available. I would need to access the JDOM 1.x jar, unjar it, and then rejar it into the jdom 2.0.0 bundle. Not that it is impossible, but, at any one point in time, I only have one branch of the git repository open.

Also, remember, a requirement of the oss-nexus (the way I have linked in to maven central) is that I have to have the Javadoc and source available for all Jars. This would make the process too unwieldy for the perceived benefit.

I think that some maven users will be surprised by a new jdom 2.0.0 release which causes compile failures, but, this will be a distinct minority, and easily resolved.

Additionally, I think I need to say that I really want JDOM 1.x usage to 'die'. I personally have very little interest in maintaining JDOM 1.x. It is not 'sexy' work. I am also pragmatic in the sense that I know it takes time to migrate (for example, I know it will take years to accomplish that even where I work), but when JDOM 2.0.0 is available, I think people need to know in as many ways as possible....

So, I don't think there is any way that I will bundle JDOM 1.x with JDOM 2.0.0. It sends all sorts of 'wrong' messages. I think there is enough incentive as it is to move... but I am biased. I don't see how bundling jdom 1.x with jdom 2.x is going to add incentive though... it will just add to the inertia to 'stay'.
Rolf On 17/03/2012 1:49 PM, Paul Libbrecht wrote: > Why not make a package org.jdom with artifactId jdom version 2.0.0 > containing both packages (org.jdom and org.jdom2) with every class in > org.jdom deprecated? > > It's a bit more work but it might offer a good incentive for the move. > > paul > From curoli at gmail.com Sun Mar 18 04:19:16 2012 From: curoli at gmail.com (Oliver Ruebenacker) Date: Sun, 18 Mar 2012 07:19:16 -0400 Subject: [jdom-interest] Passing Element parameter in Jax-ws web method In-Reply-To: <4F64B22C.4070600@tuis.net> References: <4F64B22C.4070600@tuis.net> Message-ID: Hello, Technically, you can work around by putting an Element into a container (or extend it, as Rolf said), but you may want to consider not passing an Element, but something simpler instead. De-serialization re-creates all fields, including non-public ones, so the de-serialized Element should have a name if the original one had. A major problem with object serialization is class compatibility. Compatibility requires that class definitions during serialization agree with class definitions during de-serialization, including all non-public members (except transient ones) and including all dependencies (i.e. all classes needed to compile). This means if you have two Java processes communicate via serialization, and one of them had a code change (e.g. an updated library), this may break serialization, even if neither the classes directly involved, nor any API, nor any specifications have changed. If you run a server and deploy a client on the web, users may have cached versions of the client, and changes to the code may not immediately propagate to the client, which may break server-client communication. It is therefore advisable, that objects you want to serialize are as simple and stable as possible and that dependencies are few and under your control (i.e. you want to think twice before using some one else's library). Element has a lot of dependencies (e.g. 
it depends on Document). If you want to transmit XML, it is probably best to simply write it to a String and pass the String. After all, that's exactly what XML is for. In fact, serialization in Java internally uses XML.

Take care
Oliver

On Sat, Mar 17, 2012 at 11:47 AM, Rolf Lear wrote:
> Hi Dehaene
>
> Technically, the Element class does have a no-arg constructor, only it is
> 'protected', not public.
>
> For you to solve your problem in the short term, you may want to do:
>
> class MyElement extends Element {
>   public MyElement() {
>     super();
>   }
> }
>
> Then you can make the method:
>
> public void hello(@WebParam(name = "xmlElement") MyElement element){
>
>
> I am not familiar enough with jax-ws to know whether this will help
> enough.... It will make the implementation a little messier, but, by
> default, JDOM does not let you create invalid Elements (which an Element
> without a name would be...).
>
> It is not likely possible to make the no-arg constructor public.
>
> Another option is to create a serialized Element? Element serialization (at
> least in JDOM2) is reliable....
>
> Rolf
>
>
> On 17/03/2012 11:14 AM, Dehaene Thomas wrote:
>>
>> Hello,
>>
>> I have a problem when accepting a web parameter of the type Element. My
>> code is as follows:
>>
>>     @WebMethod(operationName = "hello")
>>     public void hello(@WebParam(name = "xmlElement") Element element){
>>
>>         String name = new String();
>>         name = element.getChild("FirstName").getValue();
>>
>>     }
>>
>> Which generates an error when I try to deploy the web service. Namely that
>> the Element class does not have a no-arg constructor.
>>
>> If this were a normal method, I would simply make it a 'public static
>> void' method, but since WebMethods in Jax-ws can't be of the 'final' or
>> 'static' type, I can't seem to accept an Element parameter.
>>
>> Any suggestions how to solve this?
>> >> >> >> >> Many thanks and greetings >> >> Thomas Dehaene >> _______________________________________________ >> To control your jdom-interest membership: >> http://www.jdom.org/mailman/options/jdom-interest/youraddr at yourhost.com >> > > _______________________________________________ > To control your jdom-interest membership: > http://www.jdom.org/mailman/options/jdom-interest/youraddr at yourhost.com -- Oliver Ruebenacker, Computational Cell Biologist Virtual Cell (http://vcell.org) SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) http://www.oliver.curiousworld.org From mikeb at mitre.org Mon Mar 19 03:23:50 2012 From: mikeb at mitre.org (Brenner, Mike) Date: Mon, 19 Mar 2012 10:23:50 +0000 Subject: [jdom-interest] Why are version numbers so complicated? In-Reply-To: <4F648B17.4010502@tuis.net> References: <4F648B17.4010502@tuis.net> Message-ID: <264449A6A521A14593F58479F46B34EB1374D6@IMCMBX03.MITRE.ORG> I did not do all the work you did on it, so I don't think of jdom version 2 as a "different product called jdom2". (BTW, the dual numbering of Java itself has always confused me -- is it java 1.7 or java 7 or java2 version 7 or java 2.7?) My vote would go to JDOM version 2.1 for the first production version of the Rolf Lear work, but I can't say that I would really care if you choose some other naming convention. -----Original Message----- From: jdom-interest-bounces at jdom.org [mailto:jdom-interest-bounces at jdom.org] On Behalf Of Rolf Lear Sent: Saturday, March 17, 2012 9:01 AM To: jdom Subject: [jdom-interest] Why are version numbers so complicated? Hi all I am getting (trying to get) things all nicely tidied up, organized, and automated for the JDOM2 releases. But, one thing I have not yet sorted out is the version number(s) for the 'final' JDOM2 release. Here are the factors that influence the decision: - the www.jdom.org site is the official release site for all things JDOM. 
- The www.jdom.org site is going to need to have both versions (1.x and 2.x) available simultaneously.
- maven has version number requirements
- maven has some automated processing for dependency management
- Technically the Java package is org.jdom2, not org.jdom
- 'in my head' I have JDOM 1.x and JDOM2
- there is an established tradition for JDOM 1.x
- there is already some sort of 'consistency' for JDOM2
- I anticipate there to be relatively routine releases for JDOM2 as new features are added and existing bugs fixed. I want it to be easy for new versions to be pushed out, and I want to be able to tell people 'just get the latest version' if there is a problem.

About the maven requirements: maven has a hierarchy of resources. At the top of the hierarchy is a 'group'. We are the 'org.jdom' group. Each group releases 'things', which in maven speak is an 'artifact'. Each artifact has versions. For example, there is the group 'org.jdom', with the artifacts 'jdom', and 'jdom-contrib', and it so happens that there are the jdom versions 1.1.2 and 1.1.3, as well as the jdom-contrib version 1.1.3. There is no special 'maven' reason for the version number jdom 1.1.3 to match jdom-contrib 1.1.3.

I have also recently added the jdom2 artifact id for the last couple of JDOM2 beta releases. The reason I added jdom2 is because maven dependencies can be automated, where maven users can say 'I want to use the latest version of org.jdom artifact jdom'. I don't want people who expect to use JDOM 1.x to suddenly start getting JDOM2. But, I also needed to test whether I can do the releases to maven, and to make sure that the releases work.

Finally, maven has an ordering for versions. It is logical, and systematic, but it means that, if we want to use the two artifacts (jdom and jdom2) that I have already created, we need to keep in mind that the first available versions for them are 1.1.4 and 0.0.5 respectively.
So, thinking ahead to the first full JDOM2 release, should it be:

JDOM version 2
JDOM version 2.0.0
JDOM2 version 1
JDOM2 version 1.0.0
JDOM2 version 2
JDOM2 version 2.0.0

Additionally, should I push the JDOM2 release out to the 'jdom' artifact on maven-central, or should I push it to the 'jdom2' artifact? (it affects the options for version numbers).

Anyway, the point is that I am wholly uncertain as to what the 'right' answer to this is. Does anyone have any suggestions, notice anything I have missed, etc?

Thanks

Rolf
_______________________________________________
To control your jdom-interest membership:
http://www.jdom.org/mailman/options/jdom-interest/youraddr at yourhost.com

From jdom at tuis.net Mon Mar 19 03:24:56 2012
From: jdom at tuis.net (Rolf Lear)
Date: Mon, 19 Mar 2012 06:24:56 -0400
Subject: [jdom-interest] Possible bug in JDom2
In-Reply-To: <4F66ECC2.7040106@peralex.com>
References: <4F626651.1010705@tuis.net> <4F66ECC2.7040106@peralex.com>
Message-ID: <4F670978.9030906@tuis.net>

In one sense, it would make sense for it to be a list, but the actual iterator returned is 'live', and a list could not be (I don't think).

As it is, even the iterator is complex enough; a list I think would be too challenging (and slow) to contemplate.... although it is worth investigating it further. Maybe my off-the-cuff assessment is wrong?

The iterator returned is read/write, but I can't see a way to make the list modifiable.

Still, I am open to some suggestions. I don't think I can get a working version of that out in less than a week or so, and that would be too long for an Easter release, I think.

If anyone is interested in taking a stab at making a list-based return type for getDescendants then I would happily consider it...

I will still take an hour or so to look into the feasibility some more

Rolf

On 19/03/2012 4:22 AM, Noel Grandin wrote:
> I would have thought that it should return some Collection sub-class,
> e.g.
List > > Since we're cleaning things up in JDOM2, now seems like a good time to > make a change like that. > > On 2012-03-15 23:59, Rolf Lear wrote: >> Hi Craig. >> >> getDescendants returns an Iterator not an Iterable >> >> Now that I think about it, it is a mess, but, that's because JDOM 1.x >> returned an iterator. >> >> Technically your code should be: >> >> for (Iterator it =root.getDescendants(tableFilter); >> it.hasNext(); ) { >> tableCount++; >> } >> >> I wonder whether I can make an 'Iterable' return value too.... it >> makes sense to, but I can't change the current return value for >> getDescendants without breaking compatibility... >> >> >> suggestions? >> >> Rolf >> >> >> >> On 15/03/2012 3:59 PM, Craig Noah wrote: >>> I've downloaded the latest JDom2 beta today and am working to >>> incorporate it into some new code. I am developing against Java6, so >>> I would expect iterators to work. However, the following code fails >>> to compile (with JDom2 includes): >>> >>> SAXBuilder sax = new SAXBuilder(); >>> Document xml = sax.build (source); // source is a File object >>> Element root = xml.getRootElement(); >>> ElementFilter tableFilter = new ElementFilter ("Table"); >>> int tableCount = 0; >>> for (Element table : root.getDescendants( >>> tableFilter)) { >>> tableCount++; >>> } >>> >>> The compile-time error that I get states, "Can only iterate over an >>> array or an instance of java.lang.Iterable". Since >>> Element.getDescendants (Filter) returns a java.util.Iterator, I >>> would expect my code to compile and work. What am I missing? 
>>> >>> Sincerely, >>> Craig >>> >>> >>> _______________________________________________ >>> To control your jdom-interest membership: >>> http://www.jdom.org/mailman/options/jdom-interest/youraddr at yourhost.com >> >> >> >> _______________________________________________ >> To control your jdom-interest membership: >> http://www.jdom.org/mailman/options/jdom-interest/youraddr at yourhost.com > > > > ------------------------------------------------------------------------ > Disclaimer: http://www.peralex.com/disclaimer.html > From jdom at tuis.net Mon Mar 19 03:48:27 2012 From: jdom at tuis.net (Rolf Lear) Date: Mon, 19 Mar 2012 06:48:27 -0400 Subject: [jdom-interest] Why are version numbers so complicated? In-Reply-To: <264449A6A521A14593F58479F46B34EB1374D6@IMCMBX03.MITRE.ORG> References: <4F648B17.4010502@tuis.net> <264449A6A521A14593F58479F46B34EB1374D6@IMCMBX03.MITRE.ORG> Message-ID: <4F670EFB.6050505@tuis.net> I think that is the close to the way it will happen... although I intend to go with a 3-digit version. Also, I am curious about why you suggest 2.1 instead of 2.0 (or in my 3-digit thinking would it be 2.0.1 or 2.1.0?). The way you qualify it as being 'version 2.1 for the first production version' seems to imply that you expect 2.0 or something for the first non-production version. I believe that the very first '2' release will be fully production-ready. The 'Easter' release date is for the 'final' and 'stable' release. I have been convinced that the release will be just 'JDOM', and not 'JDOM2' and that the version number will reflect the difference. In essence, as you say, JDOM remains the same product as before, just a new version. In my (current) way of 'sorting' out the version numbers, the three digits boil down to: a.b.c where 'a' is the 'API version', 'b' is the 'feature' version, and 'c' is the 'patch' version. The major change to the API from JDOM 1.x.x to 2.x.x is reflected in the version number. 
If any new features are added (in an API-stable way) then the 'feature' version could be updated to 2.1.x, and any bug fixes to a particular feature version will be reflected in the final digit.

I do believe there will be some additional features entering JDOM in the next year or so, so I expect there to be a 2.1.0 at some point (I am thinking XPath 2.0 support at a minimum), so while there may be some more 'regular' updates to JDOM, it does not imply that 2.0.0 is not production ready ... ;-)

So, I think, for the most part, you will find that the releases are similar to what you are suggesting (but with a 3-digit version, and the 'production ready' version will be at 2.0.0 not 2.1.0)

Rolf

On 19/03/2012 6:23 AM, Brenner, Mike wrote:
> I did not do all the work you did on it, so I don't think of jdom version 2 as a "different product called jdom2".
>
> (BTW, the dual numbering of Java itself has always confused me -- is it java 1.7 or java 7 or java2 version 7 or java 2.7?)
>
> My vote would go to JDOM version 2.1 for the first production version of the Rolf Lear work,
> but I can't say that I would really care if you choose some other naming convention.
>

From jdom at tuis.net Tue Mar 20 10:06:07 2012
From: jdom at tuis.net (Rolf Lear)
Date: Tue, 20 Mar 2012 13:06:07 -0400
Subject: [jdom-interest] Initial questions - build jdom
In-Reply-To:
References:
Message-ID: <2a077320403caeb4d4c430b85d4bc735@tuis.net>

Hi George.

I imagine you have pulled the 1.1.3 version of JDOM. The zip file is quite big, and contains a lot of 'stuff'. It does contain the JDOM jar (and a matching jar for javadocs and source if you use an IDE like eclipse).

Have a look in the zip file in the build/ folder, and you will find build/jdom-1.1.3.jar

Hope that helps.
Though, having said that, if you are new to JDOM I encourage you to start with JDOM 2.x (currently in very late beta, with the final version to be released in a couple of weeks).

JDOM 2.x will be much easier for you to start with because it has a more intuitive API (it has generics, 1.1.3 does not).

So, how about you get JDOM 2.x from https://github.com/downloads/hunterhacker/jdom/jdom2-0.0.4-BETA.zip

That zip file contains the three jars (jdom, apidocs, and sources), as well as the dependency jars you need to run it.

If you have any questions don't hesitate to shout out.

Happy coding

Rolf

On Tue, 20 Mar 2012 12:17:36 -0400, George Seese wrote:
> Windows 7, beginner to java.
>
>
> The download page said "Binary releases come with source, but you're not
> required to build the code yourself."
>

From jdom at tuis.net Tue Mar 20 11:32:55 2012
From: jdom at tuis.net (Rolf Lear)
Date: Tue, 20 Mar 2012 14:32:55 -0400
Subject: [jdom-interest] Initial questions - build jdom
In-Reply-To:
References:
Message-ID: <1181e3c2d20bfc928f0b1c6518d06949@tuis.net>

Hi again, George.

I put together a 'quick' primer for using JDOM (the current BETA version...):

https://github.com/hunterhacker/jdom/wiki/JDOM2:-A-Primer

Rolf

On Tue, 20 Mar 2012 12:17:36 -0400, George Seese wrote:
> Windows 7, beginner to java.
>
>
> The download page said "Binary releases come with source, but you're not
> required to build the code yourself."
>
> q1. But there is no "bin" folder. Doesn't that mean a build is needed?
>
>
> The README says the build "will generate a file called "jdom.jar" in the
> "./build" directory.
>
> That file does not exist (from download), so I assume that a build is
> required.
>
> q2. If the source files are .java, why does a build generate jar files? The
> jdk bin folder has .exe files.
>
>
> I followed instructions to build. The results are shown in the Command
> Prompt below.
>
> q3. Can you determine what the problem may be?
> build.bat > > JDOM Build System > > ----------------- > > Building with classpath c:\Program > files\Java\jdk1.6.0_31\lib\tools.jar;.\lib\ant.jar;.\lib\xml-apis.jar;.\lib\xerces.jar; > > Starting Ant... > > Buildfile: build.xml > > init: > > [echo] -------- JDOM 1.1.2-snap --------- > > prepare: > prepare-src: > > BUILD FAILED > > file:C:file/Program%20Files/Java/jdom/build.xml:133: Directory c:\Program > Files\Java\jdom\build\src creation was not successful for unknown reason > > Total time: 0 seconds From jdom at tuis.net Thu Mar 22 08:31:39 2012 From: jdom at tuis.net (Rolf Lear) Date: Thu, 22 Mar 2012 11:31:39 -0400 Subject: [jdom-interest] In the 11th Hour Message-ID: <4F6B45DB.90908@tuis.net> Hi all. I am running out of things to do with JDOM, and that means that it's ready to release. I have been playing with the performance of critical parts of the code, and have recently committed some improvements. Still, the fact is that JDOM 2.x is ready to go unless something comes up.... ... so, does anyone have any concerns? I would really appreciate it if someone could go through the JavaDoc, check for any issues, etc. I have been through it a couple of times, and each time there's something to fix, and it is a thankless process. But a different set of eyes would be very useful. I have one small performance issue to tackle, but, other than that there's nothing ... left ... to .... do .... I anticipate making one more BETA version, probably Sunday Evening with the performance updates, but that will be the last one before the Easter release. Here's the details on the performance of JDOM 2.x... Rolf >> ======================== I have just pushed through some performance fixes. In essence the fixes restore all performance benchmarks (except one) to be better than the same benchmarks on JDOM 1.x. http://hunterhacker.github.com/jdom/jdom2/performance.html All benchmarks compare apples to apples, are all run on my laptop, with the same data, etc. 
SAX Parsing is as fast as before (slightly faster - less than 5% - parsing times are too erratic to get meaningful times), but note, the benchmark only tests a single parse, and JDOM 2.x has massive improvements in parser reuse, so in general, JDOM 2.x is significantly better. XMLOutputting is much faster than before, and additionally the API is more consistent. Output is at least 15% faster (8ms instead of 10ms) XPath access is significantly faster (about a third faster 15ms instead of 24ms) Creating JDOM content (cloning an entire tree) is slightly faster (7.5%) than before (7.4ms instead of 8ms). Additionally, the memory footprint is 10% less than before (2.06MB instead of 2.26MB) The only current performance 'slip' is the time it takes to scan an entire document and report the Elements in it. This has slipped significantly from 2.3ms to 3.1ms. I have been investigating it, and it appears to be a function of the additional functionality in the DescendantIterator, and the type-safety of the code. I am still working on it though. From jdom at tuis.net Thu Mar 22 11:20:40 2012 From: jdom at tuis.net (Rolf Lear) Date: Thu, 22 Mar 2012 14:20:40 -0400 Subject: [jdom-interest] In the 11th Hour In-Reply-To: <4F6B45DB.90908@tuis.net> References: <4F6B45DB.90908@tuis.net> Message-ID: <4F6B6D78.1040205@tuis.net> On 22/03/2012 11:31 AM, Rolf Lear wrote: > Hi all. > .... > > The only current performance 'slip' is the time it takes to scan an > entire document and report the Elements in it. This has slipped > significantly from 2.3ms to 3.1ms. I have been investigating it, and it > appears to be a function of the additional functionality in the > DescendantIterator, and the type-safety of the code. I am still working > on it though. > And that slip is now fixed, with JDOM 2.x now running in 1.7ms, better than 20% improvement over JDOM 1.x So, in all benchmarked components, JDOM 2.x outperforms JDOM 1.x, and uses less memory. 
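For readers who want to check the percentages against the raw numbers, the arithmetic is simply the reduction relative to the JDOM 1.x figure. A small sketch follows; the class and method names are illustrative only, and the inputs are the timings and memory sizes quoted in the messages above.

```java
final class ImprovementSketch {

    /** Percentage reduction going from 'before' to 'after'; positive means faster or smaller. */
    static double percentReduction(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Figures as quoted in the thread: milliseconds for timings, MB for memory.
        System.out.printf("XML output (10ms -> 8ms):   %.1f%%%n", percentReduction(10.0, 8.0));
        System.out.printf("XPath (24ms -> 15ms):       %.1f%%%n", percentReduction(24.0, 15.0));
        System.out.printf("Clone (8ms -> 7.4ms):       %.1f%%%n", percentReduction(8.0, 7.4));
        System.out.printf("Memory (2.26MB -> 2.06MB):  %.1f%%%n", percentReduction(2.26, 2.06));
        System.out.printf("Scan (2.3ms -> 1.7ms):      %.1f%%%n", percentReduction(2.3, 1.7));
    }
}
```

By this measure the output change from 10ms to 8ms is a 20% reduction, which is consistent with the 'at least 15%' quoted, and the scan change from 2.3ms to 1.7ms is a reduction of about 26%.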
Rolf From mikeb at mitre.org Thu Mar 22 11:47:28 2012 From: mikeb at mitre.org (Brenner, Mike) Date: Thu, 22 Mar 2012 18:47:28 +0000 Subject: [jdom-interest] In the 11th Hour In-Reply-To: <4F6B6D78.1040205@tuis.net> References: <4F6B45DB.90908@tuis.net> <4F6B6D78.1040205@tuis.net> Message-ID: <264449A6A521A14593F58479F46B34EB151F53@IMCMBX03.MITRE.ORG> Hi Rolf, I hope you decide to take the time to post how you accomplished this, to make us all better java programmers. Thanks, Mike Brenner -----Original Message----- From: jdom-interest-bounces at jdom.org [mailto:jdom-interest-bounces at jdom.org] On Behalf Of Rolf Lear Sent: Thursday, March 22, 2012 2:21 PM To: jdom-interest at jdom.org Subject: Re: [jdom-interest] In the 11th Hour On 22/03/2012 11:31 AM, Rolf Lear wrote: > Hi all. > .... > > The only current performance 'slip' is the time it takes to scan an > entire document and report the Elements in it. This has slipped > significantly from 2.3ms to 3.1ms. I have been investigating it, and it > appears to be a function of the additional functionality in the > DescendantIterator, and the type-safety of the code. I am still working > on it though. > And that slip is now fixed, with JDOM 2.x now running in 1.7ms, better than 20% improvement over JDOM 1.x So, in all benchmarked components, JDOM 2.x outperforms JDOM 1.x, and uses less memory. 
Rolf _______________________________________________ To control your jdom-interest membership: http://www.jdom.org/mailman/options/jdom-interest/youraddr at yourhost.com From paul at hoplahup.net Thu Mar 22 13:27:58 2012 From: paul at hoplahup.net (Paul Libbrecht) Date: Thu, 22 Mar 2012 21:27:58 +0100 Subject: [jdom-interest] JDOM Issue #5 - DTD-aware Attribute output In-Reply-To: <7548b7e7900688521a8c8c25fee97a1d@tuis.net> References: <46e2c3a3a56ab2177870a95c4ba4fa98@tuis.net> <7548b7e7900688521a8c8c25fee97a1d@tuis.net> Message-ID: <345B4B65-5537-42AE-B345-26BFE779FCA3@hoplahup.net> Hello list, Rolf has been so kind to show me how JDOM issue #5 can be run. So I ran the following snippet: SAXBuilder builder = new SAXBuilder(XMLReaders.DTDVALIDATING); Document doc = builder.build(new URL(args[0])); Format speconly = Format.getRawFormat(); speconly.setSpecifiedAttributesOnly(true); XMLOutputter xout = new XMLOutputter(speconly); xout.output(doc, System.out); which allows me to parse a JDOM source, make modifications (typically: refactorings), then output with almost no difference. The big advantage to that is that the attributes that were not there... are simply not injected from the DTD. This is enormous in some XML editing tradition which uses implied values a lot. There's two BUT: 1) This currently fails if the validation fails and this is a big problem to me. An example file would be the following: http://svn.activemath.org/LeAM-calculus/LeAM_calculus/oqmath/contin.oqmath which references a DTD nearby. This is a manually edited file. Removing the validation, sadly disables the passing of attribute presence info, it seems. Rolf, is there a way that the attribute presence info is passed but the validation is not stopped? 2) namespace declarations, which are kind of attributes, still resurface. They should be avoided if not present ideally. Doable? 
The approach of Rolf is better than the one I had because mine was simply checking in the DTD if the attribute was provided by it and, if yes, removing its output while in Rolf's approach, an attribute that is there is output if... it was there, simply! Thanks for comments. paul From jdom at tuis.net Thu Mar 22 13:33:15 2012 From: jdom at tuis.net (Rolf Lear) Date: Thu, 22 Mar 2012 16:33:15 -0400 Subject: [jdom-interest] Fwd: Hello and facing a problem with JDOM In-Reply-To: References: Message-ID: <4F6B8C8B.60809@tuis.net> Hi Iman I am unsure of how you added the JDOM 1.1.3 the first way, but, if it works when you add the Jars individually then the Jars themselves must be OK, the problem must be in how they are added to eclipse. I wonder whether you are just copying the JDOM jars over the 'old' 1.1.2 jars, but they have different names... jdom-1.1.3.jar not the old jdom-1.1.2.jar, and that eclipse is looking for the 1.1.2 version of the name..... There are a few things I try to do when I have eclipse build problems: 1. select your project and right-click -> Refresh (F5) to make sure eclipse is up-to-date with everything. 2. open the 'problems' view: Window -> Show View -> Problems You will see in the problems view all the issues that Eclipse has, with the errors at the top. If you select the 'errors' (with red icons), copy them, and paste them in to an e-mail I will have a better idea of what's wrong. Rolf On 22/03/2012 3:21 PM, Jason Hunter wrote: > Forwarding to the jdom-interest mailing list. > > -jh- > > Begin forwarded message: > >> *From: *Iman Zabet > >> *Subject: **Hello and facing a problem with JDOM* >> *Date: *March 22, 2012 3:16:48 PM EDT >> *To: *jhunter at servlets.com >> >> Hello Dear Jason, >> >> First of all, I am Iman Zabet, and thank you about your JDOM project. >> >> Since, I have a problem with the latest version of your product and >> can not find a forum of your website, I wrote this email for your >> considerations. 
>>
>> I have downloaded the last version of JDOM(1.1.3), and intended to
>> substitute its build jar files with the previous version (1.1.2).
>>
>> Unfortunately, when I want to add them in a user-defined library in
>> "Java Build Path" in Eclipse, they are added as (missing) resources
>> and can not be recognized by IDE in workspace. When I switch back
>> again to the previous version everything goes okay!
>>
>> Is this issue about my old 2009 version of eclipse or a common issue?
>>
>> But I found they can be added as individual jar (with "add external
>> jars..." button), not under a user-defined folder with other resources!
>>
>> Thanks in advance,
>> Iman
>
>
>

From jdom at tuis.net Thu Mar 22 15:01:47 2012
From: jdom at tuis.net (Rolf Lear)
Date: Thu, 22 Mar 2012 18:01:47 -0400
Subject: [jdom-interest] In the 11th Hour
In-Reply-To: <264449A6A521A14593F58479F46B34EB151F53@IMCMBX03.MITRE.ORG>
References: <4F6B45DB.90908@tuis.net> <4F6B6D78.1040205@tuis.net> <264449A6A521A14593F58479F46B34EB151F53@IMCMBX03.MITRE.ORG>
Message-ID: <4F6BA14B.5040208@tuis.net>

On 22/03/2012 2:47 PM, Brenner, Mike wrote:
> Hi Rolf,
> I hope you decide to take the time to post how you accomplished this, to make us all better java programmers.
>
> Thanks,
> Mike Brenner
>

Hi Mike.

Thanks... I think.... but I have to admit, I am still learning things too (and have been for years... which I think helps).

It helps that I have spent years doing exactly this sort of thing (high-efficiency Java code) as a regular job. I typically focus more on the memory side of things because memory typically is the first thing to break with Java... when you run out of -Xmx it just dies, but if it runs slow, well, there's always more days to process things in... ;-)

My Java work experience is very useful though... I typically work on 'big' machines, in the finance arena, where setting -Xmx64g is 'routine', and processing tens of millions of records in memory at one time is also common.
Having lots of memory creates bad habits in programmers, and when you develop your applications with small scale data, it is easy to set -Xmx2g, build your system, and then, when you scale the system to big data, it all blows up because you have non-linear scalability. That's sort of where I get involved... ;-)

So, if you take the specific example of the 'Scan' times, I can go through the process I took to get it fast (again).

The first trick is to have realistic expectations of what results you should see. With this code, when I started with JDOM 2.x I was able to take a benchmark of the code, and I have been able to refer back to that. That creates a 'minimum' expectation. If I had not done that then I would not have known there was a problem. I am a believer that performance should be 'regression tested' as much as the actual functionality.

So, having identified the scan time as a problem, I just took the scan benchmark code and wrote a loop around it....

    int cnt = 0;
    long start = System.currentTimeMillis();
    while (true) {
        cnt++;
        int sz = 0;
        for (Content e : doc.getDescendants()) {
            sz++;
        }
        if ((cnt % 1000) == 0) {
            long time = System.currentTimeMillis() - start;
            start = start + time;
            System.out.printf("Looped %d times @ %.3fms each (%d lines)\n", cnt, time / 1000.0, sz);
        }
    }

The challenge is then simply to make the time-per-thousand as small as possible.

Then, while that is running, you launch the trusty $JAVA_HOME/bin/jvisualvm, and you profile the code.

With some tweaking of the settings, you should be looking for three things: CPU Hogs, Memory Hogs, and Tight Loops

- Profile the CPU time, and once the profiling has taken 'some time' (20 seconds or so), you create a snapshot of the profiling. Look at the profile (not the snapshot), and get an idea of the methods that took the most time. Then, in the snapshot, dig down in to it to find out where the code is being called from, why it is called, how often, and where it is spending its time.
  This is a little bit of 'black magic' to identify where the 'hot spot' is. You need to be able to look at the source code, switch to the snapshot, and back again, and so on.
- While you are in this snapshot, look for items that are called very often. Even a small improvement on 'tight' code will make a big difference. (I have spent a lot of time working on the 'Verifier' for example).

Then, once you have identified the 'hot' code you can look at the code, and decide if things could be done differently. Even a small change can make a difference. Once you make a small change, re-run the test loop (no profiling). Check to see whether it is any faster (or slower).

When I start running out of things in the CPU profiles to 'tweak', I start looking at the memory side of things. Often you see a lot of memory allocation in strange patterns. Use the Memory profiler in jvisualvm (make sure you change the setting to enable "Record allocation's stack trace").

Getting back to the 'scan' loops, I tried a number of different changes to get things better. Changing the whole iterator around (making hasNext() do nothing, and putting all the logic in to 'next()') brought it from 3.2ms to about 2.7ms... which was great, but not as good as JDOM 1.x. So, after looking at things some more, I decided to try to replace the LinkedList as a stack. I know this sort of thing makes a difference (from experience). I was somewhat surprised by how much of a difference it made though....

For the record, the reason I chose the DescendantIterator to work on is because I had already gone through this sort of exercise with the ContentList Iterators before (bringing the XPath code from 90ms to 30ms), so I did not think I would find much there. If I was starting 'fresh' I would have looked at the lower levels some more too.

Once I had made a good change, I reran the full 'benchmark' and checked the results.

A more interesting fix was the one on the FormatStack.
I introduced the FormatStack itself to solve a performance problem - having to keep re-calculating the 'state' of the output formats... But, when I ran the XMLOutputter code through the Memory profiler I found a lot of time was spent in the Format.escapeText() method. This is a very tight method call, and I tried really hard to make it faster. I did a few things which made a difference, but, when I changed from 'Raw' format to 'Pretty' format, it was still very slow. I could not find the slow-down until I checked the memory profiler, and identified that a huge amount of char[] memory was being created in the FormatStack.push() method.

Memory is a double-edged sword in Java (and other languages). For every byte you allocate, you also need to collect it. Allocating is slow, and collecting is slow (especially if it is a 'full' collection - single-threaded).

So, as soon as I saw all that memory, and realized that it was coming from 'build the indent string' type string concatenation, I found a way to not do that unless 'things change', so, now the FormatStack reuses the indents. This made a huge impact on performance. Saving the allocation of memory pays for itself three times over: not having to do the calculation of the new value, not having to allocate and store the result, and not having to GC it later.

Additionally, because JDOM has a comprehensive test harness, it is easy to make reliable changes, even if they would otherwise be called 'big'. It becomes 'easy' to play with things and see how it changes.

Yet another interesting one is the 'escapeText()' code in Format.
It used to look like:

    public boolean shouldEscape(char ch) {
        if (bits == 16) {
            if (Verifier.isHighSurrogate(ch)) {
                return true; // Safer this way per http://unicode.org/faq/utf_bom.html#utf8-4
            }
            return false;
        }
        if (bits == 8) {
            if (ch > 255) {
                return true;
            }
            return false;
        }
        if (bits == 7) {
            if (ch > 127) {
                return true;
            }
            return false;
        }
        if (Verifier.isHighSurrogate(ch))
            return true; // Safer this way per http://unicode.org/faq/utf_bom.html#utf8-4

        if (encoder != null) {
            try {
                return !encoder.canEncode(ch);
            } catch (Exception ignored) {
                // ignore problems.
            }
        }
        // Return false if we don't know. This risks not escaping
        // things which should be escaped, but also means people won't
        // start getting loads of unnecessary escapes.
        return false;
    }

That is a 'big' method, especially when it is called for *every* character that is output by JDOM. The thing is that the nature of the output (7bit, 8bit, 16bit, Charset, or default) is known when the Format instance is created. Having to check again for each character is unnecessary. So, I changed the code to have different escape strategies for each type, and replaced the 'Default' with one of a set of them. Now, for example, if you are outputting to a UTF-8 based output, the method is simply:

    public final boolean shouldEscape(final char ch) {
        return Verifier.isHighSurrogate(ch);
    }

And, since I have different instances of the class for different 'shouldEscape()' implementations, the decision is made once for each Format, instead of once for each char.

Summary....

Here are some 'tips' I have come to learn, and they influence my coding style, and how I look at performance problems. Some of them have bad reputations... and people complain about 'optimizing too early', ... but, it works for me.

Use 'final' as liberally as possible. Classes, methods, variables, parameters, everything.

I have learned, over the years, to be very wary of the Collections API when it comes to memory and performance.
They are all extreme memory hogs, and as a result, they are slow... Be careful using them in performance-critical code.

Arrays of primitives are often just as easy to use, and are faster.

While-loops with break/continue statements are faster than conditionals in the loop.

Loops with a constant test are much faster than loops with conditionals... for example:

    for (int i = 0; i < list.size(); i++) {
        ....
    }

the above is fine, but this is better:

    final int len = list.size();
    for (int i = 0; i < len; i++) {
        ...
    }

When you can, do count-down loops instead:

    int idx = list.size();
    while (--idx >= 0) {
        ....
    }

From mikeb at mitre.org Fri Mar 23 05:49:56 2012
From: mikeb at mitre.org (Brenner, Mike)
Date: Fri, 23 Mar 2012 12:49:56 +0000
Subject: [jdom-interest] In the 11th Hour
In-Reply-To: <4F6BA14B.5040208@tuis.net>
References: <4F6B45DB.90908@tuis.net> <4F6B6D78.1040205@tuis.net> <264449A6A521A14593F58479F46B34EB151F53@IMCMBX03.MITRE.ORG> <4F6BA14B.5040208@tuis.net>
Message-ID: <264449A6A521A14593F58479F46B34EB1596AD@IMCMBX03.MITRE.ORG>

Hi Rolf,

Thank you very much for the performance details! I definitely think you made me a better java programmer.

Mike Brenner

From jdom at tuis.net Fri Mar 23 06:21:41 2012
From: jdom at tuis.net (Rolf Lear)
Date: Fri, 23 Mar 2012 09:21:41 -0400
Subject: [jdom-interest] JDOM Issue #5 - DTD-aware Attribute output
In-Reply-To: <345B4B65-5537-42AE-B345-26BFE779FCA3@hoplahup.net>
References: <46e2c3a3a56ab2177870a95c4ba4fa98@tuis.net> <7548b7e7900688521a8c8c25fee97a1d@tuis.net> <345B4B65-5537-42AE-B345-26BFE779FCA3@hoplahup.net>
Message-ID: <4F6C78E5.3060205@tuis.net>

Hi Paul.

If you were wondering why no-one on the list has commented, it may be because you never sent it to the list, just to me ... ;-), so I have CC'd the list for you...

Anyway, I have been looking in to things, and I think the problem is that you have missed a detail in the way the data is processed.

Using your example document:
http://svn.activemath.org/LeAM-calculus/LeAM_calculus/oqmath/contin.oqmath

This document (apart from being 'big'), refers to a single DTD, which, in the case of this document, only really defaults one attribute: 'scheme' on the 'competency' element (which defaults to "PISA").

Now, as far as I know, there are only the following ways to reference content of the DTD:

If you are doing no DTD validation, the DTD will still be accessed to resolve entity references.
But, that is the *only* thing that will be pulled from the DTD.

If you do validation, then the entire DTD is read, and the validation is done, and any attributes defaulted in the DTD will be created in the XML 'Model'.

So, it is my understanding that it is impossible to have 'all the defaulted attributes' without also having done the full DTD Validation.

As it happens, I often use the tool 'xmllint' (available on most unix systems, including linux) to check my understanding, and, I may be wrong on this because xmllint has the argument --dtdattr which appears to do a partial thing of loading the defaulted attrs, but not a full validation...

Anyway, the point is that, using JDOM, and standard SAX parsing, the only time you could have had 'all the defaulted attrs' was when you were doing full validation anyway... and that full validation fails.

So, if you do not do validating, you will not get the 'scheme' attributes, and you will not output the scheme attributes (you do not have them to output...).

If you do validating, then you have the scheme attributes, and then you can now choose to ignore them on the output with the new Format setting.

Your particular problem is confusing to me, and there must be something I am missing.... I can't figure out why you think you are getting all the defaulted attributes when it is clear you are not validating...

So, that is my first issue, and I think it means that you are confused too ;-)

The second issue with the namespace declarations is also confusing to me. In your example document, every single namespace declaration is essential.... not a single one is 'redundant'.

Is it possible that it is just a bad example?

Anyway, at the worst possible case, I have a hack that would probably make you happy, but makes me cringe.... I would rather understand your problem properly before I suggest it.

Thanks

Rolf

From paul at hoplahup.net Fri Mar 23 06:40:39 2012
From: paul at hoplahup.net (Paul Libbrecht)
Date: Fri, 23 Mar 2012 14:40:39 +0100
Subject: [jdom-interest] JDOM Issue #5 - DTD-aware Attribute output
In-Reply-To: <4F6C78E5.3060205@tuis.net>
References: <46e2c3a3a56ab2177870a95c4ba4fa98@tuis.net> <7548b7e7900688521a8c8c25fee97a1d@tuis.net> <345B4B65-5537-42AE-B345-26BFE779FCA3@hoplahup.net> <4F6C78E5.3060205@tuis.net>
Message-ID: <9BC2BF18-F7C3-4F53-AECE-7C8984048AB3@hoplahup.net>

Rolf,

I think your assumption is wrong: I remember Michael Kay had a long FAQ entry justifying why a DTD was read even though validation was not activated (for Saxon Aelfred which we have extensively used) and indeed it is my experience that any parser, Xerces included, parses the DTD completely (including included entities as is the case here) and injects all default values of attributes (including namespaces) without it being validating.

Validating implies breaking somehow after an error (the first or the last?).

To summarize I see the following modes:
- ignore the DTD completely (no parser does this unless explicitly told to)
- use DTD (and inclusions) for all default values
- use DTD and report all errors but keep going
- use DTD and break at first error

My understanding is that my SAXBuilder.build was throwing an exception if I activated DTD validation (so the last two possibilities) thus making it impossible to obtain a good jdom Document object from a slightly invalid document.

paul

PS: sorry for the mailing-fuss, I thought I had sent it to the list, only realizing a bit later that jdom at tuis.net was not... the list...

From jdom at tuis.net Fri Mar 23 07:32:56 2012
From: jdom at tuis.net (Rolf Lear)
Date: Fri, 23 Mar 2012 10:32:56 -0400
Subject: [jdom-interest] JDOM Issue #5 - DTD-aware Attribute output
In-Reply-To: <9BC2BF18-F7C3-4F53-AECE-7C8984048AB3@hoplahup.net>
References: <46e2c3a3a56ab2177870a95c4ba4fa98@tuis.net> <7548b7e7900688521a8c8c25fee97a1d@tuis.net> <345B4B65-5537-42AE-B345-26BFE779FCA3@hoplahup.net> <4F6C78E5.3060205@tuis.net> <9BC2BF18-F7C3-4F53-AECE-7C8984048AB3@hoplahup.net>
Message-ID: <4F6C8998.8070005@tuis.net>

Hi Paul, all.

So, I will have to 'eat crow'... the Xerces parser will apply defaulted values even if the DTD is completely broken.... which is odd (note the xxx/yyy discrepancy and that 'yyy' is not even declared an Element!).

    String xml = " ] >";
    Document doc = builder.build(new StringReader(xml));
    xout.output(doc, System.out);

gives .... but, when I test it, the new JDOM Format feature works...

    Format speconly = Format.getPrettyFormat();
    speconly.setSpecifiedAttributesOnly(true);
    XMLOutputter xout = new XMLOutputter(speconly);
    xout.output(doc, System.out);

gives

So, now I am more confused... Xerces will apply a completely broken DTD to a document (even the root element name is wrong). Further, it will apply the default attribute values, and it will have the right 'flags' on the values when it tells JDOM about them, and JDOM will flag the attribute as 'not specified', and will ignore it when outputting the XML (with the correct flag set on the Format instance).
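The behaviour described above can be reproduced without JDOM. The following sketch uses the JDK's bundled SAX parser with validation switched off and asks org.xml.sax.ext.Attributes2 whether each attribute was specified in the document or supplied from the DTD default. The class name, the inline DTD and the sample document are hypothetical (they only borrow the 'scheme' and 'competency' names from the example file), and the result depends on the parser exposing Attributes2, which the parser bundled with the JDK is assumed to do.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.ext.Attributes2;
import org.xml.sax.helpers.DefaultHandler;

class SpecifiedFlagSketch {

    /** Returns "element/@attribute" for each attribute that came from a DTD default. */
    static List<String> defaultedAttributes(String xml) {
        final List<String> defaulted = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (!(atts instanceof Attributes2)) {
                    return; // this parser does not expose the 'specified' flag
                }
                Attributes2 atts2 = (Attributes2) atts;
                for (int i = 0; i < atts2.getLength(); i++) {
                    if (!atts2.isSpecified(i)) {
                        defaulted.add(qName + "/@" + atts2.getQName(i));
                    }
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false); // no DTD validation requested
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return defaulted;
    }

    public static void main(String[] args) {
        // Made-up document: 'scheme' has a DTD default, 'level' does not.
        String xml = "<!DOCTYPE doc [\n"
                + "<!ELEMENT doc (competency*)>\n"
                + "<!ELEMENT competency EMPTY>\n"
                + "<!ATTLIST competency scheme CDATA \"PISA\" level CDATA #IMPLIED>\n"
                + "]>\n"
                + "<doc><competency level=\"2\"/><competency scheme=\"custom\"/></doc>";
        System.out.println(defaultedAttributes(xml));
    }
}
```

On a parser that applies DTD defaults in non-validating mode, only the first competency element's scheme attribute should be reported as defaulted; the second one carries an explicit value and is therefore 'specified'.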
In fact, going back to the original example, I have the following code: public static void main(String[] args) throws JDOMException, IOException { String xml = "http://svn.activemath.org/LeAM-calculus/LeAM_calculus/oqmath/contin.oqmath"; SAXBuilder builder = new SAXBuilder(); Document doc = builder.build(xml); for (Element e : doc.getDescendants(Filters.element())) { if (e.hasAttributes()) { for (Attribute a : e.getAttributes()) { if (!a.isSpecified()) { System.out.println("Attribute was defaulted " + a); } } } } Format speconly = Format.getPrettyFormat(); speconly.setSpecifiedAttributesOnly(true); XMLOutputter xout = new XMLOutputter(speconly); xout.output(doc, System.out); } And it does exactly what you want.... (except for the namespaces). Rolf On 23/03/2012 9:40 AM, Paul Libbrecht wrote: > Rolf, > > I think your assumption is wrong: I remember Michael Kay had a long FAQ entry about justifying why a DTD was read even though validation was not activated (for Saxon Aelfred which we have extensively used) and indeed it is my experience that any parser, Xerces included, parses the DTD completely (including included entities as is the case here) and injects all default values of attributes (including namespaces) without it being validating. > > Validating implies breaking somehow after an error (the first or the last?). > > To summarize I see the following modes: > - ignore the DTD completely (no parser does this unless explicitly told it) > - use DTD (and inclusions) for all default values > - use DTD and report all errors but keep doing > - use DTD and break at first error > > My understanding is that my SAXBuilder.build was throwing an exception if I activated DTD validation (so the last two possibilities) thus making it impossible obtain a good jdom Document object form a slightly invalid document. > > paul > > PS: sorry for the mailing-fuss, I thought I sent it to the list a bit later realizing that jdom at tuis.net was not... the list... > > > Le 23 mars 2012 ? 
14:21, Rolf Lear a ?crit : > From patrick.dowler at nrc-cnrc.gc.ca Fri Mar 23 12:25:20 2012 From: patrick.dowler at nrc-cnrc.gc.ca (Patrick Dowler) Date: Fri, 23 Mar 2012 12:25:20 -0700 Subject: [jdom-interest] JDOM Issue #5 - DTD-aware Attribute output In-Reply-To: <9BC2BF18-F7C3-4F53-AECE-7C8984048AB3@hoplahup.net> References: <4F6C78E5.3060205@tuis.net> <9BC2BF18-F7C3-4F53-AECE-7C8984048AB3@hoplahup.net> Message-ID: <6377554.T6PK7Kbu0N@foli> On 2012-03-23 09:40:39 Paul Libbrecht wrote: > my experience that any parser, Xerces included, parses the DTD completely > (including included entities as is the case here) and injects all default > values of attributes (including namespaces) without it being validating. We are using JDOM-1.1 and Xerces and my experience that with an XML schema that provides default values you only get the defaults when doing validation. This is XMLSchema, not DTD, so it is possible that the rules are different... it is also possible that xerces can be configured in more ways... -- Patrick Dowler Tel/T?l: (250) 363-0044 Canadian Astronomy Data Centre National Research Council Canada 5071 West Saanich Road Victoria, BC V9E 2M7 Centre canadien de donnees astronomiques Conseil national de recherches Canada 5071, chemin West Saanich Victoria (C.-B.) V9E 2M7 From jdom at tuis.net Mon Mar 26 08:20:03 2012 From: jdom at tuis.net (Rolf Lear) Date: Mon, 26 Mar 2012 11:20:03 -0400 Subject: [jdom-interest] JDOM Issue #5 - DTD-aware Attribute output In-Reply-To: <4F6C8998.8070005@tuis.net> References: <46e2c3a3a56ab2177870a95c4ba4fa98@tuis.net> <7548b7e7900688521a8c8c25fee97a1d@tuis.net> <345B4B65-5537-42AE-B345-26BFE779FCA3@hoplahup.net> <4F6C78E5.3060205@tuis.net> <9BC2BF18-F7C3-4F53-AECE-7C8984048AB3@hoplahup.net> <4F6C8998.8070005@tuis.net> Message-ID: <4d057bd58469f9ae1823893196aba722@tuis.net> Hi Paul, all. Just checking that we have things 'straight'.... since I have not heard back from you. Three things... 1. 
I said that the SAX parser will ignore the DTD unless you 'DTD-validate'... but I was wrong: (at least) the Xerces SAX parser will apply the default attributes.

2. You said "Removing the validation, sadly disables the passing of attribute presence info, it seems." I think you were saying that because I had said it to you before... but it is wrong, see above. Do you have 'evidence' that it is true, though? Or are you basing this on what I told you before?

3. It seems (at least with the Xerces 2.11 parser) that using a non-validating SAXBuilder will produce the exact results you want (with the exception of Namespace declarations). Can you confirm this?

I have been thinking about the Namespace problem, and I don't think that there is a 'good' solution on the 'output' side of things, but there is value in having a tool that:
- scans the entire 'Element' tree
- 'explicitly' declares namespaces at the highest common node in the tree

This 'tool' will thus 'declare' the namespace at a 'high' level, and the outputter will not need to re-declare them at the lower levels... which, I would guess, would remove many low-level Namespace declarations and replace them with a single high-level declaration. This would 'solve' your problem in one sense.

Another alternative for you would be to manually remove the not-specified attributes just prior to output:

    for (Element e : doc.getDescendants(Filters.element())) {
        if (e.hasAttributes()) {
            // A typed Iterator is needed so that next() returns an Attribute
            // and so that remove() can be called safely while iterating.
            for (Iterator<Attribute> it = e.getAttributes().iterator(); it.hasNext();) {
                Attribute a = it.next();
                if (!a.isSpecified()) {
                    it.remove();
                }
            }
        }
    }

If you physically remove the (default) attributes from the document, then they will not affect the namespace calculations for the output either.
Rolf > In fact, going back to the original example, I have the following code: > > public static void main(String[] args) throws JDOMException, IOException { > String xml = > "http://svn.activemath.org/LeAM-calculus/LeAM_calculus/oqmath/contin.oqmath"; > SAXBuilder builder = new SAXBuilder(); > Document doc = builder.build(xml); > for (Element e : doc.getDescendants(Filters.element())) { > if (e.hasAttributes()) { > for (Attribute a : e.getAttributes()) { > if (!a.isSpecified()) { > System.out.println("Attribute was defaulted " + a); > } > } > } > } > Format speconly = Format.getPrettyFormat(); > speconly.setSpecifiedAttributesOnly(true); > XMLOutputter xout = new XMLOutputter(speconly); > xout.output(doc, System.out); > } > > And it does exactly what you want.... (except for the namespaces). > > Rolf > From jdom at tuis.net Mon Mar 26 19:53:41 2012 From: jdom at tuis.net (Rolf Lear) Date: Mon, 26 Mar 2012 22:53:41 -0400 Subject: [jdom-interest] JOM 2.x BETA Release 5 - the last BETA Message-ID: <4F712BB5.9030405@tuis.net> Hi all. I have just pushed out the last BETA release for JDOM 2.x As with previous BETA releases, the normal pages are: Downloads: https://github.com/hunterhacker/jdom/downloads The Release: https://github.com/downloads/hunterhacker/jdom/jdom2-0.0.5-BETA.zip The APIDocs: http://hunterhacker.github.com/jdom/jdom2/apidocs/index.html The Code Coverage: http://hunterhacker.github.com/jdom/jdom2/coverage/index.html The JUnit Report: http://hunterhacker.github.com/jdom/jdom2/junit.report/index.html Finally, read up on JDOM 2.x here: https://github.com/hunterhacker/jdom/wiki/JDOM-2.0 The changes from BETA-4 are mostly performance and 'Housekeeping'. Have a look at the performance page: http://hunterhacker.github.com/jdom/jdom2/performance.html There are no anticipated changes to happen before the final release. 
When the final release happens, the differences will be:
- primary release to www.jdom.org, and also to github and maven
- will be called jdom-2.0.0.jar
- will be in the org.jdom group, and jdom artifact on maven.

My focus for the next week or so will be updating the jdom.org pages to reflect the new release. When it is ready to go I will be coordinating with Jason to publish the pages at jdom.org. The timing of this is uncertain, and the official JDOM 2.0.0 release may happen any time in (hopefully early) April.

Happy coding

Rolf

From jdom at tuis.net Tue Mar 27 05:48:08 2012
From: jdom at tuis.net (Rolf Lear)
Date: Tue, 27 Mar 2012 08:48:08 -0400
Subject: [jdom-interest] Previous 'Downloads'
Message-ID:

Hi All.

I am about to go through and purge some of the (very) outdated downloads from the github site. I intend to remove all but the BETA releases (and the dev jars zip file).

I cannot think of any reason to keep the older files hanging around, and, unless someone can come up with a good reason within the day, I will purge them tomorrow (28th).

Rolf

From paul at hoplahup.net Tue Mar 27 06:11:15 2012
From: paul at hoplahup.net (Paul Libbrecht)
Date: Tue, 27 Mar 2012 15:11:15 +0200
Subject: [jdom-interest] Previous 'Downloads'
In-Reply-To: References:
Message-ID: <9B45D8DA-9DFC-4C51-84FB-9215F9BA0CF9@hoplahup.net>

Not any release or?

paul

On 27 March 2012 at 14:48, Rolf Lear wrote:

>
> Hi All.
>
> I am about to go through and purge some of the (very) outdated downloads
> from the github site. I intend to remove all but the BETA releases (and the
> dev jars zip file).
>
> I cannot think of any reason to keep the older files hanging around, and,
> unless someone can come up with a good reason within the day, I will purge
> them tomorrow (28th).
> > Rolf > _______________________________________________ > To control your jdom-interest membership: > http://www.jdom.org/mailman/options/jdom-interest/youraddr at yourhost.com From paul at hoplahup.net Tue Mar 27 06:17:18 2012 From: paul at hoplahup.net (Paul Libbrecht) Date: Tue, 27 Mar 2012 15:17:18 +0200 Subject: [jdom-interest] JDOM Issue #5 - DTD-aware Attribute output In-Reply-To: <4d057bd58469f9ae1823893196aba722@tuis.net> References: <46e2c3a3a56ab2177870a95c4ba4fa98@tuis.net> <7548b7e7900688521a8c8c25fee97a1d@tuis.net> <345B4B65-5537-42AE-B345-26BFE779FCA3@hoplahup.net> <4F6C78E5.3060205@tuis.net> <9BC2BF18-F7C3-4F53-AECE-7C8984048AB3@hoplahup.net> <4F6C8998.8070005@tuis.net> <4d057bd58469f9ae1823893196aba722@tuis.net> Message-ID: <7CCD199C-906B-4E26-AF56-80F525B527B1@hoplahup.net> Rolf, Le 26 mars 2012 ? 17:20, Rolf Lear a ?crit : > Just checking that we have things 'straight'.... since I have not heard > back from you. I'm sorry, I'm swamped and have to accept it. > Three things... > 1. I said that SAX parser will ignore the DTD unless you 'DTD-validate'... > but I was wrong, (at least) Xerces SAX parser will apply the default > attribtues. And Saxon AElfred. Xerces is the default in all java distributions of Sun and Apple, or? (Sun's Xerces that is) > 2. You said "Removing the validation, sadly disables the passing of > attribute presence info, it seems." which I think you were saying because I > had said that before to you.... but this is wrong .. .see above. Do you > have 'evidence' that this is true though? Or are you basing this on what I > told you before...? The example I provided showed it to me: passing no parameter to new SAXBuilder will avoid, with Xerces' default, the passing of any location info. I'm sorry, I'll have to accept jDOM 2.0 to go out without evaluating a full solution... paul > 3. It seems (at least with the Xerces 2.11 parser) that using > non-validating SAXBuilder will produce the exact results you want... 
(with > the exception of Namespace declarations). Can you confirm this? > > I have been thinking about the Namespace problem, and, I don't think that > there is a 'good' solution on the 'output' side of things, but, there is > value in having a tool that: > - scans the entire 'Element' tree > - 'explicitly' declares namespaces at the highest common node in the tree > > This 'tool' will thus 'declare' the namespace at a 'high' level, and the > outputter will not need to re-declare them at the lower levels.... which, I > would guess, would result in removing many low-level Namespace > declarations, and replace them with a single high-level declaration. This > would 'solve' your problem in one sense. > > Another alternative for you would be to manually remove the not-specified > attributes just prior to output: > > for (Element e : doc.getDescendants(Filters.element())) { > if (e.hasAttributes()) { > for (Iterator it = e.getAttributes().iterator(); > it.hasNext();) { > Attribute a = it.next(); > if (!a.isSpecified()) { > it.remove(); > } > } > } > } > > if you physically remove the (default) attributes from the document, then > they will not affect the namespace calculations for the output either. 
> > Rolf > > >> In fact, going back to the original example, I have the following code: >> >> public static void main(String[] args) throws JDOMException, > IOException { >> String xml = >> > "http://svn.activemath.org/LeAM-calculus/LeAM_calculus/oqmath/contin.oqmath"; >> SAXBuilder builder = new SAXBuilder(); >> Document doc = builder.build(xml); >> for (Element e : doc.getDescendants(Filters.element())) { >> if (e.hasAttributes()) { >> for (Attribute a : e.getAttributes()) { >> if (!a.isSpecified()) { >> System.out.println("Attribute was defaulted " + a); >> } >> } >> } >> } >> Format speconly = Format.getPrettyFormat(); >> speconly.setSpecifiedAttributesOnly(true); >> XMLOutputter xout = new XMLOutputter(speconly); >> xout.output(doc, System.out); >> } >> >> And it does exactly what you want.... (except for the namespaces). >> >> Rolf >> From jdom at tuis.net Tue Mar 27 06:17:13 2012 From: jdom at tuis.net (Rolf Lear) Date: Tue, 27 Mar 2012 09:17:13 -0400 Subject: [jdom-interest] Previous 'Downloads' In-Reply-To: <9B45D8DA-9DFC-4C51-84FB-9215F9BA0CF9@hoplahup.net> References: <9B45D8DA-9DFC-4C51-84FB-9215F9BA0CF9@hoplahup.net> Message-ID: Oh, I mean the 'zip' files on https://github.com/hunterhacker/jdom/downloads page. So, I intend to remove the files like: jdom-2.x-2011.12.07.09.37.zip and other files that are *not* BETA releases. In fact, I think the complete list of files to be removed is: jdom-1.1.2.hf1.zip jdom-2.x-2012.02.02.22.29.zip jdom-2.x-2012.01.* jdom-2.x-2011.* There should be no reason for anyone to want these files, and they can be rebuilt from the git history anyway. Rolf On Tue, 27 Mar 2012 15:11:15 +0200, Paul Libbrecht wrote: > Not any release or? > > paul > > > Le 27 mars 2012 ? 14:48, Rolf Lear a ?crit : > >> >> Hi All. >> >> I am about to go through and purge some of the (very) outdated downloads >> from the github site. I intend to remove all but the BETA releases (and >> the >> dev jars zip file). 
>> >> I cannot think of any reason to keep the older files hanging around, and, >> unless someone can come up with a good reason within the day, I will >> purge >> them tomorrow (28th). >> >> Rolf >> _______________________________________________ >> To control your jdom-interest membership: >> http://www.jdom.org/mailman/options/jdom-interest/youraddr at yourhost.com From curoli at gmail.com Thu Mar 29 06:23:36 2012 From: curoli at gmail.com (Oliver Ruebenacker) Date: Thu, 29 Mar 2012 09:23:36 -0400 Subject: [jdom-interest] Simple xhtml/entity resolver? Message-ID: Hello, I need a simple way to convert some XHTML fragments, provided as a JDOM Element, into plain text. I am willing to ignore most HTML tags and consider only the most commonly used predefined entities. In JDOM, an entity reference has a name, a public id and a system id. I think I know what the named means, for named entities. But what about numeric entities, how do I get the code point? And what are public id and system id? Thanks! Take care Oliver -- Oliver Ruebenacker, Computational Cell Biologist Virtual Cell (http://vcell.org) SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org) http://www.oliver.curiousworld.org From jdom at tuis.net Thu Mar 29 06:46:51 2012 From: jdom at tuis.net (Rolf Lear) Date: Thu, 29 Mar 2012 09:46:51 -0400 Subject: [jdom-interest] =?utf-8?q?Simple_xhtml/entity_resolver=3F?= In-Reply-To: References: Message-ID: <7b0c0e81a2cc0a5076eee8949ffd565f@tuis.net> Hi Oliver. If you already have the XHTML content as JDOM Elements, then you should be able to (just) do: XMLOutputter xout = new XMLOutputter(); String fragment = xout.outputString(element); If you want to change the format of the output (indenting, etc.), you can add a 'Format' to the XMLOutputter with: XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat()); String fragment = xout.outputString(element); I think you may be chasing a red-herring with the Entity References. 
The EntityRef code is a 'CYA' implementation, but, in reality, the SystemID and PublicID are never going to be needed in regular usage.

The only place I know of where you have entity references is if you specify that your input parser should ignore entity-reference lookups when parsing; in JDOM you will then end up with an EntityRef instead of its 'underlying' text.

Rolf

On Thu, 29 Mar 2012 09:23:36 -0400, Oliver Ruebenacker wrote:
> Hello,
>
> I need a simple way to convert some XHTML fragments, provided as a
> JDOM Element, into plain text. I am willing to ignore most HTML tags
> and consider only the most commonly used predefined entities.
>
> In JDOM, an entity reference has a name, a public id and a system
> id. I think I know what the name means, for named entities. But what
> about numeric entities, how do I get the code point? And what are
> public id and system id?
>
> Thanks!
>
> Take care
> Oliver

From curoli at gmail.com Thu Mar 29 07:51:47 2012
From: curoli at gmail.com (Oliver Ruebenacker)
Date: Thu, 29 Mar 2012 10:51:47 -0400
Subject: [jdom-interest] Fwd: Simple xhtml/entity resolver?
In-Reply-To: References: <7b0c0e81a2cc0a5076eee8949ffd565f@tuis.net>
Message-ID:

Hello,

(forwarding this to the list, as I accidentally only sent it to Rolf)

I think there is a misunderstanding. I don't want to output as XML. I want to render the XHTML as text like a very primitive browser would display it.

I'm building a String by traversing the tree by calling Element.getContent(). For example, a © can be encoded in XML as "&copy;". Presumably, the Element tree would contain an EntityRef with name "copy". But what if an XML document contains "&169;" or "&x00A9;"? What would the EntityRef object look like?

Thanks!

Take care
Oliver

On Thu, Mar 29, 2012 at 9:46 AM, Rolf Lear wrote:
>
> Hi Oliver.
> > If you already have the XHTML content as JDOM Elements, then you should be > able to (just) do: > > XMLOutputter xout = new XMLOutputter(); > String fragment = xout.outputString(element); > > If you want to change the format of the output (indenting, etc.), you can > add a 'Format' to the XMLOutputter with: > > XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat()); > String fragment = xout.outputString(element); > > > I think you may be chasing a red-herring with the Entity References. > > The EntityRef code is a 'CYA' implementation, but, in reality, the > SystemID and PublicID are never going to be needed in regular usage. > > The only place I know of where you have entity references is if you > specify your input parser should ignore entity-reference lookups when > parsing, and in JDOM you will end up with an EntityRef instead of it's > 'underlying' text. > > Rolf > > > On Thu, 29 Mar 2012 09:23:36 -0400, Oliver Ruebenacker > wrote: >> Hello, >> >> ? I need a simple way to convert some XHTML fragments, provided as a >> JDOM Element, into plain text. I am willing to ignore most HTML tags >> and consider only the most commonly used predefined entities. >> >> ? In JDOM, an entity reference has a name, a public id and a system >> id. I think I know what the named means, for named entities. But what >> about numeric entities, how do I get the code point? And what are >> public id and system id? >> >> ? Thanks! >> >> ? ? ?Take care >> ? ? 
Oliver

--
Oliver Ruebenacker, Computational Cell Biologist
Virtual Cell (http://vcell.org)
SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org)
http://www.oliver.curiousworld.org

From jdom at tuis.net Thu Mar 29 08:25:54 2012
From: jdom at tuis.net (Rolf Lear)
Date: Thu, 29 Mar 2012 11:25:54 -0400
Subject: [jdom-interest] Fwd: Re: Simple xhtml/entity resolver?
Message-ID: <5061d45d2e24195c92e7f257a1add197@tuis.net>

and I replied to Oliver only too... hmmm

Rolf

-------- Original Message --------
Subject: Re: [jdom-interest] Simple xhtml/entity resolver?
Date: Thu, 29 Mar 2012 11:15:26 -0400
From: Rolf Lear
To: Oliver Ruebenacker

Ahh,

In order to discuss the 'entity' processing, you need to be careful about how you specify the 'location' of the data...

For example, there are three basic 'locations' for content when we consider JDOM: the 'unparsed XML', the 'JDOM Document', and the 'output'.

Also, when you say &169; do you mean &169; or do you actually mean &#169;? There is a *big* difference....

When you parse 'unparsed XML' the parser will always translate character escapes to the actual character; for example, &#169; will become ©. JDOM will never see the '&#169;'. If, for example, in the 'unparsed XML' file, you had <root att="&#169;" />, then, when parsed and given to JDOM, you will always have the single char © as root.getAttributeValue("att").

When you output that value from JDOM, JDOM will use the 'charset' of the output destination to determine whether the © char needs to be escaped. For example, the following 'program':

    SAXBuilder builder = new SAXBuilder();
    Document doc = builder.build(new StringReader("<root att=\"&#169;\" />"));
    System.out.println(doc.getRootElement().getAttributeValue("att"));
    XMLOutputter xout = new XMLOutputter();
    xout.output(doc, System.out);

outputs:

    ©
    <?xml version="1.0" encoding="UTF-8"?>
    <root att="©" />

Having said that, you must understand that JDOM *expects* to be given 'un-escaped' data. If you tell JDOM to set the value for attribute 'attb' to the String '&#169;' then JDOM will do that, and, when you output the value, it will escape the '&' for you so that the value '&#169;' is preserved.... for example, if we add the following lines to the above program:

    doc.getRootElement().setAttribute("attb", "&#169;");
    xout.output(doc, System.out);

the output is now:

    ©
    <?xml version="1.0" encoding="UTF-8"?>
    <root att="©" />
    <?xml version="1.0" encoding="UTF-8"?>
    <root att="©" attb="&amp;#169;" />

So, making sure that we have a good understanding of the concept of character escapes, you must realize that they are *not* EntityReferences... you should never see any JDOM object representing a character escape.

On the other hand, if you had the entity reference '&copy;' in your 'unparsed XML', the parser (by default) should have replaced it with the appropriate character(s) when the document was parsed. Again, JDOM will see the character © and not the reference '&copy;'. A 'default' parser will fail to parse a document if it has references that cannot be resolved. If you change the default parse behaviour (to remove the entity-resolve process), then instead of the © character, you will have a JDOM EntityRef with the name 'copy'.

In other words, you have to go out of your way to create EntityRef instances. If you want to ignore the processes the parser uses to resolve entities, then you will need to scan the JDOM tree, look for EntityRefs, and manually replace them with the appropriate Text.... using whatever strategy you want to use.

In a more general answer to your original question 'how do I basically replace a browser', though, what you really want to be doing is a Transform on your JDOM document, to create an appropriate output for your needs. The transform you use will depend on what results you want. Have a look at the XSLTransformer class in JDOM, as well as the various resources on the net for XSL Transformations.
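The point about character escapes can be checked with nothing but the JDK's own parser. The following is only a sketch: the class and method names are made up for illustration, and it uses the W3C DOM API rather than JDOM, on the assumption that the same parser-level behaviour feeds both. It shows that decimal and hexadecimal character references arrive as the plain character (so the code point is simply read from the String), while an escaped ampersand leaves the literal text behind.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CharacterReferenceCheck {

    // Parse a small document and return the value of the root element's 'att' attribute.
    static String attValue(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("att");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample", e);
        }
    }

    public static void main(String[] args) {
        String decimal = attValue("<root att=\"&#169;\"/>");      // decimal character reference
        String hex = attValue("<root att=\"&#xA9;\"/>");          // hexadecimal character reference
        String literal = attValue("<root att=\"&amp;#169;\"/>");  // escaped ampersand, not a reference

        // The parser has already resolved the references; the code point is read from the String.
        System.out.println("decimal code point: " + decimal.codePointAt(0));
        System.out.println("hex code point:     " + hex.codePointAt(0));
        System.out.println("literal length:     " + literal.length());
    }
}
```

So for the 'how do I get the code point' part of the question, String.codePointAt() on the text handed over by the parser should be all that is needed; there is no reference object left to inspect.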
Rolf On Thu, 29 Mar 2012 10:28:26 -0400, Oliver Ruebenacker wrote: > Hello Rolf, > > I think there is a misunderstanding. I don't want to output as XML. > I want to render the XHTML as text like a very primitive browser would > display it. > > I'm building a String by traversing the tree by calling > Element.getContent(). For example, a ? can be encoded in XML as > "?". Presumably, the Element tree would contain an EntityRef with > name "copy". But what if an XML document contains "&169;" or > "&x00A9;"? How would the EntityRef object look like? > > Thanks! > > Take care > Oliver > > On Thu, Mar 29, 2012 at 9:46 AM, Rolf Lear wrote: >> >> Hi Oliver. >> >> If you already have the XHTML content as JDOM Elements, then you should >> be >> able to (just) do: >> >> XMLOutputter xout = new XMLOutputter(); >> String fragment = xout.outputString(element); >> >> If you want to change the format of the output (indenting, etc.), you can >> add a 'Format' to the XMLOutputter with: >> >> XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat()); >> String fragment = xout.outputString(element); >> >> >> I think you may be chasing a red-herring with the Entity References. >> >> The EntityRef code is a 'CYA' implementation, but, in reality, the >> SystemID and PublicID are never going to be needed in regular usage. >> >> The only place I know of where you have entity references is if you >> specify your input parser should ignore entity-reference lookups when >> parsing, and in JDOM you will end up with an EntityRef instead of it's >> 'underlying' text. 
>> >> Rolf >> >> From jdom at tuis.net Thu Mar 29 08:35:44 2012 From: jdom at tuis.net (Rolf Lear) Date: Thu, 29 Mar 2012 11:35:44 -0400 Subject: [jdom-interest] =?utf-8?q?Simple_xhtml/entity_resolver=3F?= In-Reply-To: <96cbb21ac35d0ff0126a8358085bfd74@tuis.net> References: <7b0c0e81a2cc0a5076eee8949ffd565f@tuis.net> <96cbb21ac35d0ff0126a8358085bfd74@tuis.net> Message-ID: <230a1b53e5a9137aef3464068d1e5756@tuis.net> Discussing character escapes using a web-based mail client is probably not the smartest thing I have done... Especially complicated when replies are made, etc. Sorry, but in the 'second example', should read: > Having said that, you must understand that JDOM *expects* to be given > 'un-escaped' data. If you tell JDOM to set the value for attribute 'attb' > to the (expanded with a space to preserve formatting) String '& #169;' then JDOM will do that, and, when you output the > value, it will escape the '&' for you so that the value (expanded with a space to preserve formatting) '& #169;' is > preserved.... for example, if we add the following lines to the above > program: > > doc.getRootElement().setAttribute("attb", "& #169;"); // expanded with a space > xout.output(doc, System.out); > > the output is now: > > ? > > > > note how I have expanded the char escapes with a space to preserve formatting... this may just make things more complicated... I don't know. Rolf On Thu, 29 Mar 2012 11:15:26 -0400, Rolf Lear wrote: > Ahh, > > In order to discuss the 'entity' processing, you need to be careful about > how you specify the 'location' of the data... > > For example, there are three basic 'locations' for content when we > consider JDOM, the 'unparsed XML', the 'JDOM Document', and the 'output'. > > Also, when you say &169; do you mean &169; or do you actually mean ? > ? There is a *big* difference.... > > When you parse 'unparsed XML' the parser will always translate character > escapes to the actual character, for example, ? will become ?. 
JDOM > will never see the '?'. If, for example, in the 'unparsed XML' file, > you had , then, when parsed and given to JDOM, you will > always have the single char ? as root.getAttributeValue("att"). > > When you output that value from JDOM, JDOM will use the 'charset' of the > output destination to determine whether the ? char needs to be escaped. > For > example, the following 'program': > > SAXBuilder builder = new SAXBuilder(); > Document doc = builder.build(new StringReader("")); > System.out.println(doc.getRootElement().getAttributeValue("att")); > XMLOutputter xout = new XMLOutputter(); > xout.output(doc, System.out); > > outputs: > > ? > > > > > Having said that, you must understand that JDOM *expects* to be given > 'un-escaped' data. If you tell JDOM to set the value for attribute 'attb' > to the String '?' then JDOM will do that, and, when you output the > value, it will escape the '&' for you so that the value '?' is > preserved.... for example, if we add the following lines to the above > program: > > doc.getRootElement().setAttribute("attb", "?"); > xout.output(doc, System.out); > > the output is now: > > ? > > > > > > > So, making sure that we have a good understanding of the concept of > character escapes, you must realize that they are *not* EntityReferences... > you should never see any JDOM object representing a character escape. > > On the other hand, if you had the entity reference '?' in your > 'unparsed XML', the parser (by default) should have replaced it with the > appropriate character(s) when the document was parsed. Again, JDOM will see > the character ? and not the reference '?'. A 'default' parser will > fail to parse a document if it has references that cannot be resolved. If > you change the default parse behaviour (to remove the entity-resolve > process), then instead of the ? character, you will have a JDOM EntityRef > with the name 'copy'. > > In other words, you have to go out of your way to create EntityRef > instances. 
If you want to ignore the processes the parser uses to resolve > entities, then you will need to scan the JDOM tree, look for EntityRefs, > and manually replace them with the appropriate Text.... using whatever > strategy you want to use. > > > > In a more general answer to your original question 'how do I basically > replace a browser', though, what you really want to be doing is a Transform > on your JDOM document, to create an appropriate output for your needs. The > transform you use will depend on what results you want. Have a look at > XSLTransform class in JDOM, as well as the various resources on the net for > XSL Transformations. > > > Rolf > > > > On Thu, 29 Mar 2012 10:28:26 -0400, Oliver Ruebenacker > wrote: >> Hello Rolf, >> >> I think there is a misunderstanding. I don't want to output as XML. >> I want to render the XHTML as text like a very primitive browser would >> display it. >> >> I'm building a String by traversing the tree by calling >> Element.getContent(). For example, a ? can be encoded in XML as >> "?". Presumably, the Element tree would contain an EntityRef with >> name "copy". But what if an XML document contains "&169;" or >> "&x00A9;"? How would the EntityRef object look like? >> >> Thanks! >> >> Take care >> Oliver >> >> On Thu, Mar 29, 2012 at 9:46 AM, Rolf Lear wrote: >>> >>> Hi Oliver. >>> >>> If you already have the XHTML content as JDOM Elements, then you should >>> be >>> able to (just) do: >>> >>> XMLOutputter xout = new XMLOutputter(); >>> String fragment = xout.outputString(element); >>> >>> If you want to change the format of the output (indenting, etc.), you > can >>> add a 'Format' to the XMLOutputter with: >>> >>> XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat()); >>> String fragment = xout.outputString(element); >>> >>> >>> I think you may be chasing a red-herring with the Entity References. 
>>> >>> The EntityRef code is a 'CYA' implementation, but, in reality, the >>> SystemID and PublicID are never going to be needed in regular usage. >>> >>> The only place I know of where you have entity references is if you >>> specify your input parser should ignore entity-reference lookups when >>> parsing, and in JDOM you will end up with an EntityRef instead of it's >>> 'underlying' text. >>> >>> Rolf >>> >>> From olivier.jaquemet at jalios.com Thu Mar 29 08:47:33 2012 From: olivier.jaquemet at jalios.com (Olivier Jaquemet) Date: Thu, 29 Mar 2012 17:47:33 +0200 Subject: [jdom-interest] Simple xhtml/entity resolver? In-Reply-To: References: Message-ID: <4F748415.2040507@jalios.com> Hi Oliver, JDom is a great tool for parsing XML... ... but for XHTML fragment (which may not be completely XHTML compliant ... ?) and specially for text extraction, I would strongly suggest JSoup http://jsoup.org/ String text = org.jsoup.Jsoup.parse(html).text(); Whatever is your html it will work like a charm (even it is an ugly copy paste wysiwyg from word or any ugly html export from whatever website) Olivier On 29/03/2012 15:23, Oliver Ruebenacker wrote: > Hello, > > I need a simple way to convert some XHTML fragments, provided as a > JDOM Element, into plain text. I am willing to ignore most HTML tags > and consider only the most commonly used predefined entities. > > In JDOM, an entity reference has a name, a public id and a system > id. I think I know what the named means, for named entities. But what > about numeric entities, how do I get the code point? And what are > public id and system id? > > Thanks! > > Take care > Oliver > -- Olivier Jaquemet Ing?nieur R&D Jalios S.A. - http://www.jalios.com/ @OlivierJaquemet +33970461480 From curoli at gmail.com Thu Mar 29 09:54:45 2012 From: curoli at gmail.com (Oliver Ruebenacker) Date: Thu, 29 Mar 2012 12:54:45 -0400 Subject: [jdom-interest] Simple xhtml/entity resolver? 
In-Reply-To:
References: <4F748415.2040507@jalios.com>
Message-ID:

     Hello,

  Thanks for all the advice, but it seems I did not make myself
sufficiently clear.

  My situation is this: someone else already parsed the XHTML and gave me
the JDOM element that represents a fragment of it.

  Let us say the original fragment looks something like this:

"

© 2012 by Dewey, Cheetham & Howe

" "

© 2012 by Dewey, Cheetham & Howe

" "

© 2012 by Dewey, Cheetham  Howe

"

  I never get to see that fragment, but instead an object of type
Element. What I want to get is a String that looks roughly like this:

"© 2012 by Dewey, Cheetham & Howe"

  A simple lightweight solution that is roughly acceptable in most
simple cases is fine for my purpose.

  So I am trying a recursive method that iterates over
Element.getContent() and then I am wondering what to do if the content
happens to be an EntityRef?

package cbit.vcell.model.summaries;

import org.jdom.Comment;
import org.jdom.DocType;
import org.jdom.Element;
import org.jdom.EntityRef;
import org.jdom.ProcessingInstruction;
import org.jdom.Text;

public class XHTMLToPlainTextConverter {

    public static String convert(Element element) {
        String text = "";
        for(Object content : element.getContent()) {
            if(content instanceof Comment) {
                // ignore
            } else if(content instanceof DocType) {
                // ignore
            } else if(content instanceof Element) {
                Element childElement = (Element) content;
                text = text + convert(childElement);
            } else if(content instanceof EntityRef) {
                EntityRef ref = (EntityRef) content;
                text = text + ref; // ???
            } else if(content instanceof ProcessingInstruction) {
                // ignore
            } else if(content instanceof Text) {
                Text childText = (Text) content;
                text = text + childText.getText();
            } else {
                // ignore, should not happen
            }
        }
        return text;
    }

}

  Thanks!

     Take care
     Oliver

On Thu, Mar 29, 2012 at 12:19 PM, Chris Pratt wrote:
> Another option I've used in the past is changing the underlying SAX parser
> that JDOM uses to TagSoup (http://ccil.org/~cowan/XML/tagsoup/). Their
> parser is tuned to parsing not fully XML compliant HTML.
>
>   (*Chris*)
>
> On Thu, Mar 29, 2012 at 8:47 AM, Olivier Jaquemet
> wrote:
>>
>> Hi Oliver,
>>
>> JDOM is a great tool for parsing XML...
>>
>> ... but for XHTML fragments (which may not be completely XHTML compliant
>> ... ?)
>> and especially for text extraction, I would strongly suggest JSoup
>> http://jsoup.org/
>>
>>   String text = org.jsoup.Jsoup.parse(html).text();
>>
>> Whatever your HTML is, it will work like a charm (even if it is an ugly
>> copy-paste WYSIWYG from Word or an ugly HTML export from some website).
>>
>> Olivier
>>
>>
>> On 29/03/2012 15:23, Oliver Ruebenacker wrote:
>>>
>>>      Hello,
>>>
>>>   I need a simple way to convert some XHTML fragments, provided as a
>>> JDOM Element, into plain text. I am willing to ignore most HTML tags
>>> and consider only the most commonly used predefined entities.
>>>
>>>   In JDOM, an entity reference has a name, a public id and a system
>>> id. I think I know what the name means, for named entities. But what
>>> about numeric entities, how do I get the code point? And what are
>>> public id and system id?
>>>
>>>   Thanks!
>>>
>>>      Take care
>>>      Oliver
>>>
>>
>> --
>> Olivier Jaquemet
>> R&D Engineer, Jalios S.A. - http://www.jalios.com/
>> @OlivierJaquemet +33970461480
>>
>> _______________________________________________
>> To control your jdom-interest membership:
>> http://www.jdom.org/mailman/options/jdom-interest/youraddr at yourhost.com

--
Oliver Ruebenacker, Computational Cell Biologist
Virtual Cell (http://vcell.org)
SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org)
http://www.oliver.curiousworld.org

From paul at hoplahup.net Thu Mar 29 14:24:26 2012
From: paul at hoplahup.net (Paul Libbrecht)
Date: Thu, 29 Mar 2012 23:24:26 +0200
Subject: [jdom-interest] Simple xhtml/entity resolver?
In-Reply-To:
References: <4F748415.2040507@jalios.com>
Message-ID: <0E465880-5177-4A0B-8D87-68CF8F703246@hoplahup.net>

Oliver,

I'm curious, did you ever get an EntityRef?
In my experience, no SAXBuilder gives you them...
Also, they will transform any numeric reference into a character.

Still, I tried to answer your request and could not. Looking at the
XMLOutputter, I saw that it actually outputs the entity reference itself
(namely: the ampersand, the name, a semicolon), and indeed the EntityRef
object does not carry any information that would allow you to "resolve" it.
That last step, entity resolution, is actually the business of the DTD. The
entity definitions for XHTML are among the reasons for the XHTML DTD's
enormous weight. If I remember correctly, MathML has an entity-definition
table that may be easier to process (also available as XML, if needed).

Also, beware if you want to parse XHTML:
- with a DTD, and without some "public/private catalog", the DTD gets
  loaded from the W3C very slowly (and the W3C refuses requests after a
  while)
- without it, all entity references are broken.
... maybe you don't parse it?

All in all, could I conjecture that the EntityRef objects are actually
created programmatically? If so, you need to expand them in your program
using the table mentioned above (could be a nice contribution).

Hope it helps.

paul
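
The table-based expansion suggested in this thread could be sketched roughly as follows. This is only an illustration, not part of JDOM: the class name EntityTable and its resolve method are hypothetical, and the table holds a small hand-picked subset of the XHTML named entities (a fuller table would have to be built from the XHTML DTD entity sets). Numeric references are not handled here because, as noted above, the parser normally expands them into characters before JDOM sees them.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not a JDOM class): expands a few XHTML entity names. */
final class EntityTable {

    // Assumption: a small subset of the XHTML named entities; extend as needed.
    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("copy", "\u00A9");
        NAMED.put("reg", "\u00AE");
    }

    private EntityTable() {
        // static helper, no instances
    }

    /**
     * Returns the replacement text for an entity name such as "copy".
     * Unknown names fall back to the literal reference ("&name;").
     */
    static String resolve(String name) {
        String value = NAMED.get(name);
        return value != null ? value : "&" + name + ";";
    }
}
```

In the converter posted earlier in the thread, the EntityRef branch could then read something like `text = text + EntityTable.resolve(ref.getName());`, where getName() is the EntityRef accessor for the entity name. Falling back to the literal reference for unknown names keeps unresolved entities visible in the output rather than silently dropping them.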