[jdom-interest] In the 11th Hour

Fri Mar 23 05:49:56 PDT 2012

Hi Rolf,
	Thank you very much for the performance details! 
	I definitely think you made me a better java programmer.
Mike Brenner

-----------------------------------------------
From: Rolf Lear [mailto:jdom at tuis.net] 
Sent: Thursday, March 22, 2012 6:02 PM
To: Brenner, Mike
Cc: jdom-interest at jdom.org
Subject: Re: [jdom-interest] In the 11th Hour

On 22/03/2012 2:47 PM, Brenner, Mike wrote:
> Hi Rolf,
> 	I hope you decide to take the time to post how you accomplished this,  to make us all better java programmers.

Hi Mike.

Thanks... I think.... but I have to admit, I am still learning things 
too (and have been for years... which I think helps).

It helps that I have spent years doing exactly this sort of thing 
(high-efficiency Java code) as a regular job. I typically focus more on 
the memory side of things because memory typically is the first thing to 
break with Java... when you run out of -Xmx it just dies, but if it runs 
slow, well, there's always more days to process things in... ;-)

My Java work experience is very useful though... I typically work on 
'big' machines, in the finance arena, where setting -Xmx64g is 
'routine', and processing tens of millions of records in memory at one 
time is also common. Having lots of memory creates bad habits in 
programmers, and when you develop your applications with small scale 
data, it is easy to set -Xmx2g, build your system, and then, when you 
scale the system to big data, it all blows up because you have 
non-linear scalability. That's sort of where I get involved... ;-)

So, if you take the specific example of the 'Scan' times, I can go 
through the process I took to get it fast (again).

The first trick is to have realistic expectations of what results you 
should see. With this code, when I started with JDOM 2.x I was able to 
take a benchmark of the code, and I have been able to refer back to 
that. That creates a 'minimum' expectation. If I had not done that then 
I would not have known there was a problem. I am a believer that 
performance should be 'regression tested' as much as the actual 
functionality.

So, having identified the scan time as a problem, I just took the scan 
benchmark code and wrote a loop around it....

	int cnt = 0;
	long start = System.currentTimeMillis();
	while (true) {
		cnt++;
		int sz = 0;
		for (Content e : doc.getDescendants()) {
			sz++;
		}
		if ((cnt % 1000) == 0) {
			long time = System.currentTimeMillis() - start;
			start = start + time;
			System.out.printf("Looped %d times @ %.3fms each (%d lines)\n", cnt, 
time / 1000.0, sz);
		}
	}

The challenge is then simply to make the time-per-thousand as small as 
possible.

Then, while that is running, you launch the trusty 
$JAVA_HOME/bin/jvisualvm , and you profile the code.

With some tweaking of the settings, you should be looking for three 
things: CPU Hogs, Memory Hogs, and Tight Loops

- Profile the CPU time, and once the profiling has taken 'some time' (20 
seconds or so), you create a snapshot of the profiling. Look at the 
profile (not the snapshot), and get an idea of the methods that took the 
most time. Then, in the snapshot, dig down in to it to find out where 
the code is being called from, why it is called, how often, and where it 
is spending its time. This is a little bit of 'black magic' to identify 
where the 'hot spot' is. You need to be able to look at the source code, 
switch to the snapshot, and back again, and so on.

- while you are in this snapshot, look for items that are called very 
often. Even a small improvement on 'tight' code will make a big 
difference. (I have spent a lot of time working on the 'Verifier' for 
example).

Then, once you have identified the 'hot' code you can look at the code, 
and decide if things could be done differently. Even a small change can 
make a difference. Once you make a small change, re-run the test loop 
(no profiling). Check to see whether it is any faster (or slower).

When I start running out of things in the CPU profiles to 'tweak', I 
start looking at the memory side of things. Often you see a lot of 
memory allocation in strange patterns. Use the Memory profiler in 
jvisualvm (make sure you change the setting to enable "Record 
allocation's stack trace").

Getting back to the 'scan' loops, I tried a number of different changes 
to get things better. Changing the whole iterator around (making 
hasNext() do nothing, and putting all the logic in to 'next()') brought 
it from 3.2ms to about 2.7ms... which was great, but not as good as JDOM 
1.x. So, I after looking at things some more, I decided to try to 
replace the LinkedList as a stack. I know this sort of thing makes a 
difference (from experience). I was somewhat surprised by how much of a 
difference it made though....

For the record, the reason I chose the DescendantIterator to work on is 
because I had already gone through this sort of exercise with the 
ContentList Iterators before (bringing the XPath code from 90ms to 
30ms), so I did not think I would find much there. If I was starting 
'fresh' I would have looked at the lower levels some more too.

Once I had make a good change, I reran the full 'benchmark' and checked 
the results.

A more interesting fix was the one on the FormatStack. I introduced the 
FormatStack itself to solve a performance problem - having to keep 
re-calculating the 'state' of the output formats... But, when I ran the 
XMLOutputter code through the Memory profiler I found a lot of time was 
spent in the Format.escapeText() method. This is a very tight method 
call, and I tried really hard to make it faster. I did a few things 
which made a difference, but, when I changed from 'Raw' format to 
'Pretty' format, it was still very slow. I cold not find the slow-down 
until I checked the memory profiler, and identified that a huge amount 
of char[] memory was being created in the FormatStack.push() method.

Memory is a double-edged sword in Java (and other languages). For every 
byte you allocate, you also need to collect it. Allocating is slow, and 
collecting is slow (especially if it is a 'full' collection - 
single-threaded).

So, as soon as I saw all that memory, and realized that it was coming 
from 'build the indent string' type string concatenation, I found a way 
to not do that unless 'things change', so, now the FormatStack reuses 
the indents. This made a huge impact on performance. Saving the 
allocation of memory pays for itself three times over: not having to do 
the calculation of the new value, not having to allocate and store the 
result, and not having to GC it later.

Additionally, because JDOM has a comprehensive test harness, it is easy 
to make reliable changes, even if they would otherwise be called 'big'. 
It becomes 'easy' to play with things and see how it changes.

Yet another interesting one is the 'escapeText()' code in Format. It 
used to look like:

public boolean shouldEscape(char ch) {
	if (bits == 16) {
		if (Verifier.isHighSurrogate(ch)) {
			return true;  // Safer this way per 
http://unicode.org/faq/utf_bom.html#utf8-4
		}
		return false;
	}
	if (bits == 8) {
		if (ch > 255) {
			return true;
		}
		return false;
	}
	if (bits == 7) {
		if (ch > 127) {
			return true;
		}
		return false;
	}
	if (Verifier.isHighSurrogate(ch))
		return true;  // Safer this way per 
http://unicode.org/faq/utf_bom.html#utf8-4

	if (encoder != null) {
		try {
			return !encoder.canEncode(ch);
		}
		catch (Exception ignored) {
			// ignore problems.
		}
	}
	// Return false if we don't know.  This risks not escaping
	// things which should be escaped, but also means people won't
	// start getting loads of unnecessary escapes.
	return false;
}

That is a 'big' method, especially when it is called for *every* 
character that is output by JDOM. The thing is that the nature of the 
output (7bit, 8bit, 16bit, Charset, or default) is known when the Format 
instance is created. Having to check again for each character is 
unnecessary. So, I changed the code to have different escape strategies 
for each type, and replaced the 'Default' with one of a set of them. 
Now, for example, if you are outputting to a UTF-8 based output, the 
method is simply:

	public final boolean shouldEscape(final char ch) {
		return Verifier.isHighSurrogate(ch);
	}

And, since I have different instances of the class for different 
'shouldEscape()' implementations, the decision is made once for each 
Format, instead of once for each char.

Summary....

Here are some 'tips' I have come to learn, and they influence my coding 
style, and how I look at performance problems.

Some of them have bad reputations... and people complain about 
'optimizing too early', ... but, it works for me.

Use 'final' as liberally as possible. Classes, methods, variables, 
parameters, everything.

I have learned, over the years, to be very wary of the Collections API 
when it comes to memory and performance. They are all extreme memory 
hogs, and as a result, they are slow... Be careful using them in 
performance-critical code. arrays of primitives are often just as easy 
to use, and are faster.

while-loops with break/continue statements are faster than conditionals 
in the loop.

loops with a constant test are much faster than loops with 
conditionals... for example:

   for (int i = 0; i < list.size(); i++) {
     ....
   }

the above is fine, but this is better:

   final int len = list.size();
   for (int i = 0; i < len; i++) {
     ...
   }

When you can, do count-down loops instead:
   int idx = list.size();
   while (--index >= 0) {
     ....
   }