Freelists are a common way to speed up allocation by reusing existing memory that was already allocated. Is there a way to use free-lists in a concurrent allocator, without incurring the overhead of a lock for every allocation (which would neutralize the intended performance gain of the freelist)?
R – Freelists with concurrent allocators
In general unrolling loops by hand is not worth the effort. The compiler knows better how the target architecture works and will unroll the loop if it is beneficial.
There are code-paths that benefit when unrolled for Pentium-M type CPU's but don't benefit for Core2 for example. If I unroll by hand the compiler can't make the decision anymore and I may end up with less than optimal code. E.g. exactly the opposite I tried to achieve.
There are several cases where I do unroll performance critical loops by hand, but I only do this if I know that the compiler will - after manual unrolling - be able to use architectural specific feature such as SSE or MMX instructions. Then, and only then I do it.
Btw - modern CPUs are very efficient at executing well predictable branches. This is exactly what a loop is. The loop overhead is so small these days that it rarely makes a difference. Memory latency effects that may occur due to the increase in code-size will however make a difference.
The first thing about CMS that I have learned is it needs more memory than the other collectors, about 25 to 50% more is a good starting point. This helps you avoid fragmentation, since CMS does not do any compaction like the stop the world collectors would. Second, do things that help the garbage collector; Integer.valueOf instead of new Integer, get rid of anonymous classes, make sure inner classes are not accessing inaccessible things (private in the outer class) stuff like that. The less garbage the better. FindBugs and not ignoring warnings will help a lot with this.
As far as tuning, I have found that you need to try several things:
Tells JVM to use CMS in tenured gen.
Fix the size of your heap: -Xmx2048m -Xms2048m This prevents GC from having to do things like grow and shrink the heap.
use parallel instead of serial collection in the young generation. This will speed up your minor collections, especially if you have a very large young gen configured. A large young generation is generally good, but don't go more than half of the old gen size.
set the number of threads that CMS will use when it is doing things that can be done in parallel.
-XX:+CMSParallelRemarkEnabled remark is serial by default, this can speed you up.
-XX:+CMSIncrementalMode allows application to run more by pasuing GC between phases
-XX:+CMSIncrementalPacing allows JVM to figure change how often it collects over time
-XX:CMSIncrementalDutyCycleMin=X Minimm amount of time spent doing GC
-XX:CMSIncrementalDutyCycle=X Start by doing GC this % of the time
I have found that you can get generally low pause times if you set it up so that it is basically always collecting. Since most of the work is done in parallel, you end up with basically regular predictable pauses.
This one is very important. It tells the CMS collector to always complete the collection before it starts a new one. Without this, you can run into the situation where it throws a bunch of work away and starts again.
By default, CMS will let your PermGen grow till it kills your app a few weeks from now. This stops that. Your PermGen would only be growing though if you make use of Reflection, or are misusing String.intern, or doing something bad with a class loader, or a few other things.
Survivor ratio and tenuring theshold can also be played with, depending on if you have long or short lived objects, and how much object copying between survivor spaces you can live with. If you know all your objects are going to stick around, you can configure zero sized survivor spaces, and anything that survives one young gen collection will be immediately tenured.
Use a lock-free linked list.