Some test results
I've gotten a lot of good answers to this question--thanks folks--so I decided to run some tests and figure out which method is actually fastest. The five methods I tested are these:
- the "ContainsKey" method that I presented in the question
- the "TestForNull" method suggested by Aleksandar Dimitrov
- the "AtomicLong" method suggested by Hank Gay
- the "Trove" method suggested by jrudolph
- the "MutableInt" method suggested by phax.myopenid.com
Method
Here's what I did...
- created five classes that were identical except for the differences shown below. Each class had to perform an operation typical of the scenario I presented: opening a 10MB file and reading it in, then performing a frequency count of all the word tokens in the file. Since this took an average of only 3 seconds, I had it perform the frequency count (not the I/O) 10 times.
- timed the loop of 10 iterations but not the I/O operation and recorded the total time taken (in clock seconds) essentially using Ian Darwin's method in the Java Cookbook.
- performed all five tests in series, and then did this another three times.
- averaged the four results for each method.
Results
I'll present the results first and the code below for those who are interested.
The ContainsKey method was, as expected, the slowest, so I'll give the speed of each method in comparison to the speed of that method.
- ContainsKey: 30.654 seconds (baseline)
- AtomicLong: 29.780 seconds (1.03 times as fast)
- TestForNull: 28.804 seconds (1.06 times as fast)
- Trove: 26.313 seconds (1.16 times as fast)
- MutableInt: 25.747 seconds (1.19 times as fast)
Conclusions
It would appear that only the MutableInt method and the Trove method are significantly faster, in that only they give a performance boost of more than 10%. However, if threading is an issue, AtomicLong might be more attractive than the others (I'm not really sure). I also ran TestForNull with final
variables, but the difference was negligible.
Note that I haven't profiled memory usage in the different scenarios. I'd be happy to hear from anybody who has good insights into how the MutableInt and Trove methods would be likely to affect memory usage.
Personally, I find the MutableInt method the most attractive, since it doesn't require loading any third-party classes. So unless I discover problems with it, that's the way I'm most likely to go.
The code
Here is the crucial code from each method.
ContainsKey
import java.util.HashMap;
import java.util.Map;
...
Map<String, Integer> freq = new HashMap<String, Integer>();
...
int count = freq.containsKey(word) ? freq.get(word) : 0;
freq.put(word, count + 1);
TestForNull
import java.util.HashMap;
import java.util.Map;
...
Map<String, Integer> freq = new HashMap<String, Integer>();
...
Integer count = freq.get(word);
if (count == null) {
freq.put(word, 1);
}
else {
freq.put(word, count + 1);
}
AtomicLong
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;
...
final ConcurrentMap<String, AtomicLong> map =
new ConcurrentHashMap<String, AtomicLong>();
...
map.putIfAbsent(word, new AtomicLong(0));
map.get(word).incrementAndGet();
Trove
import gnu.trove.TObjectIntHashMap;
...
TObjectIntHashMap<String> freq = new TObjectIntHashMap<String>();
...
freq.adjustOrPutValue(word, 1, 1);
MutableInt
import java.util.HashMap;
import java.util.Map;
...
class MutableInt {
int value = 1; // note that we start at 1 since we're counting
public void increment () { ++value; }
public int get () { return value; }
}
...
Map<String, MutableInt> freq = new HashMap<String, MutableInt>();
...
MutableInt count = freq.get(word);
if (count == null) {
freq.put(word, new MutableInt());
}
else {
count.increment();
}
Best Solution
The question is (now) about storing lots of data, which can be represented using primitive types like
int
, in a Map. Some of the answers here are very misleading in my opinion. Let's see why.I modified the benchmark from trove to measure both runtime and memory consumption. I also added PCJ to this benchmark, which is another collections library for primitive types (I use that one extensively). The 'official' trove benchmark does not compare IntIntMaps to Java Collection's
Map<Integer, Integer>
, probably storingIntegers
and storingints
is not the same from a technical point of view. But a user might not care about this technical detail, he wants to store data representable withints
efficiently.First the relevant part of the code:
I assume the data comes as primitive
ints
, which seems sane. But this implies a runtime penalty for java util, because of the auto-boxing, which is not neccessary for the primitive collections frameworks.The runtime results (without
gc()
calls, of course) on WinXP, jdk1.6.0_10:While this might already seem drastic, this is not the reason to use such a framework.
The reason is memory performance. The results for a Map containing 100000
int
entries:Java Collections needs more than three times the memory compared to the primitive collection frameworks. I.e. you can keep three times as much data in memory, without resorting to disk IO which lowers runtime performance by magnitudes. And this matters. Read highscalability to find out why.
In my experience high memory consumption is the biggest performance issue with Java, which of course results in worse runtime performance as well. Primitive collection frameworks can really help here.
So: No, java.util is not the answer. And "adding functionality" to Java collections is not the point when asking about efficiency. Also the modern JDK collections do not "out-perform even the specialized Trove collections".
Disclaimer: The benchmark here is far from complete, nor is it perfect. It is meant to drive home the point, which I have experienced in many projects. Primitive collections are useful enough to tolerate fishy API - if you work with lots of data.