I'd like to compare some large objects representing trees, and cache something so that I don't have to compare each new object against the ones that already exist.
The question is: what would the best "something" be (a compromise between performance and collisions)?
On the one hand, I have a regular hashCode function based on the values of various fields (following Chapter 3 of Effective Java). But I'm not able to evaluate the potential for collisions entailed by such an approach.
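For reference, this is roughly the kind of thing I mean (TreeNode and its fields are just placeholders for my real structure):

```java
import java.util.List;

// Placeholder tree class; hashCode follows the Effective Java Chapter 3 recipe:
// start from a nonzero constant and fold in each significant field with 31 * result.
public final class TreeNode {
    private final String label;
    private final List<TreeNode> children;

    public TreeNode(String label, List<TreeNode> children) {
        this.label = label;
        this.children = children;
    }

    @Override
    public int hashCode() {
        int result = 17;
        result = 31 * result + label.hashCode();
        result = 31 * result + children.hashCode(); // List.hashCode() recurses into the children
        return result;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof TreeNode)) return false;
        TreeNode other = (TreeNode) o;
        return label.equals(other.label) && children.equals(other.children);
    }
}
```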
On the other hand, I have the MessageDigest approach from the standard Java distribution, with the SHA-1 algorithm. I presume it won't be as fast, but I may have fewer collisions. Am I right? Is it a correct solution in my context, or am I completely wrong?
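Something like this is what I have in mind (sha1Of is just a placeholder name; any stable byte serialization of the tree would do in place of the canonical string):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Digest a canonical string form of the tree and return the 20-byte SHA-1 value.
static byte[] sha1Of(String canonicalForm) throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-1");
    return md.digest(canonicalForm.getBytes(StandardCharsets.UTF_8));
}
```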
The thing is, I don't know in advance how large the objects will be. Please also note that the computed value is not going to be used in a hash table.
Thanks!
Best Solution
Keep in mind the following:
Generally, you can estimate the chance of a collision from the number of expected objects and the number of possible hash values (the max hash value); see http://en.wikipedia.org/wiki/Birthday_paradox for the detailed explanation.
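As a rough sketch of that estimate (the method name and example figures below are mine, using the standard birthday-bound approximation):

```java
// Birthday-bound approximation: P(collision) ≈ 1 - exp(-n(n-1) / (2N)),
// where n is the number of objects and N the number of possible hash values.
static double collisionProbability(double n, int hashBits) {
    double N = Math.pow(2, hashBits);
    return 1 - Math.exp(-n * (n - 1) / (2 * N));
}

// collisionProbability(10_000, 32)  ≈ 0.012  -> about a 1.2% chance with a 32-bit hashCode
// collisionProbability(10_000, 160) ≈ 0      -> vanishingly small with 160-bit SHA-1
```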
Personally? Java objects (instantiated classes), fewer than 10,000 of them? Hash code. Representing files / blobs / lots of data? SHA-1. I use SHA-1 hashing in my database to keep people from doing ETL work on the same file more than once. I then use SHA-1 hashing again at a second level to keep people from ETLing the same section in more than one file (e.g., different files, but the same order shows up twice).
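For the file case, a streaming sketch along these lines keeps the whole file from having to fit in memory (sha1OfFile is a made-up name, and the details are just one way to do it):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Stream the file through a DigestInputStream so the digest is updated chunk by chunk.
static byte[] sha1OfFile(Path file) throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-1");
    try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
        byte[] buffer = new byte[8192];
        while (in.read(buffer) != -1) {
            // reading is all that's needed; each read() also feeds the digest
        }
    }
    return md.digest(); // 20-byte SHA-1 of the file's contents
}
```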