Java – Fastest Hash algorithm in Java for Strings

hashjavamerkle-tree

To make it simple, my question is: how to hash a String (about 200 characters) as quickly as possible. Security is not important, but collisions ARE a big deal.

Note: After a quick investigation, it seems like MurmurHash3 might be the best choice. I am open to any comment saying otherwise tho'

First, I know that there are plenty of other similar question, but I couldn't find a convincing answer yet.

I have a list of objects, each containing a list of about 3k paragraphs which is saved to a database. Every X hours, those paragraph are regenerated and I need to find if any paragraphs has changed, and if so, push only those new paragraphs.

The quickest way I found to find the differences (knowing that most of the time the content will be identical) is to create a MerkleTree, save it to the DB, and iterate over the MerkleTree to find the differences, instead of comparing the paragraphs themselves.

This imply, in my case, that I will be creating ten thousands of hashes per second to compare with what is in the DB. Therefore, I need a very efficient way to create those hashes. I don't care about the security, I only need to ensure that the number of collision remains very very low.

What would be the best algorithm available in Java for that?


In my case, the main object is composed of Sections, which is composed of Languages, which is composed of Paragraph. The comparison strategy is:

1) If the object hash is identical, stop, otherwise go to 2)

2) Loop on all Section, keep only the Section with a different hash

3) Loop on all Languages of those Sections, keep only the language with a different hash

4) Loop on all the Paragraph of all those Languages, if the hash is different, then push the new content.

Best Solution

This amazing answer on Programmers Stack Exchange tells you all you need to know.

The short version is, use FNV-1a, aka the Fowler–Noll–Vo hash function, it has excellent performance, high randomness and low collisions.

Any further explanation I might shed on this question would be just be a copy and paste from that Programmers.SE answer, which incidentally is the second highest voted answer on the entire site.

Some other thoughts:

  • Ultimately, you have a pretty niche use case. Most people aren't dealing with 1 billion entry datasets regularly. As such, you may have to do your own benchmarking.
  • That said, having a high randomness suggests that the algorithm is likely to scale well for English hashes.
  • You haven't really talked about other issues; are you able to keep the entire data set in memory? What are your footprint requirements?

See also: Fastest Hash Algorithm for Text Data