Ways to hash a numeric vector


Are there any known hash algorithms which input a vector of int's and output a single int that work similarly to an inner product?

In other words, I am thinking about a hash algorithm that might look like this in C++:

// For simplicity, I'm not worrying about overflow, and assuming |v| < 7.
int HashVector(const vector<int>& v) {
  const int N = kSomethingBig;
  const int w[] = {234, 739, 934, 23, 828, 194};  // Carefully chosen constants.
  int result = 0;
  for (int i = 0; i < v.size(); ++i) result = (result + w[i] * v[i]) % N;
  return result;

I'm interested in this because I'm writing up a paper on an algorithm that would benefit from any previous work on similar hashes. In particular, it would be great if there is anything known about the collision properties of a hash algorithm like this.

The algorithm I'm interested in would hash integer vectors, but something for float vectors would also be cool.


The hash is intended for use in a hash table for fast key/value lookups. There is no security concern here.

The desired answer is something like a set of constants that provably work particularly well for a hash like this – analogous to a multiplier and modulo which works better than others as a pseudorandom number generator.

For example, some choices of constants for a linear congruential pseudorandom generator are known to give optimal cycle lengths and have easy-to-compute modulos. Maybe someone has done research to show that a certain set of multiplicative constants, along with a modulo constant, in a vector hash can reduce the chance of collisions amongst nearby integer vectors.

Best Solution

I did some (unpublished, practical) experiments with testing a variety of string hash algorithms. (It turns out that Java's default hash function for Strings sucks.)

The easy experiment is to hash the English dictionary and compare how many collisions you have on algorithm A vs algorithm B.

You can construct a similar experiment: randomly generate $BIG_NUMBER of possible vectors of length 7 or less. Hash them on algorithm A, hash them on algorithm B, then compare number and severity of collisions.

After you're able to do that, you can use simulated annealing or similar techniques to find "magic numbers" which perform well for you. In my work, for given vocabularies of interest and a tightly limited hash size, we were able to make a generic algorithm work well for several human languages by varying the "magic numbers".