I had an interesting job interview experience a while back. The question started really easy:

Q1: We have a bag containing numbers`1`

,`2`

,`3`

, …,`100`

. Each number appears exactly once, so there are 100 numbers. Now one number is randomly picked out of the bag. Find the missing number.

I've heard this interview question before, of course, so I very quickly answered along the lines of:

A1: Well, the sum of the numbers`1 + 2 + 3 + … + N`

is`(N+1)(N/2)`

(see Wikipedia: sum of arithmetic series). For`N = 100`

, the sum is`5050`

.Thus, if all numbers are present in the bag, the sum will be exactly

`5050`

. Since one number is missing, the sum will be less than this, and the difference is that number. So we can find that missing number in`O(N)`

time and`O(1)`

space.

At this point I thought I had done well, but all of a sudden the question took an unexpected turn:

Q2: That is correct, but now how would you do this ifTWOnumbers are missing?

I had never seen/heard/considered this variation before, so I panicked and couldn't answer the question. The interviewer insisted on knowing my thought process, so I mentioned that perhaps we can get more information by comparing against the expected product, or perhaps doing a second pass after having gathered some information from the first pass, etc, but I really was just shooting in the dark rather than actually having a clear path to the solution.

The interviewer did try to encourage me by saying that having a second equation is indeed one way to solve the problem. At this point I was kind of upset (for not knowing the answer before hand), and asked if this is a general (read: "useful") programming technique, or if it's just a trick/gotcha answer.

The interviewer's answer surprised me: you can generalize the technique to find 3 missing numbers. In fact, you can generalize it to find *k* missing numbers.

Qk: If exactlyknumbers are missing from the bag, how would you find it efficiently?

This was a few months ago, and I still couldn't figure out what this technique is. Obviously there's a `Ω(N)`

time lower bound since we must scan all the numbers at least once, but the interviewer insisted that the *TIME* and *SPACE* complexity of the solving technique (minus the `O(N)`

time input scan) is defined in *k* not *N*.

So the question here is simple:

- How would you solve
**Q2**? - How would you solve
**Q3**? - How would you solve
**Qk**?

### Clarifications

- Generally there are
*N*numbers from 1..*N*, not just 1..100. - I'm not looking for the obvious set-based solution, e.g. using a bit set, encoding the presence/absence each number by the value of a designated bit, therefore using
`O(N)`

bits in additional space. We can't afford any additional space proportional to*N*. - I'm also not looking for the obvious sort-first approach. This and the set-based approach are worth mentioning in an interview (they are easy to implement, and depending on
*N*, can be very practical). I'm looking for the Holy Grail solution (which may or may not be practical to implement, but has the desired asymptotic characteristics nevertheless).

So again, of course you must scan the input in `O(N)`

, but you can only capture small amount of information (defined in terms of *k* not *N*), and must then find the *k* missing numbers somehow.

## Best Solution

Here's a summary of Dimitris Andreou's link.

Remember sum of i-th powers, where i=1,2,..,k. This reduces the problem to solving the system of equations

a

_{1}+ a_{2}+ ... + a_{k}= b_{1}a

_{1}^{2}+ a_{2}^{2}+ ... + a_{k}^{2}= b_{2}...

a

_{1}^{k}+ a_{2}^{k}+ ... + a_{k}^{k}= b_{k}Using Newton's identities, knowing b

_{i}allows to computec

_{1}= a_{1}+ a_{2}+ ... a_{k}c

_{2}= a_{1}a_{2}+ a_{1}a_{3}+ ... + a_{k-1}a_{k}...

c

_{k}= a_{1}a_{2}... a_{k}If you expand the polynomial (x-a

_{1})...(x-a_{k}) the coefficients will be exactly c_{1}, ..., c_{k}- see Viète's formulas. Since every polynomial factors uniquely (ring of polynomials is an Euclidean domain), this means a_{i}are uniquely determined, up to permutation.This ends a proof that remembering powers is enough to recover the numbers. For constant k, this is a good approach.

However, when k is varying, the direct approach of computing c

_{1},...,c_{k}is prohibitely expensive, since e.g. c_{k}is the product of all missing numbers, magnitude n!/(n-k)!. To overcome this, perform computations in Z_{q}field, where q is a prime such that n <= q < 2n - it exists by Bertrand's postulate. The proof doesn't need to be changed, since the formulas still hold, and factorization of polynomials is still unique. You also need an algorithm for factorization over finite fields, for example the one by Berlekamp or Cantor-Zassenhaus.High level pseudocode for constant k:

_{i}._{i}; call them c_{i}. Basically, c_{1}= b_{1}; c_{2}= (c_{1}b_{1}- b_{2})/2; see Wikipedia for exact formulas^{k}-c_{1}x^{k-1}+ ... + c_{k}._{1}, ..., a_{k}.For varying k, find a prime n <= q < 2n using e.g. Miller-Rabin, and perform the steps with all numbers reduced modulo q.

EDIT: The previous version of this answer stated that instead of Z

_{q}, where q is prime, it is possible to use a finite field of characteristic 2 (q=2^(log n)). This is not the case, since Newton's formulas require division by numbers up to k.