Bash – Best way to simulate “group by” from bash


Suppose you have a file that contains IP addresses, one address per line.

You need a shell script that counts, for each IP address, how many times it appears in the file, and prints each address together with its count.

One way to do this is:

sort ip_addresses | uniq | while read -r ip
do
    echo -n "$ip "
    grep -cxF "$ip" ip_addresses
done

However, it is far from efficient: it re-reads the whole file once for every distinct address.

How would you solve this problem more efficiently using bash?

(One thing to add: I know it can be solved from perl or awk, I'm interested in a better solution in bash, not in those languages.)


Suppose the source file is 5 GB and the machine running the algorithm has 4 GB of RAM. So sort is not an efficient solution, and neither is reading the file more than once.

I liked the hashtable-like solution – can anybody provide improvements to it?
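A minimal sketch of that hashtable-like approach, using bash 4+ associative arrays: the file is read exactly once, and memory grows with the number of distinct addresses rather than the file size. The sample addresses here are made up for illustration; with the real file you would redirect the loop's input from it instead (`done < ip_addresses`):

```shell
#!/usr/bin/env bash
# Single-pass "group by" using a bash associative array (requires bash 4+).
declare -A count

while read -r ip; do
    (( count[$ip] += 1 ))      # bump this address's counter
done < <(printf '%s\n' 10.0.0.1 10.0.0.1 10.0.0.1 10.0.0.2 10.0.0.3)

# Print each distinct address with its count (iteration order is unspecified).
for ip in "${!count[@]}"; do
    echo "$ip ${count[$ip]}"
done
```

Note that the `while` loop reads from process substitution (or a redirected file) rather than from a pipe, so the loop runs in the current shell and the array survives after it finishes.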


Some people have asked why I would bother doing this in bash when it is far easier in e.g. perl. The reason is that perl wasn't available on the machine where I had to do this. It was a custom-built Linux machine without most of the tools I'm used to. And I think it was an interesting problem anyway.

So please, don't blame the question, just ignore it if you don't like it. 🙂

Best Solution

sort ip_addresses | uniq -c

This will print the count first, but otherwise it should be exactly what you want.
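`uniq -c` prefixes each line with its count, so if the address must come first, the output can be reordered in pure bash with a `read` loop. A sketch with made-up sample addresses; with the real file, replace the `printf` with `sort ip_addresses`:

```shell
# Made-up sample input; sort groups duplicates so uniq -c can count them.
printf '%s\n' 10.0.0.2 10.0.0.1 10.0.0.1 10.0.0.1 10.0.0.3 |
    sort | uniq -c |
    while read -r count ip; do    # read splits off the leading count
        echo "$ip $count"         # reorder to "address count"
    done
# Prints:
# 10.0.0.1 3
# 10.0.0.2 1
# 10.0.0.3 1
```

Unlike the grep-per-address loop in the question, this reads the file only once, though `sort` may spill to temporary files on inputs larger than memory.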