I have around 130 zip files that I need to distribute to users. Each zip file contains a number of similar text, HTML, XML, and JPG files. The zip files total 146 MB; unzipped, their contents total 551 MB.
I want to distribute all these files together to users in as small a format as possible. I looked into two different ways of doing it, each using two different compression schemes, zip and 7zip (which I understand is either LZMA or a variant thereof):
- Method 1: compress all the zip files into a single compressed file and send that file
- Method 2: compress the unzipped contents of the zip files into a single compressed file and send that file
For example, say that I have 3 zip files, A.zip, B.zip and C.zip, each of which contains one text file, one html file, and one XML file. With method 1, a single compressed file would be created containing A.zip, B.zip and C.zip. With method 2, a single compressed file would be created containing A.txt, A.html, A.xml, B.txt, B.html, B.xml, C.txt, C.html, and C.xml.
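To make the two methods concrete, here is a minimal Python sketch using the standard `zipfile` module. The file contents are hypothetical stand-ins, and `make_zip` is a helper invented for this illustration; it only shows how the two archive layouts differ and does not reproduce my actual data or sizes.

```python
import io
import zipfile


def make_zip(files):
    """Return the bytes of a deflate-compressed zip holding `files` (name -> bytes)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


# Hypothetical stand-ins for the contents of A.zip, B.zip and C.zip.
contents = {
    stem: {
        f"{stem}.{ext}": f"<doc>{stem} {ext} body</doc>\n".encode() * 200
        for ext in ("txt", "html", "xml")
    }
    for stem in ("A", "B", "C")
}

# Method 1: an outer zip whose members are the inner zip files.
inner_zips = {f"{stem}.zip": make_zip(files) for stem, files in contents.items()}
method1 = make_zip(inner_zips)

# Method 2: one zip whose members are the unzipped files themselves.
flat = {name: data for files in contents.values() for name, data in files.items()}
method2 = make_zip(flat)

print("method 1 bytes:", len(method1))
print("method 2 bytes:", len(method2))
```

With method 1 the outer archive sees only three already-compressed members; with method 2 it sees all nine original files.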
My assumption was that under either compression scheme, the file generated by method 2 would be smaller or at least the same size as the file generated by method 1, as you might be able to exploit efficiencies by considering all the files together. At the very least, method 2 would avoid the overhead of multiple zip files.
The surprising results (the sizes of files generated by the 7zip tool) were as follows:
- single.zip – 142 MB
- single.7z – 124 MB
- combined.zip – 149 MB
- combined.7z – 38 MB
I'm not surprised that the 7zip format produced smaller files than the zip format (results 2 and 4 vs results 1 and 3), as it generally compresses better than zip. What was surprising was that, for the zip format, compressing all 130 zip files together resulted in a smaller output file than compressing all their uncompressed contents (result 1 vs result 3).
Why is it more efficient to zip several zip files together than to zip their unzipped contents together?
The only thing I can think of is that during compression, the 7zip format builds a dictionary across all the file contents, so it can exploit similarities between files, while the zip format builds its dictionary per file. Is that true? Even if it is, that still doesn't explain why result 3 was 7 MB larger than result 1.
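A rough way to test the per-file vs cross-file hypothesis is the Python sketch below. It is an illustration under stated assumptions, not a model of what the 7zip tool does internally: the input is synthetic (files that share one block of random bytes, so the only redundancy is between files), a plain `lzma.compress` over the concatenated contents stands in for a 7z-style archive, and the helper names are made up for this example.

```python
import io
import lzma
import random
import zipfile


def make_similar_files(count=20, seed=0):
    """Synthetic files: one shared random block plus a short unique tail each."""
    rng = random.Random(seed)
    # Random bytes are incompressible on their own, so any savings must
    # come from redundancy *between* files.
    shared = bytes(rng.getrandbits(8) for _ in range(20_000))
    return {f"file{i}.txt": shared + f"unique tail {i}".encode() for i in range(count)}


def zip_size(files):
    """Size of a zip archive; each member is deflated independently."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


def solid_lzma_size(files):
    """Size of one LZMA stream over all contents concatenated (assumed stand-in for 7z)."""
    return len(lzma.compress(b"".join(files.values())))


files = make_similar_files()
print("zip, per-file deflate:", zip_size(files))
print("single LZMA stream:   ", solid_lzma_size(files))
```

On this synthetic input the single LZMA stream should come out far smaller, because it can refer back to the shared block from earlier files while each zip member starts from scratch. That would be consistent with the gap between combined.zip and combined.7z, but it says nothing about why combined.zip came out larger than single.zip.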
Thanks for your help.