Sunday, February 10, 2008

Performance of unzip

I've been pondering the performance of my unzip implementation for the last few hours...

The most common use case for my implementation is unpacking a 500 MB zip file. In most cases, the entire file is unzipped (as opposed to extracting individual files from the archive). My implementation uses standard fopen/fread/fseek for portability reasons (it has to work on both UNIX and Windows).

For those of you not familiar with the zip file format, it uses a central directory. Compared with something like tar/gzip, this means you get the benefit of random access (you can extract individual files without traversing the entire archive), but at the cost of not being able to process zip files in a streaming manner. So if you need a single file from a 500 MB archive, you just find it in the directory at the end of the file, fseek() to the proper offset, and extract it.
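To make this concrete, here is a minimal C sketch of locating the central directory via the End of Central Directory record. It assumes the archive has no trailing comment (so the 22-byte record sits at the very end of the file) and ignores ZIP64; real code would scan backwards for the signature.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Returns the central directory offset, or -1 on failure.
       Assumes no archive comment, so the 22-byte EOCD record
       is the last thing in the file. */
    long find_central_dir(FILE *fp)
    {
        unsigned char eocd[22];

        if (fseek(fp, -22L, SEEK_END) != 0)
            return -1;
        if (fread(eocd, 1, sizeof eocd, fp) != sizeof eocd)
            return -1;
        if (memcmp(eocd, "PK\x05\x06", 4) != 0)  /* EOCD signature */
            return -1;

        /* Central directory offset: little-endian u32 at byte 16. */
        return (long)((uint32_t)eocd[16] |
                      (uint32_t)eocd[17] << 8 |
                      (uint32_t)eocd[18] << 16 |
                      (uint32_t)eocd[19] << 24);
    }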

However, in my case, the entire archive has to be extracted most of the time. With a sequential access mechanism, this can be very expensive. Unless you are prepared to build a cache of the entire directory in memory and index it by offset, your access pattern looks something like this:

open the zip directory
i = 0
do {
    process the directory entry
    fseek() to the offset where the compressed content is
    decompress the content
    fseek() back to the directory
    advance to the next directory entry
    i++
} while (i < entries)
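In C terms, that loop looks roughly like the sketch below. read_directory_entry() and inflate_to_disk() are hypothetical placeholders for the real directory parsing and zlib decompression, not actual API calls.

    #include <stdio.h>

    /* Hypothetical types and helpers standing in for real code. */
    struct zip_entry { long data_offset; };
    struct zip_entry read_directory_entry(FILE *fp, long *dir_pos);
    void inflate_to_disk(FILE *fp, const struct zip_entry *e);

    void extract_all(FILE *fp, long central_dir_offset, unsigned entries)
    {
        long dir_pos = central_dir_offset;       /* found via the EOCD record */
        for (unsigned i = 0; i < entries; i++) {
            fseek(fp, dir_pos, SEEK_SET);        /* back to the directory */
            struct zip_entry e = read_directory_entry(fp, &dir_pos);
            fseek(fp, e.data_offset, SEEK_SET);  /* jump to the content */
            inflate_to_disk(fp, &e);             /* touch the data exactly once */
        }
    }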

So you are continuously moving the seek pointer back and forth, which prevents the OS from employing a good buffering strategy.

mmap(), or the Win32 equivalent MapViewOfFile(), could be pretty useful here. It would allow random access at a fixed cost, but I am not sure what the effects are in terms of virtual memory buffering. There are also two types of buffering to consider: prefetching data I have not asked for yet, and keeping in memory data I have already asked for. In my usage pattern, I really need good prefetching, since I generally only access a specific region of memory once. While you could argue that the regions comprising the zip directory are accessed multiple times, the directory probably accounts for less than 1% of the archive, so the goal would be to optimize for access to the compressed content.
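As a first experiment, a POSIX sketch of the mmap() route might look like the following; madvise() with MADV_SEQUENTIAL is the standard way to ask the kernel for aggressive read-ahead, which matches my touch-everything-once pattern. The file name is a placeholder, and the Win32 path via CreateFileMapping()/MapViewOfFile() would be analogous.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("archive.zip", O_RDONLY);  /* placeholder name */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        void *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Each compressed region is read exactly once, front to back,
           so hint the kernel to prefetch ahead of the faulting page. */
        madvise(base, st.st_size, MADV_SEQUENTIAL);

        /* ... walk the central directory and decompress each entry
           directly from (unsigned char *)base + entry_offset ... */

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }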

More on this later...