
Selectively extract entries from a zip file in S3 without downloading entire file

I am trying to pull specific items out of massive zip files in S3 without downloading the entire file.

A Python solution here: Read ZIP files from S3 without downloading the entire file appears to work. The equivalent underlying capabilities in Java appear to be less lenient in general, so I've had to make various adjustments.

In my code you can see that I'm successfully fetching the central directory and writing it to a temp file that Java's ZipFile can use to iterate the zip entries from the central directory.
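Roughly, the central-directory fetch looks like this (a minimal sketch with AWS SDK v2; the bucket, key, and the 64 KiB tail guess are placeholders, and the guess has to be large enough to cover the end-of-central-directory record plus the whole central directory):

import java.nio.file.Files;
import java.nio.file.Path;

import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;

class CentralDirectoryFetcher {
    // Pull the tail of the object (EOCD record + central directory) with a
    // ranged GET and stash it in a temp file for java.util.zip.ZipFile.
    static Path fetchTail(S3Client s3, String bucket, String key) throws Exception {
        long size = s3.headObject(HeadObjectRequest.builder()
                .bucket(bucket).key(key).build()).contentLength();
        long tail = 64 * 1024;                         // guess; must cover EOCD + central directory
        long start = Math.max(0, size - tail);
        ResponseBytes<GetObjectResponse> bytes = s3.getObjectAsBytes(
                GetObjectRequest.builder().bucket(bucket).key(key)
                        .range("bytes=" + start + "-" + (size - 1))
                        .build());
        Path tmp = Files.createTempFile("central-dir", ".zip");
        Files.write(tmp, bytes.asByteArray());
        return tmp;                                    // new ZipFile(tmp.toFile()) iterates entries
    }
}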

However, I'm stuck on inflating an individual entry. The current code throws a bad-header exception. Do I need to give the Inflater the local file header plus the compressed content, or just the compressed content? I've tried both, but I'm clearly either not using the Inflater correctly and/or not giving it what it expects.


Edit

Comparing the Python calculations to the ones in my Java code, I realized that Java is off by 4. entry.getExtra().length may report 24, for example, as does the zipinfo command-line utility for the same entry; Python reports 28. I don't fully understand the discrepancy, but the PKWARE spec mentions a "2 byte identifier and a 2 byte data size field" for the extra field. At any rate, adding a fudge value of 4 got it working, but I'd like to understand what is happening a little more; adding arbitrary fudge values to make things work isn't settling:

offset += 30 + entry.getName().length() + extra + 4;
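The likely explanation is that the extra field stored in an entry's local file header is not required to match the copy in the central directory, which is what entry.getExtra() and zipinfo report; the Python code reads the length fields out of the local header itself. So rather than a fudge constant, it should be more robust to read the name and extra lengths straight from the local header (offsets 26 and 28 of the 30-byte fixed header, per the PKWARE APPNOTE). Note too that entry.getName().length() counts chars rather than bytes, which would also break for non-ASCII names. A sketch with a hypothetical helper:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class LocalHeaders {
    // Where an entry's compressed data starts, computed from the name/extra
    // lengths in the local file header itself (fetch the 30 fixed bytes at
    // localHeaderOffset with a ranged GET and pass them in here).
    static long dataOffset(byte[] localHeader, long localHeaderOffset) {
        ByteBuffer buf = ByteBuffer.wrap(localHeader).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != 0x04034b50) {                    // "PK\003\004" signature
            throw new IllegalStateException("not a local file header");
        }
        int nameLen  = Short.toUnsignedInt(buf.getShort(26)); // file name length, in bytes
        int extraLen = Short.toUnsignedInt(buf.getShort(28)); // extra field length, local copy
        return localHeaderOffset + 30 + nameLen + extraLen;   // 30 = fixed local header size
    }
}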


Answer

My general approach was sound but hindered by the lack of detail exposed by Java's ZipFile. For example, sometimes there are an extra 16 bytes at the end of the compressed data, before the beginning of the next local header: the optional data descriptor (a 4-byte signature followed by the CRC-32, compressed size, and uncompressed size). Nothing in ZipFile can help with this.
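On the inflation question above: entry data in a zip is stored as a raw deflate stream with no zlib wrapper, which is what produces the bad-header error with a default Inflater. Constructing the Inflater with nowrap=true and feeding it only the compressed bytes (not the local file header) works. A minimal sketch, assuming the entry uses compression method 8 (deflate):

import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

class RawInflate {
    // Zip entries are raw deflate: open the Inflater with nowrap=true and
    // give it only the compressed bytes, never the local file header.
    static byte[] inflateEntry(byte[] compressed, int uncompressedSize) throws DataFormatException {
        Inflater inflater = new Inflater(true);    // true = raw deflate, no zlib wrapper
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[uncompressedSize];
            int n = inflater.inflate(out);
            if (n != uncompressedSize) {
                throw new IllegalStateException("short inflate: " + n + " of " + uncompressedSize);
            }
            return out;
        } finally {
            inflater.end();
        }
    }
}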

zip4j appears to be a superior option, providing methods such as header.getOffsetLocalHeader(), which removes some of the error-prone calculations.
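Tying it together, a sketch of the end-to-end extraction, reusing the hypothetical LocalHeaders and RawInflate helpers from the sketches above and assuming the temp-file trick from the question also satisfies zip4j's ZipFile:

import java.io.File;

import net.lingala.zip4j.ZipFile;
import net.lingala.zip4j.model.FileHeader;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

class EntryExtractor {
    // Look one entry up in the central directory, then fetch and inflate
    // exactly its compressed bytes. Fetching exactly compressedSize bytes
    // also sidesteps the optional 16-byte data descriptor mentioned above.
    static byte[] extract(S3Client s3, String bucket, String key,
                          File centralDirFile, String entryName) throws Exception {
        FileHeader header = new ZipFile(centralDirFile).getFileHeader(entryName);
        long lho = header.getOffsetLocalHeader();  // no hand-rolled offset arithmetic

        // Probe the 30-byte fixed portion of the local header for the real lengths.
        byte[] local = s3.getObjectAsBytes(GetObjectRequest.builder()
                .bucket(bucket).key(key)
                .range("bytes=" + lho + "-" + (lho + 29)).build()).asByteArray();
        long dataStart = LocalHeaders.dataOffset(local, lho);

        long dataEnd = dataStart + header.getCompressedSize() - 1;
        byte[] compressed = s3.getObjectAsBytes(GetObjectRequest.builder()
                .bucket(bucket).key(key)
                .range("bytes=" + dataStart + "-" + dataEnd).build()).asByteArray();

        return RawInflate.inflateEntry(compressed, (int) header.getUncompressedSize());
    }
}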
