I am trying to pull specific items out of massive zip files in S3 without downloading the entire file.
A Python solution here: Read ZIP files from S3 without downloading the entire file appears to work. The equivalent underlying capabilities in Java seem less lenient in general, so I’ve had to make various adjustments.
In the attached code you can see that I’m successfully getting the central directory and writing it to a temp file that Java’s ZipFile can use to iterate the zip entries from the CD.
However, I’m stuck on inflating an individual entry. The current code throws a bad header exception. Do I need to give the Inflater the local file header plus the compressed content, or just the compressed content? I’ve tried both, but I’m clearly either not using the Inflater correctly and/or not giving it what it expects.
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.zip.Inflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;

public class S3ZipTest {

    private AmazonS3 s3;

    public S3ZipTest(String bucket, String key) throws Exception {
        s3 = getClient();
        ObjectMetadata metadata = s3.getObjectMetadata(bucket, key);
        runTest(bucket, key, metadata.getContentLength());
    }

    private void runTest(String bucket, String key, long size) throws Exception {
        // fetch the last 22 bytes (end-of-central-directory record; assuming the comment field is empty)
        long start = size - 22;
        GetObjectRequest req = new GetObjectRequest(bucket, key).withRange(start);
        System.out.println("eocd start: " + start);

        // fetch the end-of-central-directory record
        S3Object s3Object = s3.getObject(req);
        byte[] eocd = IOUtils.toByteArray(s3Object.getObjectContent());

        // get the start offset and size of the central directory
        int cdSize = byteArrayToLeInt(Arrays.copyOfRange(eocd, 12, 16));
        int cdStart = byteArrayToLeInt(Arrays.copyOfRange(eocd, 16, 20));
        System.out.println("cdStart: " + cdStart);
        System.out.println("cdSize: " + cdSize);

        // get the full central directory
        req = new GetObjectRequest(bucket, key).withRange(cdStart, cdStart + cdSize - 1);
        s3Object = s3.getObject(req);
        byte[] cd = IOUtils.toByteArray(s3Object.getObjectContent());

        // write the full central directory + EOCD:
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // write cd
        out.write(cd);

        // write eocd, resetting the cd start to 0 since that is
        // where it will appear in our new temp file
        byte[] b = leIntToByteArray(0);
        eocd[16] = b[0];
        eocd[17] = b[1];
        eocd[18] = b[2];
        eocd[19] = b[3];
        out.write(eocd);

        out.flush();
        byte[] cdbytes = out.toByteArray();

        // here we are writing the CD + EOCD to a temp file.
        // ZipFile can read the entries from this file.
        // ZipInputStream and commons-compress will not; they seem upset that the data isn't actually here.
        File tempFile = new File("temp.zip");
        FileOutputStream output = new FileOutputStream(tempFile);
        output.write(cdbytes);
        output.flush();
        output.close();

        ZipFile zipFile = new ZipFile(tempFile);
        Enumeration<? extends ZipEntry> zipEntries = zipFile.entries();
        long offset = 0;
        while (zipEntries.hasMoreElements()) {
            ZipEntry entry = zipEntries.nextElement();
            long fileSize = 0;
            long extra = entry.getExtra() == null ? 0 : entry.getExtra().length;
            offset += 30 + entry.getName().length() + extra;
            if (!entry.isDirectory()) {
                fileSize = entry.getCompressedSize();
                System.out.println(entry.getName() + " offset=" + offset + " size=" + fileSize);
                // not working
                // getEntryContent(bucket, key, offset, fileSize, (int) entry.getSize());
            }
            offset += fileSize;
        }
        zipFile.close();
    }

    private void getEntryContent(String bucket, String key, long offset, long compressedSize, int fullSize) throws Exception {
        // HERE is where things go bad.
        // my guess was that we need to get past the local header for an entry to the actual
        // start of deflated content and then read all the content and pass to the Inflater.
        // this yields java.util.zip.DataFormatException: incorrect header check
        System.out.print("reading " + compressedSize + " bytes starting from offset " + offset);
        GetObjectRequest req = new GetObjectRequest(bucket, key).withRange(offset, offset + compressedSize);
        S3Object s3Object = s3.getObject(req);
        byte[] con = IOUtils.toByteArray(s3Object.getObjectContent());
        Inflater inf = new Inflater();
        inf.setInput(con);
        byte[] inflatedContent = new byte[fullSize];
        int sz = inf.inflate(inflatedContent);
        System.out.println("inflated: " + sz);
        // write inflatedContent to file or whatever...
    }

    public static int byteArrayToLeInt(byte[] b) {
        final ByteBuffer bb = ByteBuffer.wrap(b);
        bb.order(ByteOrder.LITTLE_ENDIAN);
        return bb.getInt();
    }

    public static byte[] leIntToByteArray(int i) {
        final ByteBuffer bb = ByteBuffer.allocate(Integer.SIZE / Byte.SIZE);
        bb.order(ByteOrder.LITTLE_ENDIAN);
        bb.putInt(i);
        return bb.array();
    }

    protected AmazonS3 getClient() {
        AmazonS3 client = AmazonS3ClientBuilder
                .standard()
                .withRegion("us-east-1")
                .build();
        return client;
    }

    public static void main(String[] args) {
        try {
            new S3ZipTest("alexa-public", "test.zip");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Edit
Comparing the Python calculations to the ones in my Java code, I realized that Java is off by 4. entry.getExtra().length may report 24, for example, as does the zipinfo command-line utility for the same entry, while Python reports 28. I don’t fully understand the discrepancy, but the PKWARE spec mentions a “2 byte identifier and a 2 byte data size field” for the extra field, which would account for 4 bytes. At any rate, adding a fudge value of 4 got it working, but I’d like to understand what is happening a little more; adding random fudge values to make things work doesn’t sit well:
offset += 30 + entry.getName().length() + extra + 4;
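The likely explanation, going by the ZIP spec: java.util.zip.ZipFile reads entries from the central directory, so entry.getExtra() returns the central directory's copy of the extra field, while the local file header is allowed to carry a different (often longer) extra field. One way to avoid guessing is to range-read the fixed 30-byte local header itself and use its own length fields (file name length at offset 26, extra field length at offset 28). A rough sketch of that idea; LocalHeaderProbe and dataStart are my own hypothetical names, not part of the code above:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.util.IOUtils;

public class LocalHeaderProbe {

    // Given an entry's local header offset (as recorded in the central
    // directory), fetch the fixed 30-byte local file header and compute
    // where the deflated data actually begins, trusting the local header's
    // own name/extra lengths rather than the central directory's copy.
    public static long dataStart(AmazonS3 s3, String bucket, String key,
                                 long localHeaderOffset) throws Exception {
        GetObjectRequest req = new GetObjectRequest(bucket, key)
                .withRange(localHeaderOffset, localHeaderOffset + 29); // 30 fixed bytes
        byte[] header = IOUtils.toByteArray(s3.getObject(req).getObjectContent());

        ByteBuffer bb = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        int nameLen = bb.getShort(26) & 0xffff;  // file name length (offset 26)
        int extraLen = bb.getShort(28) & 0xffff; // extra field length (offset 28)
        return localHeaderOffset + 30 + nameLen + extraLen;
    }
}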
Answer
My general approach was sound but hindered by the lack of detail returned from Java’s ZipFile. For example, sometimes there is an extra 16 bytes at the end of the compressed data, before the beginning of the next local header; this looks like the optional data descriptor (a 4-byte signature followed by CRC-32, compressed size, and uncompressed size) written after entries that were streamed. Nothing in ZipFile can help with this.
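One more detail that bit the original attempt: the incorrect header check comes from java.util.zip.Inflater defaulting to zlib-wrapped (RFC 1950) input, while ZIP entries store raw deflate (RFC 1951) data. The Inflater has to be constructed in nowrap mode and given only the compressed bytes, not the local header. A minimal sketch (inflateEntry is a hypothetical helper name; assumes a DEFLATE-compressed entry):

import java.util.zip.Inflater;

public class EntryInflater {

    // 'con' must hold exactly the compressed bytes, starting AFTER the local
    // file header (and its name/extra fields); 'fullSize' is the uncompressed size.
    public static byte[] inflateEntry(byte[] con, int fullSize) throws Exception {
        Inflater inf = new Inflater(true); // true = "nowrap": raw deflate, no zlib header
        inf.setInput(con);
        byte[] inflated = new byte[fullSize];
        int n = inf.inflate(inflated);     // n should equal fullSize for a complete entry
        inf.end();
        return inflated;
    }
}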
zip4j appears to be a superior option and provides methods such as:
header.getOffsetLocalHeader()
which removes some of the error-prone calculations.
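Putting the pieces together, here is a rough sketch of the overall flow (zip4j 2.x package names assumed; S3ZipEntryReader and readEntry are my names, not from the original code; temp.zip is the CD+EOCD stub built in the question, whose central directory still records offsets into the original object on S3; assumes zip4j can read headers from the stub the way java.util.zip.ZipFile did):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.Inflater;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.util.IOUtils;

import net.lingala.zip4j.ZipFile;           // zip4j 2.x
import net.lingala.zip4j.model.FileHeader;

public class S3ZipEntryReader {

    // Sketch: look up one entry in the local CD+EOCD stub, then pull and
    // inflate it straight from S3 using the central directory's offset.
    public static byte[] readEntry(AmazonS3 s3, String bucket, String key,
                                   String tempZip, String entryName) throws Exception {
        FileHeader wanted = new ZipFile(tempZip).getFileHeader(entryName);

        // the central directory still records offsets into the original object
        long lho = wanted.getOffsetLocalHeader();

        // fetch the fixed 30-byte local header to learn the local name/extra lengths
        byte[] lh = IOUtils.toByteArray(s3.getObject(
                new GetObjectRequest(bucket, key).withRange(lho, lho + 29)).getObjectContent());
        ByteBuffer bb = ByteBuffer.wrap(lh).order(ByteOrder.LITTLE_ENDIAN);
        long dataStart = lho + 30 + (bb.getShort(26) & 0xffff) + (bb.getShort(28) & 0xffff);

        // fetch exactly the compressed bytes and inflate them as raw deflate
        long csize = wanted.getCompressedSize();
        byte[] con = IOUtils.toByteArray(s3.getObject(
                new GetObjectRequest(bucket, key).withRange(dataStart, dataStart + csize - 1)).getObjectContent());

        Inflater inf = new Inflater(true); // nowrap: ZIP entries hold raw deflate data
        inf.setInput(con);
        byte[] out = new byte[(int) wanted.getUncompressedSize()];
        inf.inflate(out);
        inf.end();
        return out;
    }
}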