How to improve java.util.zip.GZIPInputStream performance to unzip a large .gz file?

I’m trying to unzip a very large .gz file (around 50 MB) in Java and then transfer it to the Hadoop file system. After unzipping, the file grows to 20 GB, and the whole job takes more than 5 minutes.

protected void write(BufferedInputStream bis, Path outputPath, FileSystem hdfs) throws IOException
{
        // org.apache.hadoop.io.IOUtils copies in 8 KB chunks; try-with-resources
        // makes sure the HDFS output stream is flushed and closed when done.
        try (BufferedOutputStream bos = new BufferedOutputStream(hdfs.create(outputPath))) {
            IOUtils.copyBytes(bis, bos, 8 * 1024);
        }
}

Even with buffered I/O streams, decompressing and transferring the file takes very long.

Is Hadoop making the file transfer slow, or is GZIPInputStream itself the bottleneck?
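
For reference, the BufferedInputStream is built along these lines before being handed to write (a sketch; the exact construction is not shown above, and "input.gz", outputPath, and hdfs are placeholders):

GZIPInputStream gzin = new GZIPInputStream(new FileInputStream("input.gz"));
BufferedInputStream bis = new BufferedInputStream(gzin);
write(bis, outputPath, hdfs);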


Answer

Writing 20 GB will take time. Even if you finish in 300 seconds, you are still writing close to 70 MB per second (20 GB / 300 s ≈ 68 MB/s).

You may simply be hitting the throughput limit of your platform (disk or network).
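
One way to find out which side is slow (this is a diagnostic sketch, not part of the original answer) is to time the decompression on its own, discarding the output: if this alone takes minutes, GZIPInputStream is the limit; if it is fast, HDFS is. Note that GZIPInputStream's internal buffer defaults to only 512 bytes, so passing a larger size to the constructor can help:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

public class DecompressBench {
    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long total = 0;
        // 64 KB buffers throughout; GZIPInputStream's default is only 512 bytes
        try (GZIPInputStream gzin = new GZIPInputStream(
                new BufferedInputStream(new FileInputStream(args[0]), 64 * 1024),
                64 * 1024)) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = gzin.read(buf)) != -1) {
                total += n; // count decompressed bytes, discard the data
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d bytes in %.1f s (%.1f MB/s)%n",
                total, seconds, total / seconds / 1e6);
    }
}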

If you can rewrite your processing code to read the compressed file directly, that may help.
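
For example, if the downstream job only scans the data record by record, it can read straight through the gzip stream and never materialize the 20 GB on disk. A minimal sketch, assuming (an assumption, not stated in the question) that the data is line-oriented text:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class ReadCompressedDirectly {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(
                        new GZIPInputStream(new FileInputStream(args[0]), 64 * 1024),
                        StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // process each record here instead of writing it back out
            }
        }
    }
}

Hadoop can also consume .gz input directly (gzip is not splittable, so a single mapper reads the whole file), which avoids the explicit decompress-and-upload step entirely.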
