I am trying to create a simple Java program that reads and extracts the content of the files inside a zip file. The zip file contains 3 files (txt, pdf, docx). I need to read the contents of all three files, and I am using Apache Tika for this purpose.
Can somebody help me achieve this? This is what I have tried so far, without success:
Code Snippet
public class SampleZipExtract {
    public static void main(String[] args) {
        List<String> tempString = new ArrayList<String>();
        StringBuffer sbf = new StringBuffer();
        File file = new File("C:\\Users\\xxx\\Desktop\\abc.zip");
        InputStream input;
        try {
            input = new FileInputStream(file);
            ZipInputStream zip = new ZipInputStream(input);
            ZipEntry entry = zip.getNextEntry();
            BodyContentHandler textHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            Parser parser = new AutoDetectParser();
            while (entry != null) {
                if (entry.getName().endsWith(".txt") || entry.getName().endsWith(".pdf") || entry.getName().endsWith(".docx")) {
                    System.out.println("entry=" + entry.getName() + " " + entry.getSize());
                    parser.parse(input, textHandler, metadata, new ParseContext());
                    tempString.add(textHandler.toString());
                }
            }
            zip.close();
            input.close();
            for (String text : tempString) {
                System.out.println("Apache Tika - Converted input string : " + text);
                sbf.append(text);
                System.out.println("Final text from all the three files " + sbf.toString());
            }
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (SAXException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (TikaException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
Answer
If you’re wondering how to get the file content from each ZipEntry, it’s actually quite simple. Here’s some sample code:
public static void main(String[] args) throws IOException {
    // try-with-resources closes the archive when the loop is done
    try (ZipFile zipFile = new ZipFile("C:/test.zip")) {
        Enumeration<? extends ZipEntry> entries = zipFile.entries();
        while (entries.hasMoreElements()) {
            ZipEntry entry = entries.nextElement();
            // Each call returns a stream over that one entry's uncompressed data
            InputStream stream = zipFile.getInputStream(entry);
        }
    }
}
Once you have the InputStream, you can read it however you want.
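As one way of reading the stream, here is a minimal sketch that uses only the standard library and collects each entry's bytes as a UTF-8 string. The class name `ZipEntryReader` and the method `readAllAsText` are hypothetical names made up for this illustration, and decoding as UTF-8 is an assumption that is only meaningful for plain-text entries such as the .txt file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper class; the name and method are illustrative only.
final class ZipEntryReader {

    // Maps each entry name to its bytes decoded as UTF-8 (an assumption
    // that only suits plain-text entries), in the order entries() yields them.
    static Map<String, String> readAllAsText(String zipPath) throws IOException {
        Map<String, String> contents = new LinkedHashMap<>();
        try (ZipFile zipFile = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zipFile.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directory entries carry no content
                }
                // Open a separate stream per entry and close it after reading
                try (InputStream stream = zipFile.getInputStream(entry)) {
                    byte[] bytes = stream.readAllBytes(); // requires Java 9+
                    contents.put(entry.getName(), new String(bytes, StandardCharsets.UTF_8));
                }
            }
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        // Pass the archive path as the first command-line argument
        for (Map.Entry<String, String> e : readAllAsText(args[0]).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().length() + " chars");
        }
    }
}
```

For the PDF and DOCX entries, raw UTF-8 decoding will not give readable text. In that case the per-entry `stream` could presumably be passed to the `parser.parse(stream, textHandler, metadata, new ParseContext())` call from the question instead, most likely with a fresh `BodyContentHandler` for each entry so the extracted text of one file does not run into the next.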