Tabula looks like a great tool for extracting tabular data from PDFs. There are plenty of examples of how to call it from the command line or use it from Python, but there doesn’t seem to be any documentation for using it from Java. Does anyone have a worked example?
Note that Tabula does provide source code, but it seems inconsistent between versions. For example, the example on GitHub references a TableExtractor class that does not seem to exist in the JAR.
https://github.com/tabulapdf/tabula-java
Answer
You can use the following code to call Tabula from Java. Hope this helps.
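import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;

import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.RectangularTextContainer;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

// The class name is only a placeholder so that the method compiles on its own.
public class TabulaExample {

    public static void main(String[] args) throws IOException {
        final String FILENAME = "../test.pdf";

        // PDDocument.load(File) is the PDFBox 2.x call; try-with-resources closes the file.
        try (PDDocument pd = PDDocument.load(new File(FILENAME))) {
            int totalPages = pd.getNumberOfPages();
            System.out.println("Total Pages in Document: " + totalPages);

            ObjectExtractor oe = new ObjectExtractor(pd);
            SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();

            // Page numbers are 1-based; this reads only the first page.
            Page page = oe.extract(1);

            // Detect the tables on the page, then print the text of every cell.
            List<Table> tables = sea.extract(page);
            for (Table table : tables) {
                List<List<RectangularTextContainer>> rows = table.getRows();
                for (List<RectangularTextContainer> cells : rows) {
                    for (RectangularTextContainer cell : cells) {
                        System.out.print(cell.getText() + "|");
                    }
                    // End the line after each row so the output keeps the table shape.
                    System.out.println();
                }
            }
        }
    }
}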
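The code above prints each cell's text followed by "|", which breaks down if a cell itself contains the delimiter. If the goal is CSV output, a minimal sketch of the escaping step might look like the following. It works on plain strings so it does not depend on any particular Tabula version. The class name CsvRows and its methods are hypothetical helpers, not part of Tabula.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper (not part of Tabula): turns the text of one table row into a CSV line.
final class CsvRows {
    private CsvRows() {}

    // Quote a cell if it contains a comma, a double quote, or a line break,
    // doubling any embedded double quotes. A null cell becomes an empty field.
    static String escape(String cell) {
        String text = cell == null ? "" : cell;
        boolean needsQuotes = text.contains(",") || text.contains("\"")
                || text.contains("\n") || text.contains("\r");
        if (!needsQuotes) {
            return text;
        }
        return "\"" + text.replace("\"", "\"\"") + "\"";
    }

    // Join the escaped cells of one row with commas.
    static String toCsvLine(List<String> cells) {
        StringJoiner line = new StringJoiner(",");
        for (String cell : cells) {
            line.add(escape(cell));
        }
        return line.toString();
    }
}
```

To use it with the answer's loop, collect the getText() value of each cell in a row into a List<String> and print CsvRows.toCsvLine(...) once per row instead of printing cell by cell.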