catch PDFBox warnings when loading erroneous PDFs

When loading a PDF with PDFBox, warnings are logged if the PDF is erroneous:

    PDDocument doc = PDDocument.load (new File (filename));

For example, this could lead to the following output on the console:

    Dez 08, 2020 9:14:41 AM org.apache.pdfbox.pdfparser.COSParser validateStreamLength
    WARNING: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 3141, length: 1674, expected end position: 4815

Obviously, the PDF has some errors in the content stream, but it does load into doc. Is it possible to catch these warnings programmatically with PDFBox? Are there properties that report the warnings after the document has been loaded?

I’ve tried PDFBox-Preflight, but that checks for PDF/A compliance, which produces many more messages.

Answer

Try the non-lenient mode of the parser. This code is from the ShowSignature.java example:

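    import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;

    RandomAccessBufferedFileInputStream raFile = new RandomAccessBufferedFileInputStream(file);
    // If your files are not too large, you can also download the PDF into a byte array
    // with IOUtils.toByteArray() and pass a RandomAccessBuffer() object to the
    // PDFParser constructor.
    PDFParser parser = new PDFParser(raFile);
    // Strict parsing: do not silently work around errors in the file
    parser.setLenient(false);
    parser.parse();
    PDDocument document = parser.getPDDocument();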
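
If the goal is to collect the warnings themselves rather than to fail on them, another option is to listen on the logger. The console output in the question looks like the default java.util.logging format, so the sketch below assumes PDFBox's Commons Logging calls end up in java.util.logging (this is not the case if log4j or another backend is configured). PdfWarningCollector is a hypothetical helper, not part of PDFBox, and the warning in main is simulated so the example runs without PDFBox on the classpath:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical helper (not part of PDFBox): keeps every log record
// of level WARNING or higher that is published to it.
class PdfWarningCollector extends Handler {
    private final List<LogRecord> warnings =
            Collections.synchronizedList(new ArrayList<>());

    @Override
    public void publish(LogRecord record) {
        if (record != null && record.getLevel().intValue() >= Level.WARNING.intValue()) {
            warnings.add(record);
        }
    }

    @Override
    public void flush() {
        // nothing is buffered
    }

    @Override
    public void close() {
        // no resources to release
    }

    public List<LogRecord> getWarnings() {
        return warnings;
    }

    public static void main(String[] args) {
        // Assumption: PDFBox log output is routed to java.util.logging.
        // Handlers on "org.apache.pdfbox" also receive records from child
        // loggers such as org.apache.pdfbox.pdfparser.COSParser.
        Logger pdfboxLogger = Logger.getLogger("org.apache.pdfbox");
        PdfWarningCollector collector = new PdfWarningCollector();
        pdfboxLogger.addHandler(collector);
        try {
            // Real use: PDDocument doc = PDDocument.load(new File(filename));
            // Simulated here so the sketch runs without PDFBox:
            Logger.getLogger("org.apache.pdfbox.pdfparser.COSParser")
                  .warning("simulated parser warning");
        } finally {
            pdfboxLogger.removeHandler(collector);
        }
        for (LogRecord r : collector.getWarnings()) {
            System.out.println(r.getLoggerName() + ": " + r.getMessage());
        }
    }
}
```

After loading, collector.getWarnings() being non-empty tells you the parser had to work around something. The handler is removed in the finally block so that warnings from later, unrelated loads are not mixed in.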


Source: stackoverflow