When loading a PDF with PDFBox, warnings are written to the log if the PDF is malformed:
PDDocument doc = PDDocument.load (new File (filename));
For example, this could lead to the following output on the console:
Dez 08, 2020 9:14:41 AM org.apache.pdfbox.pdfparser.COSParser validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 3141, length: 1674, expected end position: 4815
Obviously, the PDF has some errors in the content stream, but it does load into doc. Is it possible to catch these warnings programmatically with PDFBox? Are there any properties that report the warnings after the document has been loaded?
I’ve tried PDFBox-Preflight, but that checks for PDF/A compliance, which produces many more messages.
Try the non-lenient mode of the parser. This code is from the ShowSignature.java example:
RandomAccessBufferedFileInputStream raFile = new RandomAccessBufferedFileInputStream(file);
// If your files are not too large, you can also download the PDF into a byte array
// with IOUtils.toByteArray() and pass a RandomAccessBuffer() object to the
// PDFParser constructor.
PDFParser parser = new PDFParser(raFile);
parser.setLenient(false);
parser.parse();
PDDocument document = parser.getPDDocument();
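If you would rather keep the lenient parser and just collect the warnings, one possible approach is to attach your own handler to the logger. The following is only a sketch: it assumes that PDFBox's logging ends up in java.util.logging (which the format of the console output above suggests) and that the logger names start with org.apache.pdfbox. WarningCollector is a hypothetical helper name, not part of PDFBox. If Log4j or another backend is on the classpath, this handler will not see anything.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical helper (not part of PDFBox): collects WARNING-or-higher records
// from all loggers below "org.apache.pdfbox" while it is attached.
// Assumption: the logging calls are routed to java.util.logging.
class WarningCollector extends Handler implements AutoCloseable {
    // Keep a strong reference so the logger (and our handler) is not garbage collected.
    private final Logger logger = Logger.getLogger("org.apache.pdfbox");
    private final List<String> warnings = new ArrayList<>();

    WarningCollector() {
        // Child loggers such as org.apache.pdfbox.pdfparser.COSParser
        // pass their records on to the handlers of this parent logger.
        logger.addHandler(this);
    }

    @Override
    public synchronized void publish(LogRecord record) {
        if (record.getLevel().intValue() >= Level.WARNING.intValue()) {
            warnings.add(record.getLoggerName() + ": " + record.getMessage());
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered, so there is nothing to flush.
    }

    @Override
    public void close() {
        // Detach so that later loads are not recorded by this collector.
        logger.removeHandler(this);
    }

    // Returns a copy of the warnings collected so far.
    synchronized List<String> getWarnings() {
        return new ArrayList<>(warnings);
    }
}
```

You would create the collector in a try-with-resources block, call PDDocument.load() inside it, and then check whether getWarnings() is empty. Note that the handler is global to the JVM, so warnings from documents loaded at the same time in other threads would end up in the same list.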