Tag: pdf-parsing

Apache PDFBox – vertical match between image and text position

I need help mapping text to image objects in a PDF document. As the first figure shows, each of my PDF documents has 3 images placed at arbitrary positions in the y-direction. To the left of each image is text that extends along the height of that image. My goal is to combine the texts into “ImObj” objects (see the
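One way to approach the vertical matching described in this question is to compare each text line's y-coordinate with the vertical extent of each image. The following is only a minimal sketch of that idea, not PDFBox code. The names `VerticalMatcher`, `ImageBox`, and `TextLine` are hypothetical. It assumes the image bounds and text positions have already been extracted and converted to the same coordinate system, with y increasing downward.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VerticalMatcher {

    /** Hypothetical holder for an image's vertical extent (assumes top <= bottom). */
    static final class ImageBox {
        final String name;
        final double top;
        final double bottom;

        ImageBox(String name, double top, double bottom) {
            this.name = name;
            this.top = top;
            this.bottom = bottom;
        }
    }

    /** Hypothetical holder for one text line and its y-coordinate. */
    static final class TextLine {
        final String text;
        final double y;

        TextLine(String text, double y) {
            this.text = text;
            this.y = y;
        }
    }

    /**
     * Assigns each text line to the first image whose vertical range contains
     * the line's y-coordinate. Lines that fall outside every image are skipped.
     */
    static Map<String, List<String>> match(List<ImageBox> images, List<TextLine> lines) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (ImageBox image : images) {
            result.put(image.name, new ArrayList<>());
        }
        for (TextLine line : lines) {
            for (ImageBox image : images) {
                if (line.y >= image.top && line.y <= image.bottom) {
                    result.get(image.name).add(line.text);
                    break; // one image per line
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<ImageBox> images = new ArrayList<>();
        images.add(new ImageBox("image1", 100, 200));
        images.add(new ImageBox("image2", 300, 400));

        List<TextLine> lines = new ArrayList<>();
        lines.add(new TextLine("first caption line", 120));
        lines.add(new TextLine("second caption line", 180));
        lines.add(new TextLine("other caption", 350));

        System.out.println(match(images, lines));
    }
}
```

If the text baseline sits slightly above or below the image edge, a small tolerance added to `top` and `bottom` may be needed. That tolerance is a tuning choice and depends on the documents.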

Extract all text with string positions from a PDF

java pdf-parsing pdfbox

This may seem like an old question, but I didn’t find an exhaustive answer after spending half an hour searching SO. I am using PDFBox and would like to extract all of the text from a PDF file along with the coordinates of each string. I am using their PrintTextLocations example (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/examples/util/PrintTextLocations.html) but with the kind of PDF