I need help mapping text to image objects in a PDF document. As the first figure shows, my PDF documents contain 3 images placed at arbitrary positions along the y-direction. To the left of the images is text, which runs along the height of each image. My goal is to combine the texts into “ImObj” objects (see the
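One possible way to approach the grouping described in this excerpt, once text and image positions have already been extracted by some PDF library, is to assign each text line to the image whose vertical extent contains it. The following is only a minimal sketch under stated assumptions: `ImageTextGrouper`, `ImageBox`, `TextLine`, and `group` are hypothetical names that do not come from the question or from any PDF library, the extraction step itself is not shown, and y is assumed to grow downward so that `top <= bottom`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ImageTextGrouper {

    /** Hypothetical holder for an image's vertical extent (assumes top <= bottom). */
    public static final class ImageBox {
        final String name;
        final double top;
        final double bottom;

        public ImageBox(String name, double top, double bottom) {
            this.name = name;
            this.top = top;
            this.bottom = bottom;
        }
    }

    /** Hypothetical holder for one extracted text line and its y position. */
    public static final class TextLine {
        final String text;
        final double y;

        public TextLine(String text, double y) {
            this.text = text;
            this.y = y;
        }
    }

    /**
     * Assigns each text line to the first image whose [top, bottom] range
     * contains the line's y position. Lines outside every image are skipped.
     */
    public static Map<String, List<String>> group(List<ImageBox> images, List<TextLine> lines) {
        // LinkedHashMap keeps the images in the order they were passed in.
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (ImageBox image : images) {
            result.put(image.name, new ArrayList<>());
        }
        for (TextLine line : lines) {
            for (ImageBox image : images) {
                if (line.y >= image.top && line.y <= image.bottom) {
                    result.get(image.name).add(line.text);
                    break; // a line belongs to at most one image
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up coordinates, purely for illustration.
        List<ImageBox> images = Arrays.asList(
                new ImageBox("img1", 100, 200),
                new ImageBox("img2", 300, 420));
        List<TextLine> lines = Arrays.asList(
                new TextLine("first label", 110),
                new TextLine("second label", 190),
                new TextLine("third label", 350),
                new TextLine("page footer", 800));
        System.out.println(group(images, lines));
    }
}
```

If the library reports y from the bottom of the page instead (as PDF user space does by default), the same containment check works as long as `top` and `bottom` are swapped so that `top <= bottom` still holds.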
Tag: pdf-parsing
Extract all text with string positions from a PDF
This may seem like an old question, but I didn’t find an exhaustive answer after spending half an hour searching all over SO. I am using PDFBox and I would like to extract all of the text from a PDF file along with the coordinates of each string. I am using their PrintTextLocations example (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/examples/util/PrintTextLocations.html) but with the kind of PDF