Skip to content

Extract all text with string positions from a PDF

This may seem an old question, but I didn’t find an exhaustive answer after spending half an hour searching all over SO.

I am using PDFBox and I would like to extract all of the text from a PDF file along with the coordinates of each string. I am using their PrintTextLocations example (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/examples/util/PrintTextLocations.html) but with the kind of pdf I am using (E-Tickets) the program fails to recognize strings, printing each character separately. The output is a list of strings (each representing a TextPosition object) like this:

String[414.93896,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.0] s
String[418.93896,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.447998] a
String[423.38696,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=1.776001] l
String[425.16296,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.447998] e

While I would like the program to recognize the string “sale” as an unique TextPosition and give me its position. I also tried to play with the setSpacingTolerance() and setAverageCharacterTolerance() PDFTextStripper methods, setting different values above and under the standard values (which FYI are 0.5 and 0.3 respectively), but the output didn’t change at all. Where am I going wrong? Thanks in advance.

Answer

As Joey mentioned, PDF is just a collection of instructions telling you where a certain character should be printed.

In order to extract words or lines, you will have to perform some data segmentation: studying the bounding boxes of the characters should let you recognize those that are on a same line and then which one form words.