
Extract Checkbox value out of PDF 1.7 using PDFBox

I recently started working with PDFBox to extract text from PDFs. Along with the text, I also need to extract the checkbox values shown in the image below. I have tried different methods to find the checkbox elements and extract their values.

Checkboximage

After inspecting the PDF content with this tool, I found that the checkbox is not an image but vector graphics drawn by the content below.

JavaScript

I am not sure how to extract this from the PDF. I have seen the different parsers PDFBox provides, but it looks like I need more information about how the PDF is constructed. Any pointers would be much appreciated.


Answer

In a comment you confirm that

all check boxes and check marks are drawn identically

in your input documents.

To extract the check boxes and their check state from your documents, you can therefore search the page content for exactly those instruction sequences that draw the boxes and the marks in them, as in the example document.

How Boxes And Check Marks Are Drawn

As you already found out, each box is drawn by filling four paths, one per edge (top, right, bottom, left), like this in the case of the “yes” box for question 1:

JavaScript

Inspecting all the boxes in the document you can see that their drawing instructions follow this pattern:

JavaScript

Here A and C are the left and right x coordinates of the box and B and D are the top and bottom y coordinates thereof.
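As a rough illustration of how the four edge paths relate to A, B, C and D, the following sketch merges four thin edge rectangles into the outline of one box. It uses only java.awt.geom; the class name BoxFromEdges and the numbers in the usage note are hypothetical and not taken from the example PDF.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

/** Hypothetical helper: combines four thin edge rectangles into one check box outline. */
class BoxFromEdges {
    /** Returns the union of the four edges; its min/max x and y correspond to A, C, D and B. */
    static Rectangle2D outline(List<Rectangle2D> edges) {
        if (edges.size() != 4) {
            throw new IllegalArgumentException("expected top, right, bottom and left edge");
        }
        // Start with a copy of the first edge and grow it to cover the others.
        Rectangle2D result = (Rectangle2D) edges.get(0).clone();
        for (Rectangle2D edge : edges.subList(1, 4)) {
            result.add(edge);
        }
        return result;
    }
}
```

For example, with made-up edges of thickness 0.5 around a 10 by 10 square whose lower left corner is (100, 200), the outline spans x from 100 to 110 and y from 200 to 210.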

Similarly, each check mark is drawn by filling two paths (its left and right halves), like this in the case of the mark in the “yes” box for question 1:

JavaScript

Inspecting all the check marks in the document you can see that their drawing instructions follow this pattern:

JavaScript

The first line transforms the coordinate system by rotating it by 45° around some point; this allows the check mark to be drawn using mostly horizontal and vertical lines.

In this rotated coordinate system, (A,B) are the coordinates of the top left corner of the longer check mark arm and (A,C) are those of the uppermost point of the line where the two arms of the check mark join.
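To see why the rotation helps, here is a small sketch that maps a point from a coordinate system rotated by 45° around a centre back to page coordinates. The rotation direction (counterclockwise in a y-up system) and the class name RotatedSystem are assumptions for illustration; the actual matrix must be read from the content stream.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical helper illustrating a 45 degree rotation around a centre point. */
class RotatedSystem {
    /** Maps (x, y) given in the rotated system to page coordinates. */
    static Point2D toPage(double x, double y, double cx, double cy) {
        // Assumption: positive angle, i.e. the x axis turns towards the y axis.
        AffineTransform rotation =
                AffineTransform.getRotateInstance(Math.toRadians(45), cx, cy);
        return rotation.transform(new Point2D.Double(x, y), null);
    }
}
```

A horizontal segment in the rotated system, say from (0, 0) to (10, 0) with the centre at the origin, ends up as a diagonal on the page, because both page coordinates of its end point are 10·cos 45°. That is how plain horizontal and vertical lines can form the slanted arms of a check mark.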

How to Search for Those Instruction Sequences

A related task has been implemented in the PdfBoxFinder class in this answer, a class that collects lines drawn as thin, long rectangles forming a grid.

Thus, we can use the same foundation, the PDFBox PDFGraphicsStreamEngine class, in our case. We merely have to look at different kinds of paths (built by move-to and line-to instructions, not by rectangle instructions) and of course process the paths differently (instead of recognizing a grid, we must recognize our specific check boxes and check marks).
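The general idea of collecting such paths can be sketched without PDFBox on the classpath. The class below is only a stand-in: it does not extend PDFGraphicsStreamEngine but mirrors the names of four of its callbacks (moveTo, lineTo, closePath, fillPath, plus endPath), and it assumes every path consists of a single subpath. The name PathCollector is hypothetical.

```java
import java.awt.geom.Point2D;
import java.util.ArrayList;
import java.util.List;

/** Stand-in for a PDFGraphicsStreamEngine subclass; it only mirrors a few callback names. */
class PathCollector {
    private final List<Point2D> current = new ArrayList<>();
    private final List<List<Point2D>> filledPaths = new ArrayList<>();

    /** Starts a new path (assumption: one subpath per path). */
    void moveTo(float x, float y) {
        current.clear();
        current.add(new Point2D.Float(x, y));
    }

    /** Adds a corner to the path under construction. */
    void lineTo(float x, float y) {
        current.add(new Point2D.Float(x, y));
    }

    /** Closing the path adds no new corner, so nothing is recorded. */
    void closePath() {
    }

    /** A filled path is a candidate for a box edge or a check mark half. */
    void fillPath(int windingRule) {
        filledPaths.add(new ArrayList<>(current));
        current.clear();
    }

    /** A path ended without painting is discarded. */
    void endPath() {
        current.clear();
    }

    List<List<Point2D>> getFilledPaths() {
        return filledPaths;
    }
}
```

After a page has been processed, the collected corner lists can be compared against the known box and check mark patterns.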

Such a check box finder class can be implemented like this:

JavaScript

(PdfCheckBoxFinder)

You can use the PdfCheckBoxFinder like this to find the check boxes of a document and their checked states:

JavaScript

(ExtractCheckBoxes test testExtractFromUpdatedForm)
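How a checked state can be derived from geometry alone may be sketched as follows. This is not the PdfCheckBoxFinder logic itself; it assumes each check mark has already been reduced to one representative point in page coordinates, and the class name CheckState is hypothetical.

```java
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.List;

/** Hypothetical helper: a box counts as checked if some mark point lies inside it. */
class CheckState {
    static boolean isChecked(Rectangle2D box, List<Point2D> markPoints) {
        for (Point2D point : markPoints) {
            if (box.contains(point)) {
                return true;
            }
        }
        return false;
    }
}
```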

For your example PDF one gets

JavaScript

(The coordinates are in the natural coordinate system given by the crop box of the PDF page in question. To relate to coordinates from the PDFTextStripper a transformation into the proprietary coordinate system of the text stripper may be necessary.)
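Such a transformation might look like the sketch below. It rests on assumptions that need checking against your PDFBox version and documents: the page is not rotated, the box coordinates are measured from the lower left corner of the crop box with y pointing up, and the text stripper measures y downwards from the top edge of the crop box. The class name TextStripperCoordinates is hypothetical.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper: flips a rectangle from a y-up to a y-down coordinate system. */
class TextStripperCoordinates {
    /**
     * Assumes an unrotated page whose crop box has the given height
     * and whose lower left corner is the origin of the input box.
     */
    static Rectangle2D flipVertically(Rectangle2D box, double cropBoxHeight) {
        return new Rectangle2D.Double(
                box.getX(),
                cropBoxHeight - box.getMaxY(), // top edge becomes the new y
                box.getWidth(),
                box.getHeight());
    }
}
```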

Beware, though: as said at the start, the code above only works for check boxes and check marks built exactly as in your example PDF. You confirmed that this would be the case, but you may well be surprised by documents that deviate from it.

If you actually encounter a (very!) few variations thereof, you can add PathType entries matching all of them and enhance getBoxes accordingly to recognize all those variations.

If you happen to come across more than only a few variations, you should go for OCR.

How to Combine the Check Boxes With Text Extraction

In a comment you proposed

is there a possibility if I can remove the graphics and replace it with some text for an example ‘C’ or ‘N’ then I can do text extraction of the newly generated pdf

Indeed, one can simply add textual marks for checked and unchecked check boxes to the page and then apply text extraction to get the text including the marks. I would propose, though, to use Dingbats like ✔ and ✗. This can be done like this:

JavaScript

(ExtractCheckBoxes test testExtractInlinedInTextFromUpdatedForm)

For your example PDF one gets

JavaScript
User contributions licensed under: CC BY-SA