Extract Checkbox value out of PDF 1.7 using PDFBox

Tags: , ,



I have recently started working with pdfbox to extract text out of pdf. Though along with text I also need to extract checkbox value show in image. I have tried different methods to find the checkbox element and extract its values.

Checkboximage

After researching the pdf text through this tool I found that the checkbox is not image or anything but some kind of graphics represented by below content.

ET
Q
q
BT
/F2 6 Tf
481.3 653.29 Td
(  ) Tj
ET
Q
q
1 1 1 rg
484.3 653.29 9 9 re
f
Q
q
0.87059 0.87059 0.87059 rg
485.05 661.54 m
492.55 661.54 l
493.3 662.29 l
484.3 662.29 l
485.05 661.54 l
f
Q
q
0.87059 0.87059 0.87059 rg
492.55 661.54 m
492.55 654.04 l
493.3 653.29 l
493.3 662.29 l
492.55 661.54 l
f
Q
q
0.87059 0.87059 0.87059 rg
492.55 654.04 m
485.05 654.04 l
484.3 653.29 l
493.3 653.29 l
492.55 654.04 l
f
Q
q
0.87059 0.87059 0.87059 rg
485.05 654.04 m
485.05 661.54 l
484.3 662.29 l
484.3 653.29 l
485.05 654.04 l
f
Q
q
BT
/F2 6 Tf
495.55 653.29 Td
(Yes) Tj
ET
Q
q
BT
/F2 6 Tf
504.88 653.29 Td
(  ) Tj
ET
Q
q
1 1 1 rg
507.88 653.29 9 9 re
f
Q
q
0.87059 0.87059 0.87059 rg
508.63 661.54 m
516.13 661.54 l
516.88 662.29 l
507.88 662.29 l
508.63 661.54 l
f
Q
q
0.87059 0.87059 0.87059 rg
516.13 661.54 m
516.13 654.04 l
516.88 653.29 l
516.88 662.29 l
516.13 661.54 l
f
Q
q
0.87059 0.87059 0.87059 rg
516.13 654.04 m
508.63 654.04 l
507.88 653.29 l
516.88 653.29 l
516.13 654.04 l
f
Q
q
0.87059 0.87059 0.87059 rg
508.63 654.04 m
508.63 661.54 l
507.88 662.29 l
507.88 653.29 l
508.63 654.04 l
f
Q
q
BT
/F2 6 Tf
519.13 653.29 Td
(No) Tj
ET
Q
q
BT
/F2 6 Tf
36.75 642.95 Td

I am not sure how to extract this out of pdf, I have seen different parser provided by pdfbox but it looks like I need to have more information about how pdf is constructed. Any pointers would be much more appreciated.

Answer

In a comment you confirm that

all check boxes and check marks are drawn identically

in your input documents.

To extract the check boxes and their check state from your document, therefore, you can search the page content exactly for instruction sequences drawing the boxes and marks therein like in the example document.

How Boxes And Check Marks Are Drawn

As you already found out, the boxes are drawn by filling one path for each edge (top, right, bottom, left) respectively like this in case of the “yes” box for question 1:

485.05 661.54 m
492.55 661.54 l
493.3 662.29 l
484.3 662.29 l
485.05 661.54 l
f
...
492.55 661.54 m
492.55 654.04 l
493.3 653.29 l
493.3 662.29 l
492.55 661.54 l
f
...
492.55 654.04 m
485.05 654.04 l
484.3 653.29 l
493.3 653.29 l
492.55 654.04 l
f
...
485.05 654.04 m
485.05 661.54 l
484.3 662.29 l
484.3 653.29 l
485.05 654.04 l
f

Inspecting all the boxes in the document you can see that their drawing instructions follow this pattern:

A B m
(A+7.5) B l
(A+8.25) (B+0.75) l
(A-0.75) (B+0.75) l
A B l
f
...
C B m
C (B-7.5) l
(C+0.75) (B-8.25) l
(C+0.75) (B+0.75) l
C B l
f
...
C D m
(C-7.5) D l
(C-8.25) (D-0.75) l
(C+0.75) (D-0.75) l
C D l
f
...
A D m
A (D+7.5) l
(A-0.75) (D+8.25) l
(A-0.75) (D-0.75) l
A D l
f 

Here A and C are the left and right x coordinates of the box and B and D are the top and bottom y coordinates thereof.

Similarly the check marks are drawn by filling two paths (left and right half) respectively like this in case of the mark in the “yes” box for question 1:

0.70711 -0.70711 0.70711 0.70711 -323.79 536.88 cm
...
489.55 661.54 m
489.55 657.79 l
490.3 657.04 l
490.3 661.54 l
489.55 661.54 l
f
...
489.55 657.79 m
488.05 657.79 l
488.05 657.04 l
490.3 657.04 l
489.55 657.79 l
f

Inspecting all the check marks in the document you can see that their drawing instructions follow this pattern:

0.70711 -0.70711 0.70711 0.70711 X Y cm 
...
A B m
A (B-3.75) l
(A+0.75) (B-4.5) l
(A+0.75) B l
A B l
f 
...
A C m
(A-1.5) C l
(A-1.5) (C-0.75) l
(A+0.75) (C-0.75) l
A C l
f 

The first line transforms the coordinate system by rotating it by 45° around some point; this allows to draw the check mark using mostly horizontal and vertical lines.

In this rotated coordinate system (A,B) are the coordinates of the left top corner of the longer check mark arm and (A,C) are those of upmost point of of the line where the two arms of the check mark join.

How to Search for Those Instruction Sequences

A related task has been implemented in the PdfBoxFinder class in this answer, a class that collects lines drawn as thin, long rectangles forming a grid.

Thus, we can use the same foundation, the PDFBox PDFGraphicsStreamEngine class, in our case. We merely have to look at different kinds of paths (built by move-to and line-to instructions, not be rectangle instructions) and of course process the paths differently (instead of recognizing a grid, we must recognize our specific check boxes and check marks).

Such a check box finder class can be implemented like this:

public class PdfCheckBoxFinder extends PDFGraphicsStreamEngine {
    public class CheckBox {
        public Point2D getLowerLeft()   {   return lowerLeft;   }
        public Point2D getUpperRight()  {   return upperRight;  }
        public boolean isChecked()      {   return checked;     }

        CheckBox(Point2D lowerLeft, Point2D upperRight, boolean checked) {
            this.lowerLeft = lowerLeft;
            this.upperRight = upperRight;
            this.checked = checked;
        }

        final Point2D lowerLeft;
        final Point2D upperRight;
        final boolean checked;
    }

    public PdfCheckBoxFinder(PDPage page) {
        super(page);
        for (int i = 0; i < pathAnchorsByType.length; i++)
            pathAnchorsByType[i] = new ArrayList<Point2D>();
    }

    public List<CheckBox> getBoxes() {
        if (checkBoxes.isEmpty()) {
            for (Point2D anchor : pathAnchorsByType[PathType.boxBottom.index]) {
                if (containsApproximatly(pathAnchorsByType[PathType.boxLeft.index], anchor) &&
                        containsApproximatly(pathAnchorsByType[PathType.boxRight.index], anchor) &&
                        containsApproximatly(pathAnchorsByType[PathType.boxTop.index], anchor)) {
                    Point2D upperRight = new Point2D.Float(7.5f + (float)anchor.getX(), 7.5f + (float)anchor.getY());
                    boolean checked = containsInRectangle(pathAnchorsByType[PathType.checkLeft.index], anchor, upperRight) &&
                            containsInRectangle(pathAnchorsByType[PathType.checkRight.index], anchor, upperRight);
                    checkBoxes.add(new CheckBox(anchor, upperRight, checked));
                }
            }
        }
        return Collections.unmodifiableList(checkBoxes);
    }

    boolean containsApproximatly(List<Point2D> points, Point2D anchor) {
        for (Point2D point : points) {
            if (approximatelyEquals(point.getX(), anchor.getX()) && approximatelyEquals(point.getY(), anchor.getY()))
                return true;
        }
        return false;
    }

    boolean containsInRectangle(List<Point2D> points, Point2D lowerLeft, Point2D upperRight) {
        for (Point2D point : points) {
            if (lowerLeft.getX() < point.getX() && point.getX() < upperRight.getX() &&
                    lowerLeft.getY() < point.getY() && point.getY() < upperRight.getY())
                return true;
        }
        return false;
    }

    //
    // PDFGraphicsStreamEngine overrides
    //
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        moveTo((float) p0.getX(), (float) p0.getY());
        path.add(new Rectangle(p0, p1, p2, p3));
    }

    @Override
    public void moveTo(float x, float y) throws IOException {
        currentPoint = new Point2D.Float(x, y);
        currentStartPoint = currentPoint;
    }

    @Override
    public void lineTo(float x, float y) throws IOException {
        Point2D point = new Point2D.Float(x, y);
        path.add(new Line(currentPoint, point));
        currentPoint = point;
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {
        Point2D point1 = new Point2D.Float(x1, y1);
        Point2D point2 = new Point2D.Float(x2, y2);
        Point2D point3 = new Point2D.Float(x3, y3);
        path.add(new Curve(currentPoint, point1, point2, point3));
        currentPoint = point3;
    }

    @Override
    public Point2D getCurrentPoint() throws IOException {
        return currentPoint;
    }

    @Override
    public void closePath() throws IOException {
        path.add(new Line(currentPoint, currentStartPoint));
        currentPoint = currentStartPoint;
    }

    @Override
    public void endPath() throws IOException {
        clearPath();
    }

    @Override
    public void strokePath() throws IOException {
        clearPath();
    }

    @Override
    public void fillPath(int windingRule) throws IOException {
        processPath();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        clearPath();
    }

    @Override public void drawImage(PDImage pdImage) throws IOException { }
    @Override public void clip(int windingRule) throws IOException { }
    @Override public void shadingFill(COSName shadingName) throws IOException { }

    //
    // internal representation of a path
    //
    interface PathElement {
    }

    class Rectangle implements PathElement {
        final Point2D p0, p1, p2, p3;

        Rectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
            this.p0 = p0;
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
        }
    }

    class Line implements PathElement {
        final Point2D p0, p1;

        Line(Point2D p0, Point2D p1) {
            this.p0 = p0;
            this.p1 = p1;
        }
    }

    class Curve implements PathElement {
        final Point2D p0, p1, p2, p3;

        Curve(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
            this.p0 = p0;
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
        }
    }

    Point2D currentPoint = null;
    Point2D currentStartPoint = null;

    void clearPath() {
        path.clear();
        currentPoint = null;
        currentStartPoint = null;
    }

    void processPath() {
        for (PathType pathType : PathType.values()) {
            if (pathType.matches(path)) {
                pathAnchorsByType[pathType.index].add(pathType.getAnchor(path));
            }
        }

        clearPath();
    }

    enum PathType {
        boxTop(new float[] {7.5f, 0f, .75f, .75f, -9f, 0f, .75f, -.75f}, new float[] {0f, -7.5f}, 0),
        boxRight(new float[] {0f, -7.5f, .75f, -.75f, 0f, 9f, -.75f, -.75f}, new float[] {-7.5f, -7.5f}, 1),
        boxBottom(new float[] {-7.5f, 0f, -.75f, -.75f, 9f, 0f, -.75f, .75f}, new float[] {-7.5f, 0f}, 2),
        boxLeft(new float[] {0f, 7.5f, -.75f, .75f, 0f, -9f, .75f, .75f}, new float[] {0f, 0f}, 3),
        checkRight(new float[] {-2.65165f, -2.65165f, 0f, -1.06066f, 3.18198f, 3.18198f, -.53033f, .53033f}, new float[] {-2.65165f, -2.65165f/*-5.1072f, -4.4559f*/}, 4),
        checkLeft(new float[] {-1.06066f, 1.06066f, -.53033f, -.53033f, 1.59099f, -1.59099f, 0f, 1.06066f}, new float[] {0f, 0f/*-2.4556f, -1.8042f*/}, 5)
        ;
        PathType(float[] diffs, float[] offsetToAnchor, int index) {
            this.diffs = diffs;
            this.offsetToAnchor = offsetToAnchor;
            this.index = index;
        }

        boolean matches(List<PathElement> path) {
            if (path != null && path.size() * 2 == diffs.length) {
                for (int i = 0; i < path.size(); i++) {
                    PathElement element = path.get(i);
                    if (!(element instanceof Line))
                        return false;
                    Line line = (Line) element;
                    if (!approximatelyEquals(line.p1.getX() - line.p0.getX(), diffs[i*2]))
                        return false;
                    if (!approximatelyEquals(line.p1.getY() - line.p0.getY(), diffs[i*2+1]))
                        return false;
                }
                return true;
            }
            return false;
        }

        Point2D getAnchor(List<PathElement> path) {
            if (path != null && path.size() > 0) {
                PathElement element = path.get(0);
                if (element instanceof Line) {
                    Line line = (Line) element;
                    Point2D p = line.p0;
                    return new Point2D.Float((float)p.getX() + offsetToAnchor[0], (float)p.getY() + offsetToAnchor[1]);
                }
            }
            return null;
        }

        final float[] diffs;
        final float[] offsetToAnchor;
        final int index;
    }

    static boolean approximatelyEquals(double f, double g) {
        return Math.abs(f - g) < 0.001;
    }

    //
    // members
    //
    final List<PathElement> path = new ArrayList<>();

    final List<Point2D>[] pathAnchorsByType = new List[PathType.values().length];

    final List<CheckBox> checkBoxes = new ArrayList<>(); 
}

(PdfCheckBoxFinder)

You can use the PdfCheckBoxFinder like this to find the check boxes of a document and their checked states:

PDDocument document = ...
for (PDPage page : document.getPages())
{
    PdfCheckBoxFinder finder = new PdfCheckBoxFinder(page);
    finder.processPage(page);
    for (CheckBox checkBox : finder.getBoxes()) {
        Point2D ll = checkBox.getLowerLeft();
        Point2D ur = checkBox.getUpperRight();
        String checked = checkBox.isChecked() ? "checked" : "not checked";
        System.out.printf(Locale.ROOT, "* (%4.3f, %4.3f) - (%4.3f, %4.3f) - %sn", ll.getX(), ll.getY(), ur.getX(), ur.getY(), checked);
    }
}

(ExtractCheckBoxes test testExtractFromUpdatedForm)

For your example PDF one gets

* (485.050, 654.040) - (492.550, 661.540) - checked
* (508.630, 654.040) - (516.130, 661.540) - not checked
* (485.050, 641.760) - (492.550, 649.260) - checked
* (508.630, 641.760) - (516.130, 649.260) - not checked
* (485.050, 629.490) - (492.550, 636.990) - not checked
* (508.630, 629.490) - (516.130, 636.990) - checked
* (485.050, 617.220) - (492.550, 624.720) - checked
* (508.630, 617.220) - (516.130, 624.720) - not checked
* (485.050, 593.700) - (492.550, 601.200) - checked
* (508.630, 593.700) - (516.130, 601.200) - not checked
* (485.050, 581.420) - (492.550, 588.920) - checked
* (508.630, 581.420) - (516.130, 588.920) - not checked
* (485.050, 569.150) - (492.550, 576.650) - checked
* (508.630, 569.150) - (516.130, 576.650) - not checked
* (91.330, 553.500) - (98.830, 561.000) - not checked
* (125.570, 553.500) - (133.070, 561.000) - not checked
* (200.150, 553.500) - (207.650, 561.000) - not checked
* (286.220, 553.500) - (293.720, 561.000) - not checked
* (77.190, 331.430) - (84.690, 338.930) - not checked

(The coordinates are in the natural coordinate system given by the crop box of the PDF page in question. To relate to coordinates from the PDFTextStripper a transformation into the proprietary coordinate system of the text stripper may be necessary.)

Beware, though, as said at the start the code above only works for check boxes and check marks built exactly as in your example PDF. You confirmed that this would be the case but probably you will be surprised.

If you actually encounter a (very!) few variations thereof, you can add PathType entries matching all of them and enhance getBoxes accordingly to recognize all those variations.

If you happen to come across more than only a few variations, you should go for OCR.

How to Combine the Check Boxes With Text Extraction

In a comment you proposed

is there a possibility if I can remove the graphics and replate it with some text for an example C or ‘N’ then I can do text extraction of the newly generated pdf

Indeed, one can simply add textual marks for check and unchecked check boxes to the page and then apply text extraction to get the text including the marks. I would propose, though, to use DingBats like ✔ and ✗. This can be done like this:

PDDocument document = ...;
PDType1Font font = PDType1Font.ZAPF_DINGBATS;
for (PDPage page : document.getPages())
{
    PdfCheckBoxFinder finder = new PdfCheckBoxFinder(page);
    finder.processPage(page);
    for (CheckBox checkBox : finder.getBoxes()) {
        Point2D ll = checkBox.getLowerLeft();
        Point2D ur = checkBox.getUpperRight();
        String checkBoxString = checkBox.isChecked() ? "u2714" : "u2717";
        try (   PDPageContentStream canvas = new PDPageContentStream(document, page, AppendMode.APPEND, false, true)) {
            canvas.beginText();
            canvas.setNonStrokingColor(1, 0, 0);
            canvas.setFont(font, (float)(ur.getY()-ll.getY()));
            canvas.newLineAtOffset((float)ll.getX(), (float)ll.getY());
            canvas.showText(checkBoxString);
            canvas.endText();
        }
    }
}
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
String text = stripper.getText(document);

(ExtractCheckBoxes test testExtractInlinedInTextFromUpdatedForm)

For your example PDF one gets

1. Have you met or discussed with principal life to be assured?   ✔ Yes  ✗ No
2. Is the principal life to be assured an existing bank customer?   ✔ Yes  ✗ No
3. Are you related to the proposed Life to be Assured? If yes, please state your relationship with applicant   ✗ Yes  ✔ No
4. Are you satisfied with the financial standing of the proposed Life to be Assured?   ✔ Yes  ✗ No
   What is the estimated annual income of the Life to be Assured? 600000
...


Source: stackoverflow