I have a task where I have to extract text which are behind images and have been OCR-ed from the image itself. This text is transparent. The problem is there is an image which has text behind it which is not OCR-ed, it is just normal text and it is not transparent. How can I differentiate between the needed (transparent) and the not-needed (non-transparent) text?
Here is a representative pdf file: https://easyupload.io/rbo333 Image OCR text should be extracted on page 2,3,12 but text is also extracted on page 4. On page 4 there is no OCR text behind images, but there is regular text under the image. I need to somehow filter that out as I only need OCR text.
Advertisement
Answer
So the images have in front of them or behind them transparent text. I thought that meant that they have no color, but @mkl said that they might have colors, but they are empty glyphs. The pdf specification also states that they can have color even if they are transparent. To be truly transparent the characters need to be rendered with neither stroking, nor non-stroking colors.
There is a RenderingMode enum in PDFBox, or Fontbox for exactly this purpose and its NEITHER value denotes whether something is transparent. I could extract it with the help of this answer.
The solution code looks like this.
@Override protected void processTextPosition(TextPosition character) { characterRenderingModes.put(character, getGraphicsState().getTextState().getRenderingMode()); super.processTextPosition(character); }
This is an overriden method of the PDFTextStripper class and it goes through every character on the page/s and gets their RenderingModes. After that when needed I get the RenderingModes out of the map based on the characters I needed to examine.