PDFBox: Differentiating between transparent and non-transparent text

Question

I have a task where I have to extract text that sits behind images and has been OCR-ed from the image itself. This text is transparent. The problem is that one image has text behind it that was not OCR-ed; it is just normal text and it is not transparent. How can I differentiate between the needed (…

Accepted Answer

So the images have transparent text in front of them or behind them. I thought that meant the text has no color, but @mkl said that the glyphs might have colors and simply be empty. The PDF specification also states that text can have a color even if it is transparent. To be truly transparent, the characters need to be rendered with neither the stroking nor the non-stroking color.

There is a RenderingMode enum in PDFBox for exactly this purpose, and its NEITHER value denotes that the text is neither filled nor stroked, i.e. it is transparent. I could extract it with the help of this answer.

The solution code looks like this:

    @Override
    protected void processTextPosition(TextPosition character) {
        // Record the rendering mode in effect when this character is drawn.
        characterRenderingModes.put(character, getGraphicsState().getTextState().getRenderingMode());
        super.processTextPosition(character);
    }

This is an overridden method of the PDFTextStripper class. It goes through every character on the page(s) and stores its RenderingMode in the characterRenderingModes map. After that, when needed, I look up the RenderingModes in the map for the characters I need to examine.
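
The lookup step at the end can be sketched without PDFBox on the classpath. Everything named here is a stand-in for illustration only: TextRenderingMode mirrors the constant names of PDFBox's RenderingMode, GlyphStub plays the role of TextPosition, and InvisibleTextFilter.textWithMode is a hypothetical helper. In real code the map would be the one filled by the processTextPosition override.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: picks characters out of the recorded map by rendering mode. */
final class InvisibleTextFilter {
    private InvisibleTextFilter() { }

    /** Concatenates, in insertion order, the text of every glyph recorded with the wanted mode. */
    static String textWithMode(Map<GlyphStub, TextRenderingMode> modes, TextRenderingMode wanted) {
        StringBuilder result = new StringBuilder();
        for (Map.Entry<GlyphStub, TextRenderingMode> entry : modes.entrySet()) {
            if (entry.getValue() == wanted) {
                result.append(entry.getKey().text);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the order in which the characters were recorded.
        Map<GlyphStub, TextRenderingMode> recorded = new LinkedHashMap<>();
        recorded.put(new GlyphStub("O"), TextRenderingMode.NEITHER); // OCR layer, invisible
        recorded.put(new GlyphStub("x"), TextRenderingMode.FILL);    // normal visible text
        recorded.put(new GlyphStub("K"), TextRenderingMode.NEITHER); // OCR layer, invisible

        System.out.println("Transparent text: " + textWithMode(recorded, TextRenderingMode.NEITHER));
        System.out.println("Visible text:     " + textWithMode(recorded, TextRenderingMode.FILL));
    }
}

/** Stand-in for PDFBox's TextPosition (assumption: only the text is needed here). */
final class GlyphStub {
    final String text;

    GlyphStub(String text) {
        this.text = text;
    }
}

/** Stand-in enum; the constant names mirror PDFBox's RenderingMode. */
enum TextRenderingMode {
    FILL, STROKE, FILL_STROKE, NEITHER,
    FILL_CLIP, STROKE_CLIP, FILL_STROKE_CLIP, NEITHER_CLIP
}
```

GlyphStub does not override equals or hashCode, so two glyphs with the same text stay separate map entries, which matters when the same letter appears in both the OCR layer and the normal text.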

