Tag: ocr

PDFBox: Differentiating between transparent and non-transparent text

I have a task where I have to extract text that sits behind images and was OCR-ed from the image itself. This text is transparent. The problem is that one image has text behind it that was not OCR-ed; it is just normal, non-transparent text. How can I differentiate between the needed (transparent)

How to bundle tesseract-ocr with a serverless Java application built for Azure Functions?

apache-tika azure-functions docker java ocr

I am adding Apache Tika, for extracting text from documents and images (with TikaOcr), to an existing service running on Azure Functions on top of App Service. Apache Tika requires tesseract to be installed locally on the machine. To work around that, I ssh-ed into the server and installed it with apt-get, but (from what I understand)

How to convert a PDF to a JSON/EXCEL/WORD file?

excel java ms-word ocr pdf

I need to get data from a PDF file together with its headers so that I can compare it with DB data. I tried pdfbox, Google Vision OCR, and iText, but all of these libraries gave me a row without structure or headers. Example: Date\nNumber\nStatus\n12122020\n442334\ndelivered I will try converting the PDF to Excel/Word and getting the data from there, but for this implementation I

Tess4j – Pdf to Tiff to tesseract – “Warning: Invalid resolution 0 dpi. Using 70 instead.”

java ocr tess4j tesseract

I am using tess4j (net.sourceforge.tess4j:tess4j:4.4.0) and trying to run OCR on PDF files. As I understand it, I first have to convert the PDF to TIFF or PNG (is either of those preferred?), which I did like this: and I get the following warning: Question: Does it have any influence on my scan results? (If not, OK – I can switch off the warning.) Is