How to bundle tesseract-ocr with a serverless Java application built for Azure Functions?

I am adding Apache Tika, to extract text from documents and images (with TikaOcr), to an existing service that runs on Azure Functions on top of App Service. Apache Tika requires Tesseract to be installed locally on the machine. To work around that, I SSH-ed into the server and installed it with apt-get, but (from what I understand) that setup is performed on the base App Service layer. As a result, concurrent OCR invocations slow my functions down considerably. Since there are no official binaries of Tesseract, I was wondering whether any of the following is possible:

  1. Bundle Tesseract with my Functions app.
  2. Build a Docker image with Tesseract.
  3. Build a multi-container Docker app with a Tesseract runtime image (tesseract-shadow/tesseract-ocr-re).

I have tried to build a Docker image with Tesseract (following the instructions from here) using the Dockerfile below, but Apache Tika fails to perform OCR with it.

ARG JAVA_VERSION=11

# This image additionally contains function core tools – useful when using custom extensions
#FROM mcr.microsoft.com/azure-functions/java:3.0-java$JAVA_VERSION-core-tools AS installer-env
FROM mcr.microsoft.com/azure-functions/java:3.0-java$JAVA_VERSION-build AS installer-env

RUN apt-get update && apt-get install -y tesseract-ocr

COPY . /src/functions-tika-extraction
RUN cd /src/functions-tika-extraction && \
    mkdir -p /home/site/wwwroot && \
    mvn clean package && \
    cd ./target/azure-functions/ && \
    cd $(ls -d */|head -n 1) && \
    cp -a . /home/site/wwwroot

# This image is ssh enabled
FROM mcr.microsoft.com/azure-functions/java:3.0-java$JAVA_VERSION-appservice
# This image isn't ssh enabled
#FROM mcr.microsoft.com/azure-functions/java:3.0-java$JAVA_VERSION

ENV AzureWebJobsScriptRoot=/home/site/wwwroot \
    AzureFunctionsJobHost__Logging__Console__IsEnabled=true

COPY --from=installer-env ["/home/site/wwwroot", "/home/site/wwwroot"]

I’m fairly new to Docker and the Azure platform, so I may be missing something here. How can I get my Azure Functions to work with Tesseract, using Docker or any other method?

Answer

After reading through the Docker docs and learning some basics, I finally figured out that Tesseract was in fact installed, but only in the first build stage (installer-env). In a multi-stage build only the final stage ends up in the image, and the only thing copied over from installer-env is /home/site/wwwroot, so the running container could not see Tesseract. It becomes available to Azure Functions when it is installed in the final stage, that is, after the last FROM in the Dockerfile:

ARG JAVA_VERSION=11

FROM mcr.microsoft.com/azure-functions/java:3.0-java$JAVA_VERSION-build AS installer-env

# remove this line
# RUN apt-get update && apt-get install -y tesseract-ocr

COPY . /src/functions-tika-extraction
RUN cd /src/functions-tika-extraction && \
    mkdir -p /home/site/wwwroot && \
    mvn clean package && \
    cd ./target/azure-functions/ && \
    cd $(ls -d */|head -n 1) && \
    cp -a . /home/site/wwwroot

# This image is ssh enabled
FROM mcr.microsoft.com/azure-functions/java:3.0-java$JAVA_VERSION-appservice

# add the line here
RUN apt-get update && apt-get install -y tesseract-ocr

ENV AzureWebJobsScriptRoot=/home/site/wwwroot \
    AzureFunctionsJobHost__Logging__Console__IsEnabled=true

COPY --from=installer-env ["/home/site/wwwroot", "/home/site/wwwroot"]
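
To confirm from inside the running container that the binary is actually reachable, a small check can be run from the function code before handing work to Tika. The following is only a minimal sketch using the JDK alone: the class name TesseractCheck and the method isAvailable are hypothetical, and it assumes the tesseract executable is expected on the PATH of the Functions host process.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class TesseractCheck {

    /**
     * Returns true if "<binary> --version" can be started and exits with
     * code 0 within 10 seconds; false if the binary is missing, fails,
     * or does not finish in time.
     */
    static boolean isAvailable(String binary) {
        try {
            Process process = new ProcessBuilder(binary, "--version")
                    .redirectErrorStream(true)
                    // Discard output so the child cannot block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Thrown when the executable is not found or cannot be run.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract available: " + isAvailable("tesseract"));
    }
}
```

If this reports false inside the container, the package was most likely installed in a build stage rather than in the final image.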

This satisfies my requirement of bundling tesseract-ocr with the Azure Functions Java application, but unfortunately the OCR invocation is still very slow.

User contributions licensed under: CC BY-SA