I am adding Apache Tika to extract text from documents and images (with TikaOCR) to an existing service running on Azure Functions on top of AppService. Apache Tika requires tesseract to be installed locally on the machine. To work around that, I ssh-ed into the server and installed it with apt-get, but (from what I understand) the setup is performed on the base AppService layer. As a result, concurrent OCR invocations slow my functions down considerably. Since there are no official binaries of Tesseract, I was wondering if any of the following is possible:
- Bundle Tesseract with my Functions app
- Build a Docker image with Tesseract
- Build a multi-container Docker app with a Tesseract runtime image (tesseract-shadow/tesseract-ocr-re)
I tried building a Docker image with Tesseract (following the instructions from here) using the following Dockerfile, but Apache Tika fails to perform OCR with it:
ARG JAVA_VERSION=11
# This image additionally contains function core tools - useful when using custom extensions
#FROM mcr.microsoft.com/azure-functions/java:3.0-java$JAVA_VERSION-core-tools AS installer-env
FROM mcr.microsoft.com/azure-functions/java:3.0-java$JAVA_VERSION-build AS installer-env

RUN apt-get update && apt-get install -y tesseract-ocr

COPY . /src/functions-tika-extraction
RUN cd /src/functions-tika-extraction && \
    mkdir -p /home/site/wwwroot && \
    mvn clean package && \
    cd ./target/azure-functions/ && \
    cd $(ls -d */|head -n 1) && \
    cp -a . /home/site/wwwroot

# This image is ssh enabled
FROM mcr.microsoft.com/azure-functions/java:3.0-java$JAVA_VERSION-appservice
# This image isn't ssh enabled
#FROM mcr.microsoft.com/azure-functions/java:3.0-java$JAVA_VERSION

ENV AzureWebJobsScriptRoot=/home/site/wwwroot \
    AzureFunctionsJobHost__Logging__Console__IsEnabled=true

COPY --from=installer-env ["/home/site/wwwroot", "/home/site/wwwroot"]
I’m fairly new to Docker and Azure Platform so I may be missing something here, but how can I get my Azure Functions to work with Tesseract using Docker or any other method?
Answer
After reading through the Docker docs and learning some of the basics, I finally figured out that tesseract was in fact being installed, but only in the build stage (installer-env), below the Azure AppService layer. In a multi-stage build, only what is explicitly copied with COPY --from reaches the final image, so the container that actually runs the functions never had access to tesseract. Tesseract becomes available to Azure Functions when it is installed in the uppermost layer, i.e. in the final stage after the last FROM in the Dockerfile, as follows:
ARG JAVA_VERSION=11
FROM mcr.microsoft.com/azure-functions/java:3.0-java$JAVA_VERSION-build AS installer-env

# remove this line
# RUN apt-get update && apt-get install -y tesseract-ocr

COPY . /src/functions-tika-extraction
RUN cd /src/functions-tika-extraction && \
    mkdir -p /home/site/wwwroot && \
    mvn clean package && \
    cd ./target/azure-functions/ && \
    cd $(ls -d */|head -n 1) && \
    cp -a . /home/site/wwwroot

# This image is ssh enabled
FROM mcr.microsoft.com/azure-functions/java:3.0-java$JAVA_VERSION-appservice

# add the line here
RUN apt-get update && apt-get install -y tesseract-ocr

ENV AzureWebJobsScriptRoot=/home/site/wwwroot \
    AzureFunctionsJobHost__Logging__Console__IsEnabled=true

COPY --from=installer-env ["/home/site/wwwroot", "/home/site/wwwroot"]
While this satisfies my requirement of bundling tesseract-ocr with the Azure Functions Java application, OCR invocation is unfortunately still very slow.
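To confirm that the final image really exposes tesseract to the Java process, one option is a small startup check that tries to run the binary. The following is only a sketch under some assumptions: the class name TesseractCheck and the 10-second timeout are made up for illustration, and it assumes the binary is on the PATH under the name tesseract (which is what the tesseract-ocr apt package normally provides). It does not go through Tika at all; it only checks that the executable can be started.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika or the Azure Functions SDK.
final class TesseractCheck {

    /**
     * Returns true if "<command> --version" can be started and exits with
     * code 0 within the timeout, false otherwise (including when the binary
     * is missing from the PATH).
     */
    static boolean isAvailable(String command) {
        try {
            Process process = new ProcessBuilder(command, "--version")
                    // Merge stderr into stdout and discard both, so the child
                    // can never block on a full output pipe.
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // Assumed timeout; printing a version should return almost immediately.
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Thrown when the executable cannot be found or started.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract available: " + isAvailable("tesseract"));
    }
}
```

Calling TesseractCheck.isAvailable("tesseract") from inside a function (or running the class from an ssh session in the appservice image) should return false for the first Dockerfile and true for the second, since only the second installs the package in the stage that actually runs.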