I have a PDF document that contains images, hyperlinks, words, and many other things.
I want to search for a string in the words only, i.e. images and hyperlinks are excluded. How can I write Java code to do that? Could someone help?
Answer
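You can use the Apache PDFBox library (https://pdfbox.apache.org/download.cgi). Its PDFTextStripper class extracts only the text of the document, so images are not part of what gets searched. Here is an example: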
    import java.io.File;
    import java.io.IOException;
    import java.util.Scanner;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class Main {
        public static void main(String[] args) throws IOException {
            Scanner scan = new Scanner(System.in);
            System.out.println("Type the path of the PDF file: ");
            String PDFdir = scan.nextLine();
            System.out.println("Input the phrase to find: ");
            String phrase = scan.nextLine();

            File file = new File(PDFdir);
            String PDF_content;
            // PDDocument.load is the PDFBox 2.x API; in PDFBox 3.x use
            // org.apache.pdfbox.Loader.loadPDF(file) instead.
            // try-with-resources closes the document even if extraction fails.
            try (PDDocument doc = PDDocument.load(file)) {
                // Extract the plain text of the whole document.
                PDFTextStripper findPhrase = new PDFTextStripper();
                PDF_content = findPhrase.getText(doc);
            }

            // Case-sensitive, literal substring check.
            String result = PDF_content.contains(phrase) ? "Yes" : "No";
            System.out.println(result);
        }
    }
Remember that you will have to download the PDFBox JAR file and add it to your project's classpath.
Edit:
You can also count how many times the phrase occurs in the PDF:
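    // The empty-phrase check avoids an endless loop, since "" is found everywhere.
    if (result.equals("Yes") && !phrase.isEmpty()) {
        int counter = 0;
        // indexOf matches the phrase literally; replaceFirst would treat it
        // as a regular expression and break on characters such as ( or *.
        int index = PDF_content.indexOf(phrase);
        while (index != -1) {
            counter++;
            // Continue after the current match (non-overlapping count).
            index = PDF_content.indexOf(phrase, index + phrase.length());
        }
        System.out.println(counter);
    }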
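One more thing to watch for: the text extracted from a PDF usually contains line breaks wherever the layout wraps, so a phrase that spans two lines can be missed by a plain substring check. Below is a minimal sketch of a more forgiving count, assuming you want case-insensitive matching and want any run of whitespace treated as a single space. The names PhraseCounter, normalize and countOccurrences are my own for illustration and are not part of PDFBox; it uses only the standard library.

```java
import java.util.Locale;

public class PhraseCounter {

    // Collapse every run of whitespace (including line breaks) into one
    // space, trim the ends, and lower-case the result.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // Count non-overlapping occurrences of phrase in text.
    // The phrase is matched literally, not as a regular expression.
    static int countOccurrences(String text, String phrase) {
        String haystack = normalize(text);
        String needle = normalize(phrase);
        if (needle.isEmpty()) {
            return 0; // an empty phrase would otherwise match everywhere
        }
        int count = 0;
        int index = haystack.indexOf(needle);
        while (index != -1) {
            count++;
            index = haystack.indexOf(needle, index + needle.length());
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by PDFTextStripper.getText(doc).
        String sample = "Total cost (USD)\nis listed. The total\ncost (usd) may change.";
        System.out.println(countOccurrences(sample, "total cost (USD)")); // prints 2
    }
}
```

To use it with the code above, pass the extracted text and the phrase to countOccurrences. Whether ignoring case and line breaks is what you want depends on your documents, so treat this as a starting point.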