Java – Multi-threaded crawler with ExecutorService

Question

I'm working on a crawler in Java. I made a single-threaded crawler that visits a single page and fetches all the links on that page. Now I want to make it multi-threaded, but I am facing difficulties. At the very beginning I start with a single page link and crawl through all the links in it, and now I want t…

Accepted Answer

I think the Runnable should handle only the URL-visiting part, which means the Runnable class would look something like this:

    import java.net.URI;
    import java.util.ArrayList;
    import java.util.List;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class MyCrawler implements Runnable {
        private final String url; // kept as a field so run() can pass it on
        private final URI uri;    // parsed form, used for the same-host check

        public MyCrawler(String url) {
            this.url = url;
            this.uri = URI.create(url);
        }

        @Override
        public void run() {
            try {
                VisitPage(url);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        private void VisitPage(String url) {
            List<String> linksOnthisPage = new ArrayList<>();
            if (!url.contains("javascript") && !url.contains("#")) {
                try {
                    // timeout(0) means no timeout
                    Document doc = Jsoup.connect(url).timeout(0).get();
                    Elements linkTags = doc.select("a[href]");
                    for (Element e : linkTags) {
                        String link = e.attr("href");
                        if (!link.contains("#") && !link.contains("javascript") && !link.equals(url)) {
                            if (link.startsWith("http") || link.startsWith("www")) {
                                // absolute link: keep it only if it stays on the same host
                                if (link.contains(uri.getHost())) {
                                    linksOnthisPage.add(link);
                                } else {
                                    System.out.println("SOME OTHER WEBSITE -- " + link);
                                }
                            } else if (link.startsWith("/")) {
                                // root-relative link; this concatenation assumes url ends with "/"
                                link = url + link.substring(1, link.length());
                                linksOnthisPage.add(link);
                            } else {
                                System.out.println("LINK IGNORED DUE TO -- " + url);
                            }
                        } else {
                            System.out.println("LINK IGNORED -- " + url);
                        }
                    }
                    System.out.println("\n\nLinks found in \"" + url + "\" : " + linksOnthisPage.size());
                } catch (Exception e) {
                    System.out.println("EXCEPTION -- " + url);
                    return;
                }
            } else {
                System.out.println("UNWANTED URL -- " + url);
            }
        }
    }

Next, loop over the links and add a job to the executor for each URL (you can do this in your main method or in a new class). The code snippet will look like this:

    for (String url : unvisitedLinks) {
        Runnable worker = new MyCrawler(url);
        executor.execute(worker);
    }
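
One thing the loop does not show is how links found by a worker get back to the executor, how duplicates are skipped, and how the caller knows the crawl is finished. Here is a minimal sketch of one way that could be wired up. It is not from the original answer: `CrawlSketch`, `fetchLinks` and `fakeSite` are hypothetical names, and `fetchLinks` is an in-memory stand-in for the Jsoup-based page visit so the example runs without a network. It assumes Java 9+ (for `Map.of` / `List.of`).

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class CrawlSketch {
    private final ExecutorService executor;
    // Hypothetical hook: given a URL, return the links found on that page.
    private final Function<String, List<String>> fetchLinks;
    // Thread-safe set; add() returns true only for the first thread to add a URL.
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    // Number of tasks submitted but not yet finished.
    private final AtomicInteger pending = new AtomicInteger();
    private final CountDownLatch done = new CountDownLatch(1);

    CrawlSketch(int threads, Function<String, List<String>> fetchLinks) {
        this.executor = Executors.newFixedThreadPool(threads);
        this.fetchLinks = fetchLinks;
    }

    /** Crawls from startUrl and blocks until no tasks remain. */
    Set<String> crawl(String startUrl) {
        submit(startUrl);
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while crawling", e);
        } finally {
            executor.shutdown();
        }
        return visited;
    }

    private void submit(String url) {
        if (!visited.add(url)) {
            return; // already claimed by some other task
        }
        pending.incrementAndGet();
        executor.execute(() -> {
            try {
                // Child tasks are counted before this task is counted as finished,
                // so pending cannot reach zero while work is still outstanding.
                for (String link : fetchLinks.apply(url)) {
                    submit(link);
                }
            } finally {
                if (pending.decrementAndGet() == 0) {
                    done.countDown();
                }
            }
        });
    }

    public static void main(String[] args) {
        // Made-up link graph used in place of real pages.
        Map<String, List<String>> fakeSite = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/b", "/c"),
                "/b", List.of("/"),
                "/c", List.<String>of());
        Set<String> pages =
                new CrawlSketch(4, url -> fakeSite.getOrDefault(url, List.of())).crawl("/");
        System.out.println(new TreeSet<>(pages)); // prints [/, /a, /b, /c]
    }
}
```

In a real crawler you would presumably replace the lambda passed to the constructor with something that fetches the page and returns its filtered links, much like `VisitPage` collects `linksOnthisPage`.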

Answer