How to stop my code from returning UnknownHostException when getting search results of a website?

I have written a Java program that uses the Jsoup library to search for something on “freewebnovel.com” and then print out the search results. It was working consistently around a week ago, but now it throws java.net.UnknownHostException every time I run it. I checked the website to see if there was any change but I couldn’t find anything. I have added a User-Agent and that didn’t really help. I am also curious whether the slash at the end of the link makes a difference.

import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

import org.jsoup.Connection;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TestingFreeWebnovelSearchWithJsoup {
    public static void main(String[] args) {
        try {
            // Read the search term from standard input
            Scanner scan = new Scanner(System.in);
            System.out.println("Type in what you want to search:");
            String searchTerm = scan.nextLine();

            // POST the search term to the site's search endpoint,
            // presenting a desktop Chrome User-Agent string
            Response response = Jsoup.connect("https://freewebnovel.com/search/")
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36")
                    .timeout(10000)
                    .method(Connection.Method.POST)
                    .data("searchkey", searchTerm)
                    .followRedirects(true)
                    .execute();

            Document doc = response.parse();
            Map<String, String> mapCookies = response.cookies();

            // Each result is rendered as a .jpg cover image whose
            // title attribute holds the novel's name
            Elements searchResults = doc.select("img[src$=.jpg]");
            List<String> titles = searchResults.eachAttr("title");
            List<String> images = searchResults.eachAttr("src");

            System.out.println(titles);
        } catch (IOException e) {
            System.out.println("You had an error: " + e);
        }
    }
}

Answer

It was working consistently around a week ago, but now it throws java.net.UnknownHostException every time I run it.

That means that the site’s DNS name was not resolving. That could be because the DNS entry was missing … or stale … or because there was a problem local to you with your BIND configs or upstream DNS server.
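
If you want to rule DNS in or out, you can resolve the host directly from Java before blaming Jsoup. A minimal sketch (the class name DnsCheck is just for illustration; the hostname is the one from your question):

import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic class, not part of the original program
public class DnsCheck {
    public static void main(String[] args) {
        try {
            // Ask the local resolver for the site's address records
            InetAddress[] addresses = InetAddress.getAllByName("freewebnovel.com");
            for (InetAddress address : addresses) {
                System.out.println("Resolved: " + address.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // This is the same failure Jsoup surfaces as UnknownHostException
            System.out.println("DNS lookup failed: " + e.getMessage());
        }
    }
}

If this fails while the site still loads in your browser, look at your local resolver settings rather than your Jsoup code.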

Ah, but the actual error is not UnknownHostException; it is:

You had an error: org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=freewebnovel.com/search

That means the server is saying “Forbidden” to you. Note that this is a 403 (Forbidden), not a 401 (Unauthorized).

Apparently, the server doesn’t want you fetching that page … like that.

(Did you check to see if there is a body in the 403 response? It might contain an error message that is a bit more informative.)
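
By default Jsoup throws HttpStatusException on a 403, but if you set ignoreHttpErrors it will hand you the response so you can inspect the body. A minimal sketch, reusing the request details from your question (the "test" search term and class name are just for illustration):

import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

// Hypothetical diagnostic class, not part of the original program
public class InspectForbiddenResponse {
    public static void main(String[] args) throws IOException {
        // ignoreHttpErrors(true) keeps the response instead of throwing
        // HttpStatusException on a 4xx/5xx status
        Connection.Response response = Jsoup.connect("https://freewebnovel.com/search/")
                .method(Connection.Method.POST)
                .data("searchkey", "test")
                .ignoreHttpErrors(true)
                .execute();

        System.out.println("Status: " + response.statusCode() + " " + response.statusMessage());
        System.out.println(response.body()); // the 403 page's body, if any
    }
}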

I have added a User-Agent and that didn’t really help.

Sites have a variety of ways of detecting people doing web scraping. Spoofing the “User-Agent” header to pretend your client is a browser is one of the easier things for them to detect.

(I don’t think it would be a good idea to provide you with a tutorial on how to scrape sites that don’t want to be scraped.)

I am also curious whether the slash at the end of the link makes a difference.

I doubt it. When I access the site from a browser, the trailing slash makes no difference.


Now I checked the site’s terms and conditions page and it doesn’t mention web scraping. Also the site’s robots.txt seems to say that robots are welcome everywhere.
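
(If you want to check the robots.txt yourself, you can fetch it with the same library; ignoreContentType is needed because it is plain text rather than HTML. A minimal sketch, with the class name just for illustration:)

import java.io.IOException;

import org.jsoup.Jsoup;

// Hypothetical diagnostic class, not part of the original program
public class RobotsCheck {
    public static void main(String[] args) throws IOException {
        // robots.txt is plain text, so tell Jsoup not to insist on HTML
        String robots = Jsoup.connect("https://freewebnovel.com/robots.txt")
                .ignoreContentType(true)
                .execute()
                .body();
        System.out.println(robots);
    }
}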

But the T&C page has spelling errors and the like, so my guess is that it was thrown together in a hurry and may not reflect the site owner’s current wishes on scraping.

There is a contact email for the site, so my advice is to email them, explain what you are doing (and why!), and ask them how to proceed. If they don’t want you scraping their site, they should tell you. (And you should stop trying!)

But note that the fact that this was working before and now gives a 403 could mean they have seen your (and other people’s) scraping activity and are trying to block it. (And haven’t updated the T&Cs yet to make their wishes known.)

User contributions licensed under: CC BY-SA