How to stop my code from returning UnknownHostException when getting search results of a website?

I have written a Java program that uses the Jsoup library to search for something on “freewebnovel.com” and then print out the search results. It was working consistently around a week ago, but now it throws java.net.UnknownHostException every time I run it. I checked the website to see if there was any change but I couldn’t find anything. I have added a User-Agent and that didn’t really help. I am also curious whether the slash at the end of the link makes a difference.

import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

import org.jsoup.Connection;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TestingFreeWebnovelSearchWithJsoup {
    public static void main(String[] args) {
        try {
            // Read the search term from standard input
            Scanner scan = new Scanner(System.in);
            System.out.println("Type in what you want to search:");
            String searchTerm = scan.nextLine();

            // POST the search term to the site's search endpoint,
            // presenting a desktop Chrome User-Agent string
            Response response = Jsoup.connect("https://freewebnovel.com/search/")
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36")
                    .timeout(10000)
                    .method(Connection.Method.POST)
                    .data("searchkey", searchTerm)
                    .followRedirects(true)
                    .execute();

            Document doc = response.parse();
            Map<String, String> mapCookies = response.cookies();

            // Each result is rendered as a .jpg cover image whose
            // title attribute holds the novel's name
            Elements searchResults = doc.select("img[src$=.jpg]");
            List<String> titles = searchResults.eachAttr("title");
            List<String> images = searchResults.eachAttr("src");

            System.out.println(titles);
        } catch (IOException e) {
            System.out.println("You had an error: " + e);
        }
    }
}

Answer

It was working consistently around a week ago, but now it throws java.net.UnknownHostException every time I run it.

That means that the site’s DNS name was not resolving. That could be because the DNS entry was missing … or stale … or because there was a problem local to you with your BIND configs or upstream DNS server.
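
If you want to rule DNS in or out, you can resolve the host directly from Java before blaming Jsoup. A minimal sketch (the class name DnsCheck is just for illustration; the hostname is the one from your question):

import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic class, not part of the original program
public class DnsCheck {
    public static void main(String[] args) {
        try {
            // Ask the local resolver for the site's address records
            InetAddress[] addresses = InetAddress.getAllByName("freewebnovel.com");
            for (InetAddress address : addresses) {
                System.out.println("Resolved: " + address.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // This is the same failure Jsoup surfaces as UnknownHostException
            System.out.println("DNS lookup failed: " + e.getMessage());
        }
    }
}

If this fails while the site still loads in your browser, look at your local resolver settings rather than your Jsoup code.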

Ah, but the actual error is not UnknownHostException; it is:

You had an error: org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=freewebnovel.com/search

That means the server is saying “Forbidden” to you. Note that this is a 403 (Forbidden), not a 401 (Unauthorized).

Apparently, the server doesn’t want you fetching that page … like that.

(Did you check to see if there is a body in the 403 response? It might contain an error message that is a bit more informative.)
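
By default Jsoup throws HttpStatusException on a 403, but if you set ignoreHttpErrors it will hand you the response so you can inspect the body. A minimal sketch, reusing the request details from your question (the "test" search term and class name are just for illustration):

import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;

// Hypothetical diagnostic class, not part of the original program
public class InspectForbiddenResponse {
    public static void main(String[] args) throws IOException {
        // ignoreHttpErrors(true) keeps the response instead of throwing
        // HttpStatusException on a 4xx/5xx status
        Connection.Response response = Jsoup.connect("https://freewebnovel.com/search/")
                .method(Connection.Method.POST)
                .data("searchkey", "test")
                .ignoreHttpErrors(true)
                .execute();

        System.out.println("Status: " + response.statusCode() + " " + response.statusMessage());
        System.out.println(response.body()); // the 403 page's body, if any
    }
}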

I have added a User-Agent and that didn’t really help.

Sites have a variety of ways of detecting people doing web scraping. Spoofing the “User-Agent” header to pretend your client is a browser is one of the easier things for them to detect.

(I don’t think it would be a good idea to provide you with a tutorial on how to scrape sites that don’t want to be scraped.)

I am also curious whether the slash at the end of the link makes a difference.

I doubt it. When I access the site from a browser, the trailing slash makes no difference.


Now I checked the site’s terms and conditions page and it doesn’t mention web scraping. Also the site’s robots.txt seems to say that robots are welcome everywhere.
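
(If you want to check the robots.txt yourself, you can fetch it with the same library; ignoreContentType is needed because it is plain text rather than HTML. A minimal sketch, with the class name just for illustration:)

import java.io.IOException;

import org.jsoup.Jsoup;

// Hypothetical diagnostic class, not part of the original program
public class RobotsCheck {
    public static void main(String[] args) throws IOException {
        // robots.txt is plain text, so tell Jsoup not to insist on HTML
        String robots = Jsoup.connect("https://freewebnovel.com/robots.txt")
                .ignoreContentType(true)
                .execute()
                .body();
        System.out.println(robots);
    }
}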

But the T&C page has spelling errors and the like, so my guess is that it was thrown together in a hurry and may not reflect the site owner’s current wishes on scraping.

There is a contact email for the site, so my advice is to email them, explain what you are doing (and why!), and ask them how to proceed. If they don’t want you scraping their site, they should tell you. (And you should stop trying!)

But note that the fact that this was working before and now gives a 403 could mean they have seen your (and other people’s) scraping activity and are trying to block it. (And haven’t updated the T&Cs yet to make their wishes known.)

User contributions licensed under: CC BY-SA