I need to scrape a website whose content is inserted by Angular, and it has to be done in Java.
I have tried Selenium Webdriver (as I have used Selenium before for scraping less dynamic webpages). But I have no idea how to deal with the Angular part. Apart from the script tags in the head section of the page, there is only one place in the site where there are Angular attributes:
<div data-ng-module="vindeenjob"><div data-ng-view=""></div>
I found this article here, but honestly… I can’t figure it out. It seems like the author is selecting (let’s call them) ‘ng-attributes’ like this:
WebElement theForm = wd.findElement(By.cssSelector("div[ng-controller='UserForm']"));
but fails to explain why he does what he does. In the source code of his demo page, I can’t find anything called ‘UserForm’, so the why remains a mystery.
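For what it is worth, the selector in that line is a plain CSS attribute selector: it matches a div whose ng-controller attribute has exactly the value UserForm. AngularJS accepts several spellings of the same attribute (ng-view, data-ng-view, x-ng-view), while a CSS selector only matches the literal attribute name, so div[ng-view] would not match the data-ng-view markup shown above. Below is a minimal sketch of a helper that builds a selector covering those spellings. The class NgSelectors and its method names are hypothetical and not part of Selenium or Angular; the resulting string is meant to be passed to By.cssSelector.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Selenium or Angular API): builds CSS selectors for Angular attributes. */
public class NgSelectors {

    // Prefix spellings that AngularJS treats as the same attribute: ng-view, data-ng-view, x-ng-view.
    private static final String[] PREFIXES = {"", "data-", "x-"};

    /** Selector for elements of the given tag that carry the attribute, whatever its value. */
    public static String hasAttribute(String tag, String ngAttribute) {
        List<String> parts = new ArrayList<>();
        for (String prefix : PREFIXES) {
            parts.add(tag + "[" + prefix + ngAttribute + "]");
        }
        return String.join(", ", parts);
    }

    /** Selector for elements whose attribute equals the value exactly (value must not contain a single quote). */
    public static String attributeEquals(String tag, String ngAttribute, String value) {
        List<String> parts = new ArrayList<>();
        for (String prefix : PREFIXES) {
            parts.add(tag + "[" + prefix + ngAttribute + "='" + value + "']");
        }
        return String.join(", ", parts);
    }

    public static void main(String[] args) {
        // Prints: div[ng-view], div[data-ng-view], div[x-ng-view]
        System.out.println(hasAttribute("div", "ng-view"));
        System.out.println(attributeEquals("div", "ng-controller", "UserForm"));
    }
}
```

With Selenium on the classpath, the result could be used as wd.findElement(By.cssSelector(NgSelectors.hasAttribute("div", "ng-view"))). Finding the container is only half of the problem, though: the element stays empty until Angular has actually rendered into it.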
Then I tried setting an implicit wait timeout in Selenium, hoping that the page would finish rendering so that I could grab the results after the wait period, like this:
WebDriver webdriver = new HtmlUnitDriver();
webdriver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
webdriver.get("https://www.myurltoscrape.com");
But to no avail. Then there is also this article, which gives some interesting exceptions, such as Cannot set property [HTMLStyleElement].media that has only a getter to all. That basically means there might be something wrong with the JavaScript. However, HtmlUnit does seem to realize that there is JavaScript on the page, which is more than I got before. From searching on the exceptions, I learned that HtmlUnit has an option that should keep script errors from being thrown as exceptions. I set it to false, but I get exceptions anyway. Here is the code:
webClient.getOptions().setThrowExceptionOnScriptError(false);
I would post more code, but basically nothing scrapes the dynamic content and I am pretty sure that it is not the code that is wrong, it merely is not the correct solution yet.
Can I get some help please?
Answer
In the end, I followed Madusudanan’s excellent advice and looked into the PhantomJS / Selenium combination. And there actually is a solution! It’s called PhantomJSDriver.
You can find the Maven dependency here. Here is more info on GhostDriver.
The setup in Maven: I added the following dependencies:
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.41.0</version>
</dependency>
<dependency>
    <groupId>com.github.detro</groupId>
    <artifactId>phantomjsdriver</artifactId>
    <version>1.2.0</version>
</dependency>
It also runs with Selenium version 2.45, which is the latest version at the time of writing. I mention this because some articles I read say that the Phantom driver isn’t compatible with every version of Selenium, but I guess that problem has been addressed in the meantime.
If you are already using a Selenium/PhantomJSDriver combination and you are getting ‘strict javascript errors’ on a certain site, update your version of Selenium. That will fix it.
And here is some sample code:
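import java.util.List;

import org.openqa.selenium.WebElement;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;

public void testPhantomDriver() throws Exception {
    DesiredCapabilities options = new DesiredCapabilities();
    // The website I am scraping uses SSL, but I don't know which version, so accept any
    options.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, new String[] {
        "--ssl-protocol=any"
    });
    PhantomJSDriver driver = new PhantomJSDriver(options);
    driver.get("https://www.mywebsite");

    // Print the text of every element that carries the class "media-title"
    List<WebElement> elements = driver.findElementsByClassName("media-title");
    for (WebElement element : elements) {
        System.out.println(element.getText());
    }
    driver.quit();
}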
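One caveat with the sample: it reads the elements right after driver.get returns, and on an Angular page they may not have been rendered yet at that point. Selenium ships WebDriverWait and ExpectedConditions for this. The sketch below only illustrates the underlying idea (repeat the lookup until it yields something or a timeout passes) in plain Java, so it runs without a browser. The class PollingWait and the method waitFor are hypothetical names, and the lookup in main is a stand-in for a real driver call.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical helper: repeats a lookup until it yields a result or the timeout passes. */
public class PollingWait {

    public static <T> Optional<T> waitFor(Supplier<Optional<T>> lookup, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            Optional<T> result = lookup.get();
            if (result.isPresent()) {
                return result;              // the content has shown up
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty();    // gave up: nothing appeared before the timeout
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // keep the interrupt flag and stop waiting
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for the page: the "element" only shows up on the third lookup.
        AtomicInteger lookups = new AtomicInteger();
        Supplier<Optional<String>> lookup = () -> {
            int n = lookups.incrementAndGet();
            return n >= 3 ? Optional.of("media-title") : Optional.empty();
        };
        Optional<String> found = waitFor(lookup, Duration.ofSeconds(2), Duration.ofMillis(50));
        // Prints: media-title after 3 lookups
        System.out.println(found.orElse("not rendered in time") + " after " + lookups.get() + " lookups");
    }
}
```

In the scraper, the lookup would wrap driver.findElementsByClassName("media-title") and return Optional.empty() while the list is still empty. The timeout and poll interval are arbitrary values for illustration and would need tuning for the real site.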