
HtmlUnit Scraping Xpath from Div

I am trying to scrape the contents of the Google Movies page; I want the name of each theater, its address, and the showtimes. As you can see on the page, each block of that information is inside a div with the class theater, and inside that div are the name, address, and times for that theater.

So I used HtmlUnit to extract a list of the theater divs:

List<HtmlDivision> div = (List<HtmlDivision>) page.getByXPath("//div[@class='theater']");

When I print the contents of the list, I get the expected result:

System.out.println(div.get(0).asText());

Regal Battery Park Stadium 11
102 North End Avenue, New York, NY
1:00‎ ‎4:10‎ ‎7:20‎ ‎10:35pm‎

Now I want to split this information into name, address, and times. The problem is that when I do:

System.out.println("Theater " + div.get(0).getByXPath("//div[@class='name']/a/text()"));

The result is the name of every single theater in the page:

Theater [Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, Regal Battery Park Stadium 11, AMC Village 7, UA Court Street Stadium 12 & RPX, Cobble Hill Cinemas, AMC Loews 19th St. East 6, AMC Newport Centre 11, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, City Cinemas Village East Cinema, AMC Loews 19th St. East 6, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, City Cinemas Village East Cinema, AMC Loews Kips Bay 15, Regal E-Walk Stadium 13 & RPX, Pavilion Cinema, AMC Village 7, UA Court Street Stadium 12 & RPX, AMC Loews 19th St. East 6, AMC Newport Centre 11, AMC Loews 34th Street 14, AMC Loews Kips Bay 15, Regal E-Walk Stadium 13 & RPX, Frank Theatres - South Cove Stadium 12]

How is it possible that I am getting all the theaters when I am calling getByXPath on an object that doesn't even contain that information?


Answer

You need to add a dot (.) at the beginning of the XPath to indicate that it is meant to be relative to the current context element, which in this case is the first div (div.get(0)). Otherwise, the XPath ignores the context element and searches for matching elements starting from the document root:

div.get(0).getByXPath(".//div[@class='name']/a/text()")
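
The difference between the two expressions can be reproduced without HtmlUnit. The following is only a sketch: it uses the JDK's built-in javax.xml.xpath engine instead of HtmlUnit, and the SAMPLE markup, the class name XPathContextDemo, and the helper namesFromFirstTheater are made up for illustration (the real page is far larger). It evaluates both expressions with the first theater div as the context node.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathContextDemo {

    // Hypothetical, simplified markup standing in for the real page.
    static final String SAMPLE =
        "<html><body>"
        + "<div class='theater'><div class='name'><a>Regal Battery Park Stadium 11</a></div></div>"
        + "<div class='theater'><div class='name'><a>Cobble Hill Cinemas</a></div></div>"
        + "</body></html>";

    // Evaluates expr with the FIRST theater div as the context node
    // and returns the text of every node the expression matches.
    static List<String> namesFromFirstTheater(String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(SAMPLE)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Pick the first theater div, analogous to div.get(0) in the question.
        Node first = (Node) xpath.evaluate(
            "(//div[@class='theater'])[1]", doc, XPathConstants.NODE);

        NodeList hits = (NodeList) xpath.evaluate(expr, first, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            out.add(hits.item(i).getNodeValue());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Leading "//" restarts at the document root: both theaters match.
        System.out.println(namesFromFirstTheater("//div[@class='name']/a/text()"));
        // Leading ".//" stays below the context node: only the first theater matches.
        System.out.println(namesFromFirstTheater(".//div[@class='name']/a/text()"));
    }
}
```

With this sample, the first call returns both theater names and the second returns only the name inside the context div. The same rule should apply to any relative lookups for the address and times within each theater div.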
User contributions licensed under: CC BY-SA