How can I efficiently parse HTML with Java?

I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation.

Now I want to separate the two tasks.

I want to use a lightweight HTML parser, because HtmlUnit takes a long time to load a page, fetch its source, and then parse it.

I want to know which HTML parser can parse HTML efficiently. I need:

  1. Speed
  2. Ease of locating any HtmlElement by its “id”, “name”, or “tag type”.

It is fine if the parser doesn’t clean up dirty HTML; I don’t need to clean any HTML source. I just need the easiest way to move across HtmlElements and harvest data from them.

Answer

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

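import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);            // parse the string into a document tree
Elements links = doc.select("a");            // every <a> element (none in this sample HTML)
Element head = doc.select("head").first();   // first element matching the selector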

See the Selector javadoc for more info.
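
To tie this back to the three lookups asked about in the question (by id, by name, and by tag type), here is a minimal sketch. The sample markup, the class name JsoupLookupSketch, and the attribute values are made up for illustration; only the jsoup calls themselves (Jsoup.parse, getElementById, select, getElementsByTag, attr, text) are part of the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupLookupSketch {
    // Hypothetical sample markup, invented for this illustration.
    static final String HTML =
        "<html><body>"
        + "<div id=\"main\"><a href=\"/first\">First</a> <a href=\"/second\">Second</a></div>"
        + "<form><input name=\"email\" value=\"a@example.com\"></form>"
        + "</body></html>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // By id: returns a single Element, or null when no element has that id.
        Element main = doc.getElementById("main");
        if (main != null) {
            System.out.println("Text inside #main: " + main.text());
        }

        // By name attribute: a CSS attribute selector returns every match.
        Elements emailInputs = doc.select("[name=email]");
        for (Element input : emailInputs) {
            System.out.println("email value: " + input.attr("value"));
        }

        // By tag type: every <a> element in the document.
        Elements links = doc.getElementsByTag("a");
        for (Element link : links) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}
```

The same lookups can also be written purely as selectors, e.g. doc.select("#main") for the id and doc.select("a") for the tag, which may be more convenient when combining conditions.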

This is a new project, so any ideas for improvement are very welcome!