How can I efficiently parse HTML with Java?

Question

I do a lot of HTML parsing in my line of work. Up until now, I have been using the HtmlUnit headless browser for both parsing and browser automation. Now I want to separate the two tasks. I want to use a lightweight HTML parser, because with HtmlUnit it takes a long time to first load a page, then get the source and

Accepted Answer

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

    String html = "<html><head><title>First parse</title></head>"
      + "<body><p>Parsed HTML into a doc.</p></body></html>";
    Document doc = Jsoup.parse(html);
    Elements links = doc.select("a");
    Element head = doc.select("head").first();

See the Selector javadoc for more info.

This is a new project, so any ideas for improvement are very welcome!

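To make the selector idea a bit more concrete, here is a minimal sketch that pulls the text and absolute URL out of every link in a page held in a string. It assumes jsoup is on the classpath. The class name `LinkExtractor`, the method `extractLinks`, the sample HTML and the `example.com` base URI are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
public class LinkExtractor {

    // Returns one "text -> absolute URL" entry per <a> element that has an href.
    static List<String> extractLinks(String html, String baseUri) {
        // The base URI lets jsoup resolve relative hrefs such as "/docs".
        Document doc = Jsoup.parse(html, baseUri);

        // CSS attribute selector: only anchors that actually carry an href.
        Elements links = doc.select("a[href]");

        List<String> result = new ArrayList<>();
        for (Element link : links) {
            // The "abs:" prefix asks jsoup for the href resolved against the base URI.
            result.add(link.text() + " -> " + link.attr("abs:href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample document.
        String html = "<html><head><title>Sample</title></head>"
            + "<body><p>See <a href=\"/docs\">the docs</a> and "
            + "<a href=\"https://example.org/faq\">the FAQ</a>.</p></body></html>";

        for (String line : extractLinks(html, "https://example.com/")) {
            System.out.println(line);
        }
    }
}
```

Because this only parses a string, nothing is fetched over the network and no browser engine is started, which is the part of the HtmlUnit workflow the question wants to avoid.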

Answer