How to save a Jsoup Document to an HTML file?

I have used this method to retrieve a webpage into an org.jsoup.nodes.Document object:

myDoc = Jsoup.connect(myURL).ignoreContentType(true).get();

How should I write this object to an HTML file? The methods myDoc.html(), myDoc.text() and myDoc.toString() don’t output all elements of the document.

Some information inside a JavaScript element can be lost when the page is parsed, for example the “timestamp” value in the source of an Instagram media page.

Answer

The missing elements are most likely a result of the normalization Jsoup applies when it parses the response into a Document.

To get the server’s exact output without any normalization, skip the parsing step and read the response body directly:

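// execute() returns the HTTP response without parsing it into a Document
Connection.Response response = Jsoup.connect("PUT_URL_HERE").execute();
// body() returns the unparsed response body as a String
System.out.println(response.body());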
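
The snippet above only prints the body, while the question asks for a file. A minimal sketch of the saving step follows, using only the Java standard library. The class name HtmlSaver, the method save, the sample markup and the file name page.html are hypothetical and chosen for illustration; the jsoup call appears only in a comment, so the example runs without jsoup on the classpath.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HtmlSaver {

    // Hypothetical helper: writes the given markup to a file as UTF-8,
    // creating the file or overwriting it if it already exists.
    public static void save(String html, Path target) throws IOException {
        Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // With jsoup on the classpath, the string would come from the call above:
        //   String html = Jsoup.connect("PUT_URL_HERE").execute().body();
        // A fixed sample string stands in for it here (assumption for the demo).
        String html = "<html><body><script>var timestamp = 1;</script></body></html>";
        save(html, Paths.get("page.html"));
    }
}
```

Because the string is written as-is, script contents such as the timestamp value are not touched. If a byte-for-byte copy of the response matters, Connection.Response also offers bodyAsBytes(), whose result can be passed to Files.write directly instead of going through a String.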
