
Jsoup HtmlToPlainText function adding extra new line

When text that is already plain text is passed to new HtmlToPlainText().getPlainText(), a newline character is added to the result. It looks like Jsoup is applying some formatting and inserting a line break.

HtmlToPlainText htmlToPlainText = new HtmlToPlainText();
htmlToPlainText.getPlainText(Jsoup.parse(inputString));

I tried outputSettings.prettyPrint(false), but it did not help.

Input text can be HTML or plain text.

I want the text returned as is (no extra newline) if it is already plain text.

Input: This is the subject for test cnirnv cniornvo cojrpov nmcofrjpv mcprfjv mpcjfpv pvckfpv jvpfkvp cnirv

Output: This is the subject for test cnirnv cniornvo cojrpov nmcofrjpv mcprfjv mpcjfpv \npvckfpv jvpfkvp cnirv

A newline character is added after mpcjfpv.

We could strip the newline with a string replacement, but I am looking for a way to do this within the library itself.


Answer

HtmlToPlainText resides in the package org.jsoup.examples, which is not included in the library jar file on Maven Central. In other words, this class is not part of the jsoup API and is meant only for demonstration purposes.

If you want to output the plaintext of a parsed document, try something like this instead:

Document doc = Jsoup.parse("This is the subject for test cnirnv cniornvo cojrpov nmcofrjpv mcprfjv mpcjfpv pvckfpv jvpfkvp cnirv");
System.out.println(doc.text());
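
As for why the break lands after mpcjfpv: that is the last word that still fits within 80 characters, which is consistent with the example class wrapping its output at a fixed width. The following is a minimal, self-contained sketch of that kind of wrap, not jsoup's actual code. The names WrapDemo and wrap, and the MAX_WIDTH value of 80, are assumptions made for illustration.

```java
// Illustrative sketch only: a fixed-width word wrap similar in spirit to what
// a plain-text formatter might do. Not taken from the jsoup source.
final class WrapDemo {
    // Assumed wrap column; chosen because it matches the observed break position.
    static final int MAX_WIDTH = 80;

    // Appends words one by one and starts a new line whenever the next word
    // (plus its trailing space) would push the current line past MAX_WIDTH.
    static String wrap(String text) {
        StringBuilder out = new StringBuilder();
        int width = 0; // characters written on the current line so far
        String[] words = text.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            boolean last = (i == words.length - 1);
            String piece = last ? words[i] : words[i] + " ";
            if (width + piece.length() > MAX_WIDTH) {
                out.append("\n"); // wrap and reset the column counter
                width = 0;
            }
            out.append(piece);
            width += piece.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "This is the subject for test cnirnv cniornvo cojrpov "
                + "nmcofrjpv mcprfjv mpcjfpv pvckfpv jvpfkvp cnirv";
        // Show the inserted line break as a visible \n in the printed result.
        System.out.println(wrap(input).replace("\n", "\\n"));
    }
}
```

With the input from the question, the first twelve words plus their trailing spaces take up 79 characters, so the next word no longer fits and the sketch breaks the line right after mpcjfpv. doc.text() does no such wrapping, so the same input comes back on a single line.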
User contributions licensed under: CC BY-SA