How to abbreviate HTML with Java?




A user enters text as HTML in a form, for example:

<p>this is my <strong>blog</strong> post, 
very <i>long</i> and written in <b>HTML</b></p>

I want to output only part of the string (for example, the first 20 characters) without breaking the HTML structure of the user’s input. In this case:

<p>this is my <strong>blog</strong> post, very <i>l</i>...</p>

which renders as

this is my <strong>blog</strong> post, very <i>l</i>...

Is there a Java library able to do this, or a simple method to use?

MyLibrary.abbreviateHTML(string,20) ?

Answer

Since this is not easy to do correctly, I usually strip all tags and then truncate. That gives full control over the size and appearance of the text, and excerpts usually end up in places where you need exactly that control.

Note that you may find my proposal very conservative, and it is not really a proper answer to your question. But most of the time the alternatives are:

  • strip all tags and truncate
  • provide a separate, editor-managed rich-text field that serves as the truncated text. This of course only works in CMS-like systems
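
The first option, stripping all tags and truncating, could look roughly like the sketch below. The class and method names are made up for illustration, and the regex-based tag removal is a simplification: it assumes reasonably well-formed input, does not decode entities such as &amp;amp;, and can be fooled by a '>' inside an attribute value. For untrusted input, an HTML parser such as jsoup is the safer choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not an existing library class.
class PlainTextAbbreviator {
    // Naive tag matcher: anything between '<' and the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Runs of whitespace (including line breaks) are collapsed to one space.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Strips tags, collapses whitespace, and keeps at most maxChars characters. */
    static String stripAndTruncate(String html, int maxChars) {
        String text = TAG.matcher(html).replaceAll("");
        text = WHITESPACE.matcher(text).replaceAll(" ").trim();
        if (text.length() <= maxChars) {
            return text; // short enough, no ellipsis needed
        }
        return text.substring(0, maxChars).trim() + "...";
    }

    public static void main(String[] args) {
        String html = "<p>this is my <strong>blog</strong> post, \n"
                + "very <i>long</i> and written in <b>HTML</b></p>";
        System.out.println(stripAndTruncate(html, 20)); // prints "this is my blog post..."
    }
}
```

Because all markup is gone before the cut, the limit counts only visible characters and there is nothing left to break.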

The reason truncating HTML is hard is that you don’t know how the cut will affect the structure of the HTML. How would you truncate in the middle of a <ul> or, even worse, in the middle of a complex <table>?

So the problem is that HTML contains not only content and styling (bold, italics) but also structure (lists, tables, divs, etc.). A good and safe implementation would therefore strip everything except inline “styling” tags (bold, italics, etc.) and truncate while keeping track of unclosed tags.
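
A minimal sketch of that last idea, under some loud assumptions: the class name, the whitelist of inline tags, and the regex tokenizer are all made up for illustration. It is not a real HTML parser. It ignores comments, drops attributes, and counts an entity as several characters, so it can cut one in half.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an existing library class.
class InlineHtmlAbbreviator {
    // Assumed whitelist of inline "styling" tags; every other tag is dropped.
    private static final Set<String> INLINE = Set.of("b", "i", "em", "strong", "u");
    // Group 1: "/" for a closing tag, group 2: tag name, group 3: a run of text.
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)");

    static String abbreviate(String html, int maxChars) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // stack of unclosed inline tags
        int remaining = maxChars;                // visible characters still allowed
        boolean truncated = false;
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String text = m.group(3);
            if (text != null) {
                if (text.length() > remaining) {
                    out.append(text, 0, remaining); // cut inside this text run
                    truncated = true;
                    break;
                }
                out.append(text);
                remaining -= text.length();
                continue;
            }
            String name = m.group(2).toLowerCase();
            if (!INLINE.contains(name)) {
                continue; // structural tag (p, ul, table, ...): strip it
            }
            if (m.group(1).isEmpty()) {
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (name.equals(open.peek())) {
                open.pop();
                out.append("</").append(name).append('>');
            } // mismatched closing tags are silently ignored
        }
        while (!open.isEmpty()) { // close what is still open, innermost first
            out.append("</").append(open.pop()).append('>');
        }
        return truncated ? out.append("...").toString() : out.toString();
    }
}
```

With the input from the question on a single line, abbreviate(html, 28) gives this is my <strong>blog</strong> post, very <i>l</i>... with the <p> wrapper stripped. Only text counts against the limit, and the stack guarantees that every tag opened before the cut is closed after it.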



Source: stackoverflow