
Replace ASCII codes and HTML tags in Java

How can I achieve the expected results below without using StringEscapeUtils?

public class Main {
    public static void main(String[] args) throws Exception {
      String str = "<p><b>Send FWB <br><br> (if AWB has COU SHC, <br> if ticked , will send FWB)</b></p>";
      str = str.replaceAll("\\<.*?\\>", "");
      System.out.println("After removing HTML Tags: " + str);
    }
}

Current Results:

After removing HTML Tags: Send FWB  (if AWB has COU SHC,  if ticked , will send FWB)

Expected Results:

After removing HTML Tags: Send FWB  if AWB has COU SHC,  if ticked , will send FWB;

Already checked: How to unescape HTML character entities in Java?


PS: This is just a sample example, input may vary.


Answer

Your regex matches HTML tags such as <something>, but it does not match HTML entities. Entities follow a pattern like &.*?; which you are not replacing.

This should solve your problem:

str = str.replaceAll("\\<.*?\\>|&.*?;", "");

If you want to experiment with this in a sandbox, try regxr.com and use (<.*?>)|(&.*?;). The parentheses make the two capturing groups easy to tell apart in the tool and are not needed in your code. Note that the \ does not need to be escaped in that sandbox, but it does in your code, since it sits inside a Java string literal.
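For reference, here is a minimal runnable sketch of the approach above. The class name TagStripper, the method name stripTagsAndEntities, and the sample input with &#40; and &#41; are assumptions made for illustration and are not taken from the question. The pattern drops the backslashes in front of < and > because neither character is special in a Java regex, so it behaves the same as the escaped form.

```java
import java.util.regex.Pattern;

public class TagStripper {
    // Same alternation as in the answer: either an HTML tag,
    // or an entity-like run from '&' up to the next ';'.
    private static final Pattern TAG_OR_ENTITY = Pattern.compile("<.*?>|&.*?;");

    // Hypothetical helper: removes every match of the pattern from the input.
    static String stripTagsAndEntities(String html) {
        return TAG_OR_ENTITY.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumed sample input containing numeric and named entities.
        String str = "<p><b>Send FWB <br><br> &#40;if AWB has COU SHC&#41;&nbsp;</b></p>";
        System.out.println("After removing HTML tags and entities: "
                + stripTagsAndEntities(str));
    }
}
```

One caveat with &.*?; is that it also removes ordinary text between a literal & and a later semicolon, so "Tom & Jerry; done" loses "& Jerry;". If that matters for your input, a stricter entity pattern such as &#?\w+; (written "&#?\\w+;" in a Java string) is one possible alternative. Either way, this deletes entities rather than decoding them.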

User contributions licensed under: CC BY-SA