
To remove Unicode character from String in Java using REGEX

I have an input string like the one below.

String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car. I currently drive a Hyundai Accent and I was looking for something a little bit larger and more comfortable like the Honda Civic. May I know if you have any of the models currently in stock? Thank you! Warm regards Sandra";

I want to remove Unicode characters such as "\u2028" and "\u2019" if they are present in the comment. At runtime I don't know which extra characters will come in. So what is the best way to handle this?

I tried the following, which removes the non-printable characters from the given string:

comment = comment.replaceAll("\\P{Print}", "");

So what is the best way to check whether such Unicode characters are present in the comment and, if they are, remove them, and otherwise just pass the comment on to the target system unchanged?

Can anyone please help me to resolve this?

Answer

You can do this in sequential steps, as shown below:

public static void main(final String[] args) {
    String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car. I currently drive a Hyundai Accent and I was looking for something a little bit larger and more comfortable like the Honda Civic. May I know if you have any of the models currently in stock? Thank you! Warm regards Sandra";

    // remove all non-ASCII characters
    comment = comment.replaceAll("[^\\x00-\\x7F]", "");

    // remove all ASCII control characters except \r, \n and \t
    comment = comment.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");

    // remove non-printable Unicode characters (general category C);
    // note that this category also matches \r, \n and \t
    comment = comment.replaceAll("\\p{C}", "");
    System.out.println(comment);
}
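
The question also asks how to check whether such characters are present and pass the comment through untouched when they are not. A minimal sketch of that is below. It is not the exact three-step sequence above: it folds the idea into one precompiled whitelist pattern. The class name CommentSanitizer and the methods needsCleaning and sanitize are hypothetical, and keeping tab, carriage return and line feed is an assumption you may want to change.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class CommentSanitizer {

    // Assumption: only printable ASCII (0x20-0x7E) plus tab, CR and LF are allowed.
    // Everything else (e.g. U+2028, U+2019, control characters) counts as unwanted.
    private static final Pattern UNWANTED =
            Pattern.compile("[^\\x20-\\x7E\\t\\r\\n]");

    private CommentSanitizer() {
    }

    // True if the comment contains at least one character that would be removed.
    static boolean needsCleaning(String comment) {
        return UNWANTED.matcher(comment).find();
    }

    // Returns the same String instance when nothing has to be removed,
    // otherwise a copy with every unwanted character stripped.
    static String sanitize(String comment) {
        if (comment == null || !needsCleaning(comment)) {
            return comment;
        }
        return UNWANTED.matcher(comment).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "Good morning! \u2028\u2028I\u2019m outgrowing my current car.";
        // prints: Good morning! Im outgrowing my current car.
        System.out.println(sanitize(dirty));
    }
}
```

Compiling the pattern once avoids recompiling the regex on every call, which String.replaceAll does internally. Note that stripping U+2019 turns "I’m" into "Im"; if that matters, the character would have to be replaced rather than removed.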
User contributions licensed under: CC BY-SA