How to remove Arabic hashtags?

I am polling tweets from Twitter using Twitter4j and trying to filter the hashtags out of them. After extracting the text of each tweet, I convert it to a string. For example, I have this string: “892698363371638784:RT @hikids_ksa: اللعبة خطيرة مرا ويبي لها مخ و تفكير و مهارة👌🏻💡 متوفرة في #متجر_هاي_كيدز_الالكتروني ..”

Using Java, I want to remove متجر_هاي_كيدز_الالكتروني because it is marked with a hashtag (#).

The problem is that my code didn’t work on this input: “@kaskasomar هيدا بلا مخ متل متل غيرو بيخون الشعب اللبناني وبيتهمو بالارهاب بس لان رأيو بيختلف عن رأي الاخرين #سخيف”

The part سخيف wasn’t removed for some reason. This is my method:

static String removeHashtags(String in)
{
    in = in.replaceAll("#[A-Za-z]+","");//remove English hashtags
    in = in.replaceAll("[أ-ي]#+","");//remove Arabic hashtags that have # before it
    return in = in.replaceAll("#[أ-ي]+","");//remove Arabic hashtags that have # after it
}

Answer

If you just want to remove all hashtags in any language, you can write:

in = in.replaceAll("#\\p{IsAlphabetic}+", "");

If you specifically want to remove Arabic hashtags, you can write:

in = in.replaceAll("#\\p{IsArabic}+", "");

With a Unicode property class you don’t have to worry about building a regular expression with left-to-right and right-to-left parts, which also makes your code more readable.
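One caveat: the underscore is neither alphabetic nor part of the Arabic script, so a letters-only pattern stops at the first `_` and would leave most of a tag like #متجر_هاي_كيدز_الالكتروني behind. Below is a minimal sketch of one way to handle that, assuming hashtags may contain letters, combining marks, digits, and underscores. The class name `HashtagRemover` and the method `removeArabicHashtags` are illustrative names, not part of the question or of Twitter4j.

```java
import java.util.regex.Pattern;

class HashtagRemover {
    // '#' followed by letters, combining marks, digits, or underscores of any script.
    // (Assumption: this is the set of characters a hashtag may contain.)
    private static final Pattern ANY_HASHTAG =
            Pattern.compile("#[\\p{L}\\p{M}\\p{N}_]+");

    // '#' followed by one Arabic-script character, then any run of
    // Arabic-script characters or underscores.
    private static final Pattern ARABIC_HASHTAG =
            Pattern.compile("#\\p{IsArabic}[\\p{IsArabic}_]*");

    // Removes hashtags in any language.
    static String removeHashtags(String in) {
        return ANY_HASHTAG.matcher(in).replaceAll("");
    }

    // Removes only hashtags that start with an Arabic-script character.
    static String removeArabicHashtags(String in) {
        return ARABIC_HASHTAG.matcher(in).replaceAll("");
    }

    public static void main(String[] args) {
        String tweet = "متوفرة في #متجر_هاي_كيدز_الالكتروني ..";
        System.out.println(removeHashtags(tweet));
        System.out.println(removeArabicHashtags("hello #java #سخيف"));
    }
}
```

Compiling each pattern once as a constant avoids recompiling it on every call, which `String.replaceAll` would do. The whitespace around a removed tag is left in place, so you may want to collapse double spaces afterwards.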
