I am polling tweets from Twitter using Twitter4j, and I am trying to strip the hashtags from them. After extracting the text of each tweet as a String, I end up with something like this: “892698363371638784:RT @hikids_ksa: اللعبة خطيرة مرا ويبي لها مخ و تفكير و مهارة👌🏻💡 متوفرة في #متجر_هاي_كيدز_الالكتروني ..”
Using Java, I want to remove متجر_هاي_كيدز_الالكتروني, because it is a hashtag.
The problem is that my code didn’t work on this input: “@kaskasomar هيدا بلا مخ متل متل غيرو بيخون الشعب اللبناني وبيتهمو بالارهاب بس لان رأيو بيختلف عن رأي الاخرين #سخيف”
The part سخيف wasn’t removed for some reason. This is my method:
static String removeHashtags(String in) {
    in = in.replaceAll("#[A-Za-z]+", ""); // remove English hashtags
    in = in.replaceAll("[أ-ي]#+", ""); // remove Arabic hashtags that have # before it
    return in = in.replaceAll("#[أ-ي]+", ""); // remove Arabic hashtags that have # after it
}
Answer
If you’re just trying to remove all hashtags in any language, you can write
in = in.replaceAll("#\\p{IsAlphabetic}+", "");
If you specifically want to remove Arabic hashtags, you can write
in = in.replaceAll("#\\p{IsArabic}+", "");
so you don’t have to worry about building a regular expression with left-to-right and right-to-left parts. This improves the readability of your code.
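One thing to watch for: an underscore is neither alphabetic nor part of the Arabic script, so both patterns stop at the first "_" and would leave "_هاي_كيدز_الالكتروني" behind in the first tweet from the question. Here is a minimal sketch of how the patterns could be wrapped up and extended. The class name HashtagStripper and the method names are made up for illustration, and adding "_" and decimal digits (\p{Nd}) to the character class is my own assumption about what should count as part of a hashtag, not something Twitter4j or Twitter defines.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
public class HashtagStripper {

    // '#' followed by one or more alphabetic characters of any script.
    private static final Pattern ANY_SCRIPT =
            Pattern.compile("#\\p{IsAlphabetic}+");

    // Assumption: underscores and decimal digits also belong to the hashtag.
    private static final Pattern ANY_SCRIPT_WITH_UNDERSCORES =
            Pattern.compile("#[\\p{IsAlphabetic}\\p{Nd}_]+");

    // Arabic-script hashtags only, with the same underscore assumption.
    private static final Pattern ARABIC_WITH_UNDERSCORES =
            Pattern.compile("#[\\p{IsArabic}_]+");

    static String removeHashtags(String in) {
        return ANY_SCRIPT.matcher(in).replaceAll("");
    }

    static String removeHashtagsWithUnderscores(String in) {
        return ANY_SCRIPT_WITH_UNDERSCORES.matcher(in).replaceAll("");
    }

    static String removeArabicHashtags(String in) {
        return ARABIC_WITH_UNDERSCORES.matcher(in).replaceAll("");
    }

    public static void main(String[] args) {
        String simple = "بيختلف عن رأي الاخرين #سخيف";
        String compound = "متوفرة في #متجر_هاي_كيدز_الالكتروني ..";

        // The plain pattern removes the single-word hashtag entirely...
        System.out.println(removeHashtags(simple));
        // ...but stops at the first underscore of the compound one.
        System.out.println(removeHashtags(compound));
        // The extended pattern consumes the whole compound hashtag.
        System.out.println(removeHashtagsWithUnderscores(compound));
        // The Arabic-only variant leaves hashtags in other scripts alone.
        System.out.println(removeArabicHashtags("hello #world #سخيف"));
    }
}
```

Compiling the patterns once as constants avoids recompiling the regular expression on every call, which matters if you are processing a stream of tweets. Note that removing a hashtag leaves the surrounding whitespace in place, so you may want to trim or collapse spaces afterwards.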