I already have the sentences in an RDD and the output looks like:
RT @DougJ7777: If Britain wins #Eurovision then we have to rejoin the EU. It’s in the rules. #Eurovision2018 RT @Mystificus: Of course I’ll watch #eurovision tonight. After all, 200 million people can’t be wrong, can they? Er… RT @KlNGNEUER: Me when Europeans make fun of Eurovision VS Me when Americans make fun of Eurovision
#Eurovision #EuroSemi2 Eurovision song contest 2018 tonight!!!!!! Saturday chills with bae, hands up who’s not watching Eurovision… @AndrewDawes71 @SuzanneEvans1 @ConstantinStHe1 The tweet was directed at citizens of other countries partaking in t… Looking forward to @Eurovision @bbceurovision tonight and rooting for @surieofficial who has strong competition. Sh… RT @Jem_Collins: Media and journalism friends, I need you to do something during #Eurovision this evening. And that something is to drink a… Getting ready for anime AND Eurovision with friends tonight!
But when I try to split it by “.” and “,” I only get an empty text file using this code:
JavaRDD<String> sentences = lines.flatMap(
    line -> Arrays.asList(line.split(".")).iterator());
JavaRDD<String> words = sentences.flatMap(
    line -> Arrays.asList(line.split(" ")).iterator());
Where lines is an RDD with the content of the screenshot.
After that, how can I construct the bigrams?
REPRODUCIBLE EXAMPLE:
SparkConf conf = new SparkConf().setAppName("BiGramsApp");
JavaSparkContext sparkContext = new JavaSparkContext(conf);

JavaRDD<String> inputFile = sparkContext.textFile(input);
JavaRDD<String> sentences = inputFile.flatMap(
    line -> Arrays.asList(line.split(".")).iterator());
JavaRDD<String> words = sentences.flatMap(
    line -> Arrays.asList(line.split(" ")).iterator());

words.saveAsTextFile(outputDir);
The input file can be a .txt with any sentences, but you can try it with the strings shown at the beginning.
Answer
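The fix for the split is to wrap the delimiter in a character class: "[.]" or "[ ]". String.split takes a regular expression, and an unescaped "." matches any character, so every character of the line is treated as a delimiter. Only empty strings are left, and split discards trailing empty strings, which is why the output is empty. To split on both "." and ",", use "[.,]".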
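For the bigram part of the question, here is a minimal sketch in plain Java (no Spark dependency, so it can be run on its own). The class name BigramSketch and the helpers splitSentences and bigrams are hypothetical names chosen for this illustration, and pairing each word with its successor inside one sentence is only one possible definition of a bigram.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BigramSketch {

    // Split a line on a literal dot or comma. Inside a character class
    // the dot loses its regex meaning of "any character".
    public static List<String> splitSentences(String line) {
        List<String> out = new ArrayList<>();
        for (String part : line.split("[.,]")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);   // skip empty fragments between delimiters
            }
        }
        return out;
    }

    // Pair each word with the word that follows it in the same sentence.
    public static List<String> bigrams(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(words[i] + " " + words[i + 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "If Britain wins then we rejoin. It is in the rules.";

        // The original bug: "." as a regex matches every character.
        System.out.println(Arrays.toString(line.split(".")));   // prints []

        // The corrected split, followed by bigrams per sentence.
        for (String sentence : splitSentences(line)) {
            System.out.println(bigrams(sentence));
        }
    }
}
```

In the Spark job, the same helpers could probably be plugged into the existing flatMap calls, for example lines.flatMap(l -> splitSentences(l).iterator()) followed by flatMap(s -> bigrams(s).iterator()), assuming the same Iterator-returning flatMap signature that the question's code already relies on.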