Bigrams in Spark using Java

I already have the sentences in an RDD, and the output looks like this:

RT @DougJ7777: If Britain wins #Eurovision then we have to rejoin the EU. It’s in the rules. #Eurovision2018 RT @Mystificus: Of course I’ll watch #eurovision tonight. After all, 200 million people can’t be wrong, can they? Er…🍊🔫… RT @KlNGNEUER: Me when Europeans make fun of Eurovision VS Me when Americans make fun of Eurovision

#Eurovision #EuroSemi2 Eurovision song contest 2018 tonight!!!!!! Saturday chills with bae, hands up who’s not watching Eurovision… @AndrewDawes71 @SuzanneEvans1 @ConstantinStHe1 The tweet was directed at citizens of other countries partaking in t… Looking forward to @Eurovision @bbceurovision tonight and rooting for @surieofficial who has strong competition. Sh… RT @Jem_Collins: Media and journalism friends, I need you to do something during #Eurovision this evening. And that something is to drink a… Getting ready for anime AND Eurovision with friends tonight! 😄

But when I try to split it by “.” and “,” I only get an empty text file with this code:

JavaRDD<String> sentences= lines.flatMap( line -> Arrays.asList(line.split(".")).iterator());
JavaRDD<String> words = sentences.flatMap( line -> Arrays.asList(line.split(" ")).iterator());

Where lines is an RDD with the content of the screenshot.

After that, how can I construct the bigrams?

REPRODUCIBLE EXAMPLE:

SparkConf conf = new SparkConf().setAppName("BiGramsApp");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
JavaRDD<String> inputFile = sparkContext.textFile(input);
JavaRDD<String> sentences = inputFile.flatMap(  line -> Arrays.asList(line.split(".")).iterator());
JavaRDD<String> words = sentences.flatMap( line -> Arrays.asList(line.split(" ")).iterator());
    
words.saveAsTextFile(outputDir);

The input file is a .txt file containing any sentences; you can try it with the strings shown at the beginning.

Answer

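String.split takes a regular expression, and in a regex an unescaped "." matches any character. Every character of the line is therefore treated as a delimiter, only empty strings are left, and split discards them, so the output is empty. To split on a literal period, put it in a character class, "[.]", or escape it as "\\."; to split on both periods and commas, use "[.,]". A single space is not a regex metacharacter, so " " and "[ ]" behave the same.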
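
For the bigram part of the question, here is a minimal sketch in plain Java, assuming a bigram means two adjacent whitespace-separated words within one sentence. The class BigramSketch and its methods splitSentences and bigrams are hypothetical names introduced only for illustration; they are not part of Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of Spark or of the original code.
class BigramSketch {

    // Split a line on literal '.' or ','. Inside the character class "[.,]"
    // the dot is an ordinary character, not the "any character" wildcard.
    static List<String> splitSentences(String line) {
        return Arrays.stream(line.split("[.,]"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())   // drop empty pieces, e.g. from ".."
                .collect(Collectors.toList());
    }

    // Pair each word with its successor. Assumption: words are separated
    // by one or more whitespace characters.
    static List<String> bigrams(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1]);
        }
        return result;
    }
}
```

With the RDD from the question, these helpers could probably be plugged in as lines.flatMap(l -> BigramSketch.splitSentences(l).iterator()) followed by flatMap(s -> BigramSketch.bigrams(s).iterator()), which matches the Iterator-returning flatMap style already used above. Building bigrams per sentence, rather than from the flat words RDD, keeps pairs from spanning a sentence boundary.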

User contributions licensed under: CC BY-SA