
Spark Dataset Foreach function does not iterate

Context

I want to iterate over a Spark Dataset and update a HashMap for each row.

Here is the code I have:

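The snippet itself did not survive the page capture; based on the description, the setup was roughly along these lines (all names, types, and the data source are guesses, not the original code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Vector;

import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Driver-side map whose inner Vectors should receive one element per row
Map<String, Vector<String>> results = new HashMap<>();
results.put("values", new Vector<>());

Dataset<Row> dataset = ...; // the Dataset itself is not empty

// The cast selects the Java-friendly foreach(ForeachFunction<Row>) overload
dataset.foreach((ForeachFunction<Row>) row -> {
    // Never reached, according to the tests below
    results.get("values").add(row.getString(0));
});
```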

Issue

My issue is that the foreach never iterates: the lambda is never executed, and I don’t know why.
I implemented it as described here: How to traverse/iterate a Dataset in Spark Java?

In the end, all the inner Vectors remain empty (as they were initialized) even though the Dataset is not (see the first comments in the code sample).

I know that the foreach never iterates because I ran two tests:

  • Added an AtomicInteger to count the iterations, incrementing it at the very beginning of the lambda with incrementAndGet(). => The counter is still 0 at the end of the process.
  • Printed a debug message at the very beginning of the lambda. => The message is never displayed.
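For reference, the two probes can be sketched as follows (dataset is assumed to be an existing Dataset<Row>; this is a fragment, not the original code). One caveat worth knowing: even when the lambda does run, Spark serializes the closure before executing it, so the AtomicInteger the executors increment is a deserialized copy, and the driver-side counter would read 0 either way:

```java
AtomicInteger counter = new AtomicInteger(0);

dataset.foreach((ForeachFunction<Row>) row -> {
    counter.incrementAndGet();               // probe 1: count iterations
    System.out.println("inside the lambda"); // probe 2: debug message
});

// On the driver this prints 0 even if the lambda ran, because the
// closure (and the counter captured inside it) was serialized and
// executed on a copy.
System.out.println(counter.get());
```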

I’m not very familiar with Java (even less with Java lambdas), so maybe I missed an important point, but I can’t find what it is.


Answer

I am probably a little old school, but I have never liked lambdas much, as they can get pretty complicated.

Here is a full example of a foreach():

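A minimal, self-contained version of such an application might look like this (the file path, app name, and the RowProcessor class name are illustrative; the exact code is in the repository linked at the end of the answer):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ForeachApp {

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("foreach() example")
        .master("local[*]")
        .getOrCreate();

    // Read a CSV file with a header row (path is illustrative)
    Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", true)
        .load("data/books.csv");

    // Hand each row to a dedicated processor class
    df.foreach(new RowProcessor());

    spark.stop();
  }
}
```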

As you can see, this example reads a CSV file and prints a message built from the data. It is fairly simple.

The foreach() is passed a new instance of a class, where the work is done.

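That wiring is a single call, assuming a processor class named RowProcessor (an illustrative name):

```java
// Pass a new instance of the worker class to foreach();
// the class must implement ForeachFunction<Row>, which is Serializable
df.foreach(new RowProcessor());
```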

The work is done in the call() method of the class:

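A sketch of such a class, assuming again the illustrative name RowProcessor and a CSV with at least one column:

```java
import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Row;

public class RowProcessor implements ForeachFunction<Row> {
  private static final long serialVersionUID = 1L;

  @Override
  public void call(Row row) throws Exception {
    // Print a message built from the row's columns
    System.out.println("Processing: " + row.mkString(", "));
  }
}
```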

As you are new to Java, make sure you have the right signatures (for classes and methods) and the right imports.

You can also clone the example from https://github.com/jgperrin/net.jgp.labs.spark/tree/master/src/main/java/net/jgp/labs/spark/l240_foreach/l000. This should help you with foreach().

User contributions licensed under: CC BY-SA