I have a DataFrame with multiple columns, e.g.
root
 |-- playerName
 |-- country
 |-- bowlingAvg
 |-- bowlingSR
 |-- wickets
 |-- battingAvg
 |-- battingSR
 |-- runs
I also have a list of the column names that correspond to bowling stats:
List<String> bowlingParams = new ArrayList<>(Arrays.asList("bowlingAvg", "bowlingSR", "wickets"));
Expected Schema:
root
 |-- playerName
 |-- country
 |-- bowlingAvg
 |-- bowlingSR
 |-- wickets
 |-- battingAvg
 |-- battingSR
 |-- runs
 |-- bowlingStats
 |    |-- bowlingAvg
 |    |-- bowlingSR
 |    |-- wickets
I can do it like this:
playerDF = playerDF.withColumn("bowlingStats", functions.struct("bowlingAvg", "bowlingSR", "wickets"));
However, I want to use the list to select the columns for the struct dynamically.
I know we can do it like this in Scala
playerDF = playerDF.select(struct(bowlingParams.map(col): _*))
and I have also found a reference for how to do this in Python.
Is there a way we can do this in Java with Spark?
Answer
For Java, this solution worked for me:
1. Remove one attribute from the list (the non-dynamic one).
2. Convert the remaining list to a Scala Seq using JavaConverters.
3. When creating the nested column, pass that one attribute (as a string) together with the converted Scala Seq to struct.
import scala.collection.JavaConverters;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// "bowlingAvg" is kept out of the list and passed separately below.
List<String> bowlingParams = new ArrayList<>(Arrays.asList("bowlingSR", "wickets"));

// Scala's struct(colName: String, colNames: String*) compiles to a method
// taking (String, Seq<String>), so a converted Scala Seq can be passed
// straight from Java.
playerDF = playerDF.withColumn("bowlingStats",
        functions.struct("bowlingAvg",
                JavaConverters.asScalaIteratorConverter(bowlingParams.iterator())
                        .asScala().toSeq()));
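
As a side note, functions.struct carries Scala's @varargs annotation, so the Java API also exposes a struct(Column... cols) overload. That allows a fully dynamic version with no JavaConverters and no attribute pulled out of the list. A minimal sketch, assuming the same playerDF and the original three-element list:

import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;

import java.util.Arrays;
import java.util.List;

// All three bowling columns stay in the list; nothing is hardcoded.
List<String> bowlingParams = Arrays.asList("bowlingAvg", "bowlingSR", "wickets");

// Map each name to a Column and collect into a Column[] for the
// struct(Column...) overload, which is callable directly from Java.
Column[] cols = bowlingParams.stream()
        .map(functions::col)
        .toArray(Column[]::new);

playerDF = playerDF.withColumn("bowlingStats", functions.struct(cols));

This keeps the whole column list dynamic; the trade-off is relying on the Column-based overload instead of the string-plus-Seq trick above.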