
How to create a struct column from a list of column names in Spark with Java?

I have a DataFrame with multiple columns, e.g.

root
 |-- playerName
 |-- country
 |-- bowlingAvg
 |-- bowlingSR
 |-- wickets
 |-- battingAvg
 |-- battingSR
 |-- runs

I also have a list of the column names that correspond to bowling stats:

List<String> bowlingParams = new ArrayList<>(Arrays.asList("bowlingAvg", "bowlingSR", "wickets"));

Expected Schema:

root
 |-- playerName
 |-- country
 |-- bowlingAvg
 |-- bowlingSR
 |-- wickets
 |-- battingAvg
 |-- battingSR
 |-- runs
 |-- bowlingStats 
       |-- bowlingAvg
       |-- bowlingSR
       |-- wickets

I can do it like this:

playerDF = playerDF.withColumn("bowlingStats", functions.struct("bowlingAvg", "bowlingSR", "wickets"));

However, I want to use the list to dynamically select the columns for the struct.

I know we can do it like this in Scala

playerDF = playerDF.select(struct(bowlingParams.map(col): _*))

I have also found a reference on how to do this in Python.

Is there a way we can do this in Java with Spark?

Answer

For Java, this solution worked for me:

  • Remove one column name from the list (this one is passed separately, not dynamically).

  • Convert the remaining list to a Scala Seq using JavaConverters.

  • When creating the nested column, pass that one column name (as a String) followed by the converted Scala Seq to struct.

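    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import scala.collection.JavaConverters;

    // Every column name except "bowlingAvg", which is passed to struct() separately.
    List<String> bowlingParams = new ArrayList<>(Arrays.asList("bowlingSR", "wickets"));

    // struct(String, Seq<String>): first name as a String, the rest as a Scala Seq.
    playerDF = playerDF.withColumn("bowlingStats", functions.struct("bowlingAvg", JavaConverters.asScalaIteratorConverter(bowlingParams.iterator()).asScala().toSeq()));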
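
The hard-coded first column can be avoided by splitting the full list at run time. The following is a minimal sketch in plain Java, not part of the original answer; the class name StructArgs and the helpers head and tail are hypothetical names chosen for illustration. It assumes the result is passed to the Java-visible varargs overload struct(String colName, String... colNames), so no Scala conversion should be needed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits a list of column names into the two
// argument shapes expected by struct(String colName, String... colNames).
final class StructArgs {
    // First column name: goes into the leading String parameter.
    static String head(List<String> names) {
        if (names.isEmpty()) {
            throw new IllegalArgumentException("need at least one column name");
        }
        return names.get(0);
    }

    // Remaining column names as an array, suitable for a String... parameter.
    static String[] tail(List<String> names) {
        return names.subList(1, names.size()).toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> bowlingParams = Arrays.asList("bowlingAvg", "bowlingSR", "wickets");
        System.out.println(head(bowlingParams));
        System.out.println(Arrays.toString(tail(bowlingParams)));
    }
}
```

With Spark on the classpath, the call would then read playerDF.withColumn("bowlingStats", functions.struct(StructArgs.head(bowlingParams), StructArgs.tail(bowlingParams))). Another option is to map each name through functions.col and pass the resulting Column[] to struct(Column... cols).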
    
User contributions licensed under: CC BY-SA