I’m trying to write a groupBy in Spark with Java. In SQL this would look like
SELECT id, count(id) AS count, max(date) AS maxdate FROM table GROUP BY id;
But what is the Spark/Java equivalent of this query? Let’s say the variable table
is a DataFrame, to keep the relation to the SQL query clear. I’m thinking something like:
table = table.select(
        table.col("id"),
        table.col("id").count().as("count"),
        table.col("date").max().as("maxdate")
    ).groupby("id")
Which is obviously incorrect, since you can’t call aggregate functions like .count()
or .max() on columns, only on DataFrames. So how is this done in Spark with Java?
Thank you!
Answer
You could do this with org.apache.spark.sql.functions:
import org.apache.spark.sql.functions;

// Group by id, then compute both aggregates in a single agg() call
table.groupBy("id")
    .agg(
        functions.count("id").as("count"),   // count(id) AS count
        functions.max("date").as("maxdate")  // max(date) AS maxdate
    )
    .show();
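If you’d rather keep the SQL itself, you can also register the DataFrame as a temporary view and run the original query through the SparkSession. A minimal sketch, assuming an existing SparkSession named spark; the view name my_table is just a placeholder:

// Assumes `spark` is an existing SparkSession and `table` is the DataFrame above.
table.createOrReplaceTempView("my_table");

spark.sql(
    "SELECT id, count(id) AS count, max(date) AS maxdate "
        + "FROM my_table GROUP BY id"
).show();

Both versions produce the same grouped result, so it mostly comes down to whether you prefer the DataFrame API or plain SQL.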