Spark (Java) – DataFrame groupBy with multiple aggregations?

I’m trying to write a groupBy in Spark with Java. In SQL, the query would look like this:

SELECT id, count(id) AS count, max(date) AS maxdate
FROM table
GROUP BY id;

But what is the Spark/Java equivalent of this query? Suppose the variable table is a DataFrame, to keep the relation to the SQL query clear. I’m thinking of something like:

table = table.select(table.col("id"), (table.col("id").count()).as("count"), (table.col("date").max()).as("maxdate")).groupby("id")

This is obviously incorrect, since aggregate functions like count or max can’t be called on columns, only on DataFrames. So how is this done in Spark with Java?

Thank you!

Answer

You can do this with the helper functions in org.apache.spark.sql.functions:

import org.apache.spark.sql.functions;

table.groupBy("id").agg(
    functions.count("id").as("count"),
    functions.max("date").as("maxdate")
).show();
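To make the semantics concrete outside of Spark, the same group-by aggregation can be sketched with plain Java streams. This is not the Spark API, just an illustration of what count(id) and max(date) per group compute; the Row record and sample data are made up for the example:

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByDemo {
    // A row of the hypothetical "table": an id plus a date.
    record Row(int id, LocalDate date) {}

    public static void main(String[] args) {
        List<Row> table = List.of(
                new Row(1, LocalDate.of(2020, 1, 5)),
                new Row(1, LocalDate.of(2020, 3, 9)),
                new Row(2, LocalDate.of(2019, 12, 1)));

        // count(id) per group, mirroring GROUP BY id
        Map<Integer, Long> counts = table.stream()
                .collect(Collectors.groupingBy(Row::id, Collectors.counting()));

        // max(date) per group: keep the later date when two rows share an id
        Map<Integer, LocalDate> maxDates = table.stream()
                .collect(Collectors.toMap(Row::id, Row::date,
                        (a, b) -> a.isAfter(b) ? a : b));

        System.out.println(counts.get(1) + " " + maxDates.get(1)); // 2 2020-03-09
    }
}
```

The Spark version above does the same work, but distributed: groupBy("id") partitions the rows and agg(...) folds each aggregate over its partition.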


Source: stackoverflow