I want to join two DataFrames based on the following condition: df1.col("name") is equal to df2.col("name") and df1.col("starttime") is greater than df2.col("starttime").
The first part of the condition is fine; I use the equalTo method of the Column class in Spark SQL. But for the "greater than" condition, when I use the following syntax in Java:
df1.col("starttime").gt(df2.col("starttime"))
it does not work. It seems the gt function of Column in Spark SQL only accepts numerical values; it does not work properly when you pass a Column as its input parameter. The program finishes normally, but the results are wrong: it does not find any rows in the DataFrame that satisfy my condition, even though I know such rows exist.
Any idea how I should implement a comparison between two columns in Spark SQL (e.g., checking whether one column is greater than a column in another DataFrame)?
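For reference, here is a sketch of the full join I am trying to write (df1 and df2 are my two DataFrames; the two comparisons are chained with and):

// Intended join: match on name and require a strictly greater starttime
// in df1. The equalTo part behaves as expected; the gt part is where I
// see the wrong (empty) result.
DataFrame joined = df1.join(
        df2,
        df1.col("name").equalTo(df2.col("name"))
           .and(df1.col("starttime").gt(df2.col("starttime"))));
joined.show();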
Answer
I ran the following code:
HiveContext sqlContext = new HiveContext(sc);

List<Event> list = new ArrayList<>();
list.add(new Event(1, "event1", Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-03 00:00:00")));
list.add(new Event(2, "event2", Timestamp.valueOf("2017-01-02 00:00:00"), Timestamp.valueOf("2017-01-03 00:00:00")));

List<Event> list2 = new ArrayList<>();
list2.add(new Event(1, "event11", Timestamp.valueOf("2017-01-02 00:00:00"), Timestamp.valueOf("2017-01-10 00:00:00")));
list2.add(new Event(2, "event22", Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-15 00:00:00")));

DataFrame df1 = getDF(sc, sqlContext, list);
DataFrame df2 = getDF(sc, sqlContext, list2);

// Column-to-column comparison used directly as the join condition:
df1.join(df2, df1.col("startTime").gt(df2.col("startTime"))).show();
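The Event bean and the getDF helper are not shown above; here is a minimal sketch of what they could look like (the field names are an assumption, chosen to match the columns in the output below):

import java.io.Serializable;
import java.sql.Timestamp;
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Plain serializable bean; Spark 1.6 derives the schema
// (id, name, startTime, endTime) from the getters via reflection.
public class Event implements Serializable {
    private int id;
    private String name;
    private Timestamp startTime;
    private Timestamp endTime;

    public Event(int id, String name, Timestamp startTime, Timestamp endTime) {
        this.id = id;
        this.name = name;
        this.startTime = startTime;
        this.endTime = endTime;
    }

    public int getId() { return id; }
    public String getName() { return name; }
    public Timestamp getStartTime() { return startTime; }
    public Timestamp getEndTime() { return endTime; }
}

// Assumed helper (placed inside the driver class): builds a DataFrame
// from a list of beans, using the bean class to infer the schema.
static DataFrame getDF(JavaSparkContext sc, SQLContext sqlContext, List<Event> events) {
    return sqlContext.createDataFrame(sc.parallelize(events), Event.class);
}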
And here is the result I got:
+---+------+--------------------+--------------------+---+-------+--------------------+--------------------+
| id|  name|           startTime|             endTime| id|   name|           startTime|             endTime|
+---+------+--------------------+--------------------+---+-------+--------------------+--------------------+
|  2|event2|2017-01-02 00:00:...|2017-01-03 00:00:...|  2|event22|2017-01-01 00:00:...|2017-01-15 00:00:...|
+---+------+--------------------+--------------------+---+-------+--------------------+--------------------+
Seems to me like it works as expected.
Also, the Spark source code (version 1.6 here) tells the same story: Column.gt accepts Any and runs its argument through lit(), which returns a Column argument unchanged, so comparing against another column is supported.