Data type mismatch while transforming data in spark dataset

I created a Parquet structure from a CSV file using Spark.

I then read the Parquet structure back and try to transform the data in a Dataset with the query SELECT *, md5(station_id) as hashkey FROM tmpview.

Unfortunately I get a data type mismatch error. Do I have to explicitly assign data types?

17/04/12 09:21:52 INFO SparkSqlParser: Parsing command: SELECT *, md5(station_id) as hashkey FROM tmpview
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'md5(tmpview.station_id)' due to data type mismatch: argument 1 requires binary type, however, 'tmpview.station_id' is of int type.; line 1 pos 10;
'Project [station_id#0, bikes_available#1, docks_available#2, time#3, md5(station_id#0) AS hashkey#16]
+- SubqueryAlias tmpview, tmpview
   +- Relation[station_id#0,bikes_available#1,docks_available#2,time#3] parquet


Answer

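Yes. According to the Spark documentation, the md5 function accepts only binary input; string columns are accepted because Spark converts them to binary implicitly, but an int column is not. You therefore need to cast station_id to a string before applying md5. In Spark SQL you can chain md5 and cast in a single expression, for example md5(cast(station_id as string)).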
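The chained form can be sketched in PySpark as follows. This is a minimal sketch, not the original poster's code: it assumes the data is registered under the view name tmpview seen in the error log, and the names QUERY, add_hashkey and parquet_path are hypothetical. The question appears to use a JVM API, where the same SQL string would apply.

```python
# Spark SQL query: cast the int column to string so that md5 receives a
# type it accepts, then expose the hash as a new column named hashkey.
QUERY = "SELECT *, md5(cast(station_id AS string)) AS hashkey FROM tmpview"


def add_hashkey(spark, parquet_path):
    """Read parquet data, register it as tmpview, and add the md5 column.

    `spark` is an existing SparkSession; `parquet_path` is a hypothetical
    parameter pointing at the directory written earlier.
    """
    df = spark.read.parquet(parquet_path)
    df.createOrReplaceTempView("tmpview")
    return spark.sql(QUERY)
```

With a running session, this would be called as add_hashkey(spark, path_to_parquet); the result keeps all original columns and adds hashkey as a 32-character hex string.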

Alternatively, you can create a new string-typed column in the DataFrame and apply md5 to that column.
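The DataFrame variant might look like the sketch below, again in PySpark and under assumptions: the column name station_id_str and the function names with_hashkey and reference_hashkey are illustrative and do not come from the original answer. The second helper is a plain-Python cross-check that should give the same hex digest Spark produces for a given id, assuming the int is cast to its decimal string and hashed as UTF-8 bytes.

```python
import hashlib


def with_hashkey(df):
    """Add a string copy of station_id, then hash it.

    `df` is a PySpark DataFrame that has an int column station_id.
    """
    # Imported lazily so the module loads even where PySpark is absent.
    from pyspark.sql.functions import col, md5

    return (
        df.withColumn("station_id_str", col("station_id").cast("string"))
          .withColumn("hashkey", md5(col("station_id_str")))
    )


def reference_hashkey(station_id: int) -> str:
    """Plain-Python equivalent of md5(cast(station_id AS string))."""
    return hashlib.md5(str(station_id).encode("utf-8")).hexdigest()
```

Keeping the cast in a separate column leaves the original int station_id untouched, which may matter if later steps still need it as a number.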