I created a parquet-structure from a csv file using spark:
```java
Dataset<Row> df = spark.read().format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("sample.csv");
df.write().parquet("sample.parquet");
```
Now I'm reading the parquet structure back and trying to transform the data:
```java
Dataset<org.apache.spark.sql.Row> df = spark.read().parquet("sample.parquet");
df.createOrReplaceTempView("tmpview");
Dataset<Row> namesDF = spark.sql("SELECT *, md5(station_id) as hashkey FROM tmpview");
```
Unfortunately I get a data type mismatch error. Do I have to explicitly assign data types?
```
17/04/12 09:21:52 INFO SparkSqlParser: Parsing command: SELECT *, md5(station_id) as hashkey FROM tmpview
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'md5(tmpview.`station_id`)' due to data type mismatch: argument 1 requires binary type, however, 'tmpview.`station_id`' is of int type.; line 1 pos 10;
'Project [station_id#0, bikes_available#1, docks_available#2, time#3, md5(station_id#0) AS hashkey#16]
+- SubqueryAlias tmpview, tmpview
   +- Relation[station_id#0,bikes_available#1,docks_available#2,time#3] parquet
```
Answer
Yes. As per the Spark documentation, the md5 function only accepts binary (string) arguments, so you need to cast station_id to string before applying md5. In Spark SQL you can chain cast and md5 together, e.g.:
```java
Dataset<Row> namesDF = spark.sql("SELECT *, md5(cast(station_id as string)) as hashkey FROM tmpview");
```
Or you can first add a string-typed column to the DataFrame and apply md5 to that, e.g.:
```java
Dataset<Row> newDf = df.withColumn("station_id_str", df.col("station_id").cast("string"));
newDf.createOrReplaceTempView("tmpview");
Dataset<Row> namesDF = spark.sql("SELECT *, md5(station_id_str) as hashkey FROM tmpview");
```
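Spark's md5 returns the standard 32-character lowercase hex digest of the value's string form, so you can sanity-check a hashkey outside Spark with plain Java. A minimal sketch (`Md5Check`, `md5Hex`, and the sample value 70 are illustrative names, not from the question's data):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Md5Check {
    // MD5 hex digest of an int's decimal string form, matching
    // Spark SQL's md5(cast(station_id as string)).
    static String md5Hex(int stationId) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(String.valueOf(stationId).getBytes(StandardCharsets.UTF_8));
        // Left-pad to 32 hex chars so leading zero bytes are not dropped.
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Hex(70)); // 32-char lowercase hex digest
    }
}
```

This is also a reminder that the hash is computed over the textual representation ("70"), not the raw int bytes, so the cast you choose determines the digest you get.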