I created a parquet-structure from a csv file using spark:
```java
Dataset<Row> df = spark.read().format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("sample.csv");
df.write().parquet("sample.parquet");
```
Now I'm reading the parquet structure back and trying to transform the data:
```java
Dataset<org.apache.spark.sql.Row> df = spark.read().parquet("sample.parquet");
df.createOrReplaceTempView("tmpview");
Dataset<Row> namesDF = spark.sql("SELECT *, md5(station_id) as hashkey FROM tmpview");
```
Unfortunately I get a data type mismatch error. Do I have to explicitly assign data types?
```
17/04/12 09:21:52 INFO SparkSqlParser: Parsing command: SELECT *, md5(station_id) as hashkey FROM tmpview
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'md5(tmpview.`station_id`)' due to data type mismatch: argument 1 requires binary type, however, 'tmpview.`station_id`' is of int type.; line 1 pos 10;
'Project [station_id#0, bikes_available#1, docks_available#2, time#3, md5(station_id#0) AS hashkey#16]
+- SubqueryAlias tmpview, tmpview
   +- Relation[station_id#0,bikes_available#1,docks_available#2,time#3] parquet
```
Answer
Yes. As per the Spark documentation, the md5 function only accepts binary (string) arguments, so you need to cast station_id to string before applying md5. In Spark SQL you can chain cast and md5 together, e.g.:
```java
Dataset<Row> namesDF = spark.sql("SELECT *, md5(cast(station_id as string)) as hashkey FROM tmpview");
```
Or you can first add a string-typed column to the DataFrame and apply md5 to that, e.g.:
```java
Dataset<Row> newDf = df.withColumn("station_id_str", df.col("station_id").cast("string"));
newDf.createOrReplaceTempView("tmpview");
Dataset<Row> namesDF = spark.sql("SELECT *, md5(station_id_str) as hashkey FROM tmpview");
```
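Spark's md5 returns the standard 32-character lowercase hex digest of the value's string form, so you can sanity-check a hashkey outside Spark with plain Java. A minimal sketch (`Md5Check`, `md5Hex`, and the sample value 70 are illustrative names, not from the question's data):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Md5Check {
    // MD5 hex digest of an int's decimal string form, matching
    // Spark SQL's md5(cast(station_id as string)).
    static String md5Hex(int stationId) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(String.valueOf(stationId).getBytes(StandardCharsets.UTF_8));
        // Left-pad to 32 hex chars so leading zero bytes are not dropped.
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Hex(70)); // 32-char lowercase hex digest
    }
}
```

This is also a reminder that the hash is computed over the textual representation ("70"), not the raw int bytes, so the cast you choose determines the digest you get.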