Tag: parquet

A Parquet file of a dataset with a String field containing leading zeroes returns that field without the leading zeroes if it is partitioned by it

I have a Dataset gathering information about French cities, and the field that is troubling me is the department one (CodeDepartement). When the Dataset isn’t partitioned by this String field codeDepartement, everything works well. When that function runs, if I don’t attempt to partition the dataset (the required statements for partitioning are commented out here), everything goes fine: The content

AvroParquetOutputFormat – Unable to Write Arrays with Null Elements

avro java parquet

I’m using v1.11.1 of the parquet-mr library as part of a Java application that takes Avro records and writes them into Parquet files using the AvroParquetOutputFormat. There are Avro records with array type fields that will have null elements, e.g. Here’s an example Avro schema: I’m trying to write the following record: I thought I could use the 3-level list

Dataflow writing a PCollection of GenericRecords to Parquet files

apache-beam dataflow java parquet

In an Apache Beam step I have a PCollection of KV<String, Iterable<KV<Long, GenericRecord>>>. I want to write all the records in the Iterable to the same Parquet file. My code snippet is given below. Now I want to write all the records in the Iterable to the same Parquet file (deriving the file name from the key of the KV). Answer I found

Data type mismatch while transforming data in spark dataset

apache-spark apache-spark-dataset apache-spark-sql java parquet

I created a Parquet structure from a CSV file using Spark: I’m reading the Parquet structure and I’m trying to transform the data in a Dataset: Unfortunately I get a data type mismatch error. Do I have to explicitly assign data types? 17/04/12 09:21:52 INFO SparkSqlParser: Parsing command: SELECT *, md5(station_id) as hashkey FROM tmpview Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve