
A parquet file of a dataset with a String field containing leading zeroes returns that field without the leading zeroes, if the dataset is partitioned by that field

I have a Dataset gathering information about French cities,


and the field that is troubling me is the department one (CodeDepartement).

When the Dataset isn’t partitioned by this String field codeDepartement, everything works well


When that function runs, if I don’t attempt to partition the dataset (the statements required for partitioning are commented out here), everything goes fine:

The content displayed on the console is this one:


The saveToStore sub-functions are called; they just do this:


A parquet file is created (a folder containing 200 files with names like part-00000-522c936f-7b72-4ed8-bab9-9d4acee6bc7c-c000.snappy.parquet, not partitioned), and if I load the resulting parquet file with Zeppelin, I receive this content:


The leading zeroes of the codeDepartement fields are there, and that’s normal: this field is a string, as proved by the 2A and 2B values, which are the codes for the two Corsica departments.

I notice that in the schema shown, codeDepartement is in the fourth position.

When the dataset is partitioned by codeDepartement, leading zeroes are lost when the parquet file is loaded


If I activate partitioning of the Dataset by codeDepartement in my CogDataset source file, by uncommenting the lines:


The content of the dump “X07” is the same, except that the codeDepartement values are ordered and retain their leading zeroes. The parquet file now has subfolders like codeDepartement=02 (and 02 keeps its leading zero, so it’s promising), but when I load that parquet file with Zeppelin, things go wrong:


The leading zeroes of codeDepartement are lost, while the 2A and 2B department codes are still there, showing that this field is still a string.

I notice that the codeDepartement returned by Parquet is now in the last position of the schema, as if Parquet had recreated that field itself (?).

Do you have an idea of what is happening?
It looks like I’m missing some option that I should set before storing my content to Parquet, or when reloading it?

I’m using Spark 3.2.0.


Answer

I found the answer. The problem isn’t the parquet file itself, but the fact that these statements:


even though they display the correct dataset schema, don’t really take that schema into account: they try to infer it from the data while reading, for as long as possible (?!),
therefore considering my codeDepartement field numeric until they stumble upon the values 2A and 2B, which force them to switch the field type to string.
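The mechanism behind this can be mimicked in a few lines of plain Python: if each partition directory value is parsed individually (so “02” becomes the integer 2), and the whole column is only upcast back to string once a non-numeric value like 2A appears, the zero is already gone by then. This is a toy illustration of that explanation, not Spark’s actual code:

```python
def infer_one(value: str):
    # Mimic per-value type inference: try numeric first, fall back to string.
    try:
        return int(value)          # "02" -> 2 : the leading zero is lost here
    except ValueError:
        return value               # "2A" stays a string

def infer_column(values):
    parsed = [infer_one(v) for v in values]
    if all(isinstance(p, int) for p in parsed):
        return parsed              # homogeneous: the column stays numeric
    return [str(p) for p in parsed]  # mixed types: upcast everything to string

# Directory names codeDepartement=02, =19, =2A:
print(infer_column(["02", "19", "2A"]))  # ['2', '19', '2A'] — "02" lost its zero
```

The column ends up typed as string (2A and 2B survive), yet the purely numeric codes have already been stripped of their zeroes, exactly the symptom observed above.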

I wonder why parquet file loading works that way, and whether my explanation is the right one.

But the correct way to ensure that the schema of the parquet file is taken into account is to explicitly extract it from that parquet file first, and then ask the load function to use it:


and then parquet file reading behaves normally:


Maybe there’s a simpler way to achieve this; if you know it, I’d be happy to learn it, because this is a bit burdensome.

User contributions licensed under: CC BY-SA