In an Apache Beam step I have a `PCollection` of `KV<String, Iterable<KV<Long, GenericRecord>>>`.
I want to write all the records in the Iterable to the same Parquet file. My code snippet is given below:

```java
p.apply(ParDo.of(new MapWithAvroSchemaAndConvertToGenericRecord())) // PCollection<GenericRecord>
 .apply(ParDo.of(new MapKafkaGenericRecordValue(formatter, options.getFileNameDelimiter()))) // PCollection<KV<String, KV<Long, GenericRecord>>>
 .apply(GroupByKey.create()) // PCollection<KV<String, Iterable<KV<Long, GenericRecord>>>>
```
Now I want to write all the records in the Iterable to the same Parquet file, deriving the file name from the key of the KV.
Answer
I found the solution to the problem. At the step

```java
.apply(GroupByKey.create()) // PCollection<KV<String, Iterable<KV<Long, GenericRecord>>>>
```
I apply another transform that returns only the Iterable as the output PCollection:

```java
.apply(ParDo.of(new GetIterable())) // PCollection<Iterable<KV<String, GenericRecord>>>
```

where the key of each inner KV is the name of the file I have to write to (a sketch of GetIterable is below).
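For reference, a minimal sketch of what GetIterable could look like, assuming the inner Long key (e.g. a Kafka offset or timestamp) is no longer needed and can be dropped:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Re-keys every record in the grouped Iterable by the file name, so the
// records can later be flattened and routed by FileIO.writeDynamic().
class GetIterable
    extends DoFn<KV<String, Iterable<KV<Long, GenericRecord>>>,
        Iterable<KV<String, GenericRecord>>> {

  @ProcessElement
  public void processElement(ProcessContext c) {
    String fileName = c.element().getKey();
    List<KV<String, GenericRecord>> records = new ArrayList<>();
    for (KV<Long, GenericRecord> record : c.element().getValue()) {
      // Drop the Long key and re-key each record by the destination file name.
      records.add(KV.of(fileName, record.getValue()));
    }
    c.output(records);
  }
}
```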
The remaining snippet is:

```java
.apply(Flatten.iterables())
.apply(
    FileIO.<String, KV<String, GenericRecord>>writeDynamic()
        .by((SerializableFunction<KV<String, GenericRecord>, String>) KV::getKey)
        .via(
            Contextful.fn(
                (SerializableFunction<KV<String, GenericRecord>, GenericRecord>) KV::getValue),
            ParquetIO.sink(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY))
        .withTempDirectory("/tmp/temp-beam")
        .to(options.getGCSBucketUrl())
        .withNumShards(1)
        .withDestinationCoder(StringUtf8Coder.of()))
```
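A note on what this does: `.by()` routes every element to a destination (here the file-name key), `.via()` converts each element to the GenericRecord that `ParquetIO.sink(schema)` writes, and `.withNumShards(1)` forces a single shard, so all records sharing a key land in one Parquet file per window/pane. Depending on your Beam version, you may also need a `.withNaming(...)` call so that `writeDynamic` knows how to name the output file for each destination.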