
How do I set the coder for a PCollection<List<String>> in Apache Beam?

I’m teaching myself Apache Beam, specifically for parsing JSON. I was able to create a simple example that parsed JSON to a POJO and the POJO to CSV. It required that I call .setCoder() for my simple POJO class.

JavaScript

The problem

Now I am trying to skip the POJO step of parsing using some custom transforms. My pipeline looks like this:

JavaScript

This pipeline is supposed to take a heavily nested JSON structure and print each individual path through the tree. I’m getting the same error I did in the POJO example above:

JavaScript

What I tried

So I tried to add a coder in a few different ways:

JavaScript

This results in “Cannot select from parameterized type”. I found another instance of this error generated by a different use case here, but the accepted answer seemed to be applicable only to that use case.
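For context, “cannot select from parameterized type” is what javac reports for a class literal on a parameterized type, such as `List<String>.class`. The snippet that triggered it is not shown here, so that cause is an assumption. Generic type arguments are erased at runtime, so only the raw `List.class` exists. A common workaround is to capture the full type through an anonymous subclass; Beam's `TypeDescriptor` is used the same way (`new TypeDescriptor<List<String>>() {}`). Below is a JDK-only sketch of that idea. `TypeCapture` and `ErasureDemo` are hypothetical names for illustration, not Beam classes.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

// Hypothetical helper: remembers the type argument its subclass was declared with.
abstract class TypeCapture<T> {
  private final Type type;

  protected TypeCapture() {
    // An anonymous subclass records its generic superclass in class metadata,
    // so the type argument survives erasure and can be read reflectively.
    ParameterizedType superType = (ParameterizedType) getClass().getGenericSuperclass();
    this.type = superType.getActualTypeArguments()[0];
  }

  Type getType() {
    return type;
  }
}

final class ErasureDemo {
  private ErasureDemo() {}

  // List<String>.class does not compile; only the raw class literal is available.
  static Class<?> rawList() {
    return List.class;
  }

  // The trailing {} creates the anonymous subclass that carries List<String>.
  static Type listOfStrings() {
    return new TypeCapture<List<String>>() {}.getType();
  }
}
```

In Beam itself, a list-of-strings coder can be built without any class literal, for example with `ListCoder.of(StringUtf8Coder.of())`.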

So then I started perusing the Beam docs and found ListCoder.of() which has (literally) no description. But it looked promising, so I tried it:

JavaScript

But this takes me back to the initial error of not having manually set a coder.

The question

How do I satisfy this requirement to set a coder for a List<String> object?

Code

The transform that is causing the setCoder error is this one:

JavaScript


Answer

While the error message seems to imply that the list of strings is what needs a coder, it is actually the JsonNode. I just had to read a little further down in the error message, as the opening statement is misleading about where the issue is:

JavaScript

Once I discovered this, I solved the problem by extending Beam’s CustomCoder class. This abstract class is nice because you only have to write the code to serialize and deserialize the object:

JavaScript
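For anyone who wants to see the shape of the serialize/deserialize pair, here is a minimal JDK-only sketch. It is not the exact coder from this answer. It assumes the JsonNode has already been turned into its JSON text (for example with Jackson's `ObjectMapper.writeValueAsString`) and is parsed back after decoding (for example with `ObjectMapper.readTree`). `JsonTextCodec` is a hypothetical name. In a real `CustomCoder<JsonNode>` subclass, the two method bodies would go inside the `encode(JsonNode, OutputStream)` and `decode(InputStream)` overrides.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the body of a custom coder: JSON text in, JSON text out.
final class JsonTextCodec {
  private JsonTextCodec() {}

  // Writes a 4-byte length prefix followed by the UTF-8 bytes of the JSON text.
  // The prefix matters because several elements may share one stream, so the
  // decoder has to know where each element ends.
  static void encode(String jsonText, OutputStream out) throws IOException {
    byte[] bytes = jsonText.getBytes(StandardCharsets.UTF_8);
    DataOutputStream data = new DataOutputStream(out);
    data.writeInt(bytes.length);
    data.write(bytes);
    data.flush();
  }

  // Reads exactly one element: the length prefix, then that many bytes.
  // DataInputStream does not read ahead, so the next element stays in the stream.
  static String decode(InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    int length = data.readInt();
    byte[] bytes = new byte[length];
    data.readFully(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```

Encoding two documents into the same stream and decoding twice should return them in order, which is the round-trip property a coder needs. Beam's own `StringUtf8Coder` can do the length-prefixed string part for you if you would rather delegate to it.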

Hope this helps some other Beam newbie out there.

User contributions licensed under: CC BY-SA