Skip to content
Advertisement

Best practice to pass large pipeline option in apache beam

We have a use case where we want in to pass hundred lines of json spec to our apache beam pipeline.? One straight forward way is to create custom pipeline option as mentioned below. Is there any other way where we can pass the input as file?

public interface CustomPipelineOptions extends PipelineOptions {
    @Description("The Json spec")
    String getJsonSpec();
    void setJsonSpec(String jsonSpec);
}

I want to deploy the pipeline in Google dataflow engine. Even If I pass the spec as filepath and read the file contents inside the beam code before starting the pipeline, how do I bundle the spec file part of pipeline. P.S Note, I don’t want to commit the spec file(in resource folder) part of my source code where my beam code is available. It needs to configurable, i.e I want to pass different spec file for different beam pipeline job.

Advertisement

Answer

You can pass the options as a POJO.

public class JsonSpec {
    public String stringArg;
    public int intArg;
}

Then reference in your options

public interface CustomPipelineOptions extends PipelineOptions {
    @Description("The Json spec")
    JsonSpec getJsonSpec();
    void setJsonSpec(JsonSpec jsonSpec);
}

Options will be parsed to the class; I believe by Jackson though not sure.

I am wondering why you want to pass in “hundreds of lines of JSON” as a pipeline option? This doesn’t seem like a very “Beam” way of doing things. Pipeline options should pass configuration; do you really need hundreds of lines of configuration per pipeline run? If you intend to pass data to create a PCollection then better off using TextIO and then processing lines as JSON.

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement