I’m attempting to upgrade the Apache Beam libraries from v2.19.0 to v2.37.0 (Java 8 & Maven), but I’ve run into an issue with a breaking change that I’d appreciate some support with. Sorry this is quite a long one; I wanted to capture as much context as I could, but please shout if there’s anything you’d like to dig into.
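For reference, the version bump itself is just a Maven property/dependency change; this is a sketch, and the exact artifact list depends on which Beam modules the project actually uses:

```xml
<properties>
  <!-- was 2.19.0 -->
  <beam.version>2.37.0</beam.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-core</artifactId>
    <version>${beam.version}</version>
  </dependency>
  <!-- plus the IO and runner modules the pipeline uses, all at the same version -->
</dependencies>
```

Keeping every Beam artifact on the same `${beam.version}` avoids mixed-version classpath issues when crossing this many releases.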
Tag: apache-beam
Apache Beam Split to Multiple Pipeline Output
I am subscribing to a single topic that contains different event types, each arriving with different attributes. After I read an element, I need to route it to a different place based on its attribute. This is what the sample code looks like: basically I read an event, filter on its attribute, and write the file. The job failed
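The usual way to route one stream to several destinations in Beam is a multi-output `ParDo` with `TupleTag`s. A minimal sketch, in which the attribute name `event_type`, the tag names, and the `PubsubMessage` element type are all assumptions:

```java
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Anonymous subclasses so Beam can infer the element type of each tag.
final TupleTag<PubsubMessage> orderTag = new TupleTag<PubsubMessage>() {};
final TupleTag<PubsubMessage> otherTag = new TupleTag<PubsubMessage>() {};

PCollectionTuple routed = messages.apply("RouteByAttribute",
    ParDo.of(new DoFn<PubsubMessage, PubsubMessage>() {
      @ProcessElement
      public void process(ProcessContext c) {
        String type = c.element().getAttribute("event_type"); // assumed attribute name
        if ("order".equals(type)) {
          c.output(c.element());           // main output (orderTag)
        } else {
          c.output(otherTag, c.element()); // additional output
        }
      }
    }).withOutputTags(orderTag, TupleTagList.of(otherTag)));

PCollection<PubsubMessage> orders = routed.get(orderTag);
PCollection<PubsubMessage> others = routed.get(otherTag);
// Each branch can then be written to its own destination.
```

When the set of branches is fixed and each element belongs to exactly one of them, Beam's `Partition` transform is an alternative to hand-written tags.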
How to extract information from PCollection after a join in apache beam?
I have two example streams of data on which I perform an innerJoin. I would like to extend this piece of example join code and add some logic after the join occurs. I would like to just print the ad name and number of clicks after the join using a DoFn like this: Any ideas on how to extract this info from
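The join-library's `innerJoin` produces `KV<joinKey, KV<leftValue, rightValue>>`, so the joined values come out of the nested `KV`. A minimal sketch, assuming the left stream carries ad names and the right stream carries click counts keyed by the same id:

```java
import org.apache.beam.sdk.extensions.joinlibrary.Join;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Each joined element is KV<joinKey, KV<left, right>>.
PCollection<KV<String, KV<String, Long>>> joined =
    Join.innerJoin(adNamesById, clickCountsById);

joined.apply("PrintJoined", ParDo.of(
    new DoFn<KV<String, KV<String, Long>>, Void>() {
      @ProcessElement
      public void process(@Element KV<String, KV<String, Long>> e) {
        String adName = e.getValue().getKey();    // value from the left input
        Long numClicks = e.getValue().getValue(); // value from the right input
        System.out.println(adName + " -> " + numClicks + " clicks");
      }
    }));
```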
Add document to Firestore from Beam with auto generated ID
I would like to use Apache Beam Java with the recently published Firestore connector to add new documents to a Firestore collection. While I thought this would be a relatively easy task, the need to create com.google.firestore.v1.Document objects seems to make things a bit more difficult. I was using this blog post on Using Firestore and Apache Beam for
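The v1 `BatchWrite` path expects a full document name rather than letting the server pick one, so one workaround is to generate the "auto" ID client-side with a UUID. A sketch under that assumption; the project, database, and collection names are placeholders:

```java
import java.util.UUID;

import com.google.firestore.v1.Document;
import com.google.firestore.v1.Value;
import com.google.firestore.v1.Write;

// BatchWrite needs a full resource name, so mint the document ID ourselves.
String docId = UUID.randomUUID().toString();

Document doc = Document.newBuilder()
    .setName("projects/my-project/databases/(default)/documents/my-collection/" + docId)
    .putFields("title", Value.newBuilder().setStringValue("hello").build())
    .build();

// setUpdate creates the document if it does not exist.
Write write = Write.newBuilder().setUpdate(doc).build();
// The resulting PCollection<Write> is then applied to the connector's batch-write transform.
```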
Apache Beam BigqueryIO.Write getSuccessfulInserts not working
We are creating a simple Apache Beam pipeline which inserts data into a BigQuery table. We are trying to get the tableRows which have been successfully inserted into the table and the tableRows which errored; the code is as shown in the screenshot. According to the following documentation: https://beam.apache.org/releases/javadoc/2.33.0/org/apache/beam/sdk/io/gcp/bigquery/WriteResult.html BigQueryIO.writeTableRows() returns a WriteResult object which has getSuccessfulInserts(), which will
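In my understanding, `getSuccessfulInserts()` is only supported when the write uses the streaming-inserts method, and on some SDK versions successful rows are only propagated when explicitly requested. A sketch of a write configured that way; the table spec is a placeholder, and `withSuccessfulInsertsPropagation` may or may not exist on your SDK version:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.values.PCollection;

WriteResult result = rows.apply("WriteToBQ",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")                     // placeholder
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)    // required for getSuccessfulInserts
        .withSuccessfulInsertsPropagation(true)                   // needed on newer SDKs, if available
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

PCollection<TableRow> ok = result.getSuccessfulInserts();
PCollection<TableRow> failed = result.getFailedInserts();
```

If the write uses file loads (the batch default), `getSuccessfulInserts()` throws, which would explain the behavior described above.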
Best practice to pass large pipeline option in apache beam
We have a use case where we want to pass a hundred lines of JSON spec to our Apache Beam pipeline. One straightforward way is to create a custom pipeline option as mentioned below. Is there any other way we can pass the input as a file? I want to deploy the pipeline on the Google Dataflow engine. Even if I
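One common pattern is to accept a file *path* as the option and read the file into a string at pipeline-construction time, before the pipeline is submitted; the spec then travels with the serialized transforms. A sketch where the `jsonSpec` option name is an assumption:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SpecLoader {
  // Read the whole JSON spec file into a single String at
  // pipeline-construction time, before Pipeline.create(options).
  public static String readSpec(Path path) throws IOException {
    return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    String spec = readSpec(Paths.get(args[0]));
    // options.setJsonSpec(spec);  // hypothetical custom-option setter
    System.out.println(spec.length() + " characters of spec loaded");
  }
}
```

An alternative when the file must be resolved at runtime (e.g. it lives in GCS) is to pass the path through and read it inside a `DoFn`'s `@Setup`.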
How to append new rows or perform union on two PCollections
In the following CSV, I need to append new row values to it.

ID  date        balance
01  31/01/2021  100
01  28/02/2021  200
01  31/03/2021  200
01  30/04/2021  200
01  31/05/2021  500
01  30/06/2021  600

Expected output:

ID  date        balance
01  31/01/2021  100
01  28/02/2021  200
01  31/03/2021  200
01  30/04/2021  200
01  31/05/2021  500
01  30/06/2021  600
01  30/07/2021  999
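Appending one PCollection to another is a union, which Beam spells `Flatten`. A minimal sketch, assuming both collections hold the CSV rows as strings:

```java
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

// Both inputs must share the same element type (and compatible coders).
PCollection<String> merged =
    PCollectionList.of(existingRows).and(newRows)
        .apply("UnionRows", Flatten.pCollections());
```

Note that the result is unordered; if the output must be sorted by date, that ordering has to be imposed when writing, not by the union itself.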
How do I set the coder for a PCollection<List> in Apache Beam?
I’m teaching myself Apache Beam, specifically for use in parsing JSON. I was able to create a simple example that parsed JSON to a POJO and POJO to CSV. It required that I use .setCoder() for my simple POJO class. The problem: now I am trying to skip the POJO step and parse using some custom transforms. My pipeline looks
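For a `PCollection<List<...>>` the coder can be composed from `ListCoder` plus the element coder. A sketch assuming the lists hold strings; `ParseJsonFn` is a hypothetical stand-in for the custom parsing transform:

```java
import java.util.List;

import org.apache.beam.sdk.coders.ListCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

PCollection<List<String>> parsed = input
    .apply("ParseJson", ParDo.of(new ParseJsonFn())) // hypothetical DoFn
    .setCoder(ListCoder.of(StringUtf8Coder.of()));   // coder for List<String>
```

The raw type `PCollection<List>` itself cannot be given a coder, since Beam needs the element type; parameterizing the list is what makes `ListCoder.of(...)` applicable.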
Apache Beam Resampling of columns based on date
I am using Apache Beam to process data and am trying to achieve the following: read the data from a CSV file (completed); group the records by customer ID (completed); resample the data by month and calculate the sum for that particular month. Detailed explanation: I have a CSV file as shown below.

customerId  date        amount
BS:89481    11/14/2012  124
BS:89480
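The "resample by month" step is really key-by-month-then-sum. The date handling can be sketched in plain Java (the `M/d/yyyy` pattern is inferred from the dates shown above); the same keying can then run inside a Beam `MapElements` followed by `Sum.longsPerKey()`:

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MonthlySum {
  private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("M/d/yyyy");

  // Each record is {date, amount}; key it by YearMonth and sum per month.
  public static Map<YearMonth, Long> sumByMonth(List<String[]> records) {
    Map<YearMonth, Long> totals = new TreeMap<>();
    for (String[] r : records) {
      YearMonth month = YearMonth.from(LocalDate.parse(r[0], FMT));
      totals.merge(month, Long.parseLong(r[1]), Long::sum);
    }
    return totals;
  }
}
```

In the pipeline itself the composite key would be customer ID plus month, so the per-customer grouping already completed carries through the resampling step.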
Beam PAssert messes up the Row
I am exploring testing with Beam and encountered a weird problem. My driver program works as expected, but its test is failing with an error like this: And here is my PAssert code: On the last step of my pipeline, I log the element in question. This is the expected result. When I debugged the test, the problem boiled down
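Since `Row` equality includes the schema (field names, types, and order), a failing `PAssert` on Rows is often a schema mismatch rather than a value mismatch. A minimal sketch of asserting on Rows, where the field names and values are assumptions; the key point is building the expected Row from the same schema the pipeline output uses:

```java
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.values.Row;

Schema schema = Schema.builder()
    .addStringField("name")
    .addInt64Field("count")
    .build();

// addValues must follow the schema's declared field order exactly.
Row expected = Row.withSchema(schema).addValues("ad1", 42L).build();

PAssert.that(output).containsInAnyOrder(expected);
```

If the pipeline's schema was inferred (e.g. from a POJO), printing `output.getSchema()` and comparing it field by field with the hand-built one is a quick way to find the discrepancy.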