
Apache Beam Resampling of columns based on date

I am using Apache Beam to process data and am trying to achieve the following:

  1. Read the data from a CSV file. (Completed)
  2. Group the records by customer ID. (Completed)
  3. Resample the data by month and calculate the sum for each month.

Detailed Explanation:

I have a CSV file as shown below.

customerId date amount
BS:89481 11/14/2012 124
BS:89480 11/14/2012 234
BS:89481 11/10/2012 189
BS:89480 11/02/2012 987
BS:89481 09/14/2012 784
BS:89480 11/14/2012 056

Intermediate stage: Grouping it by customerId and sorting it by date

customerId date amount
BS:89481 09/14/2012 784
BS:89481 11/10/2012 189
BS:89481 11/14/2012 124
BS:89480 11/02/2012 987
BS:89480 11/14/2012 234
BS:89480 11/14/2012 056

Expected output (on resampling): here, we calculate the sum of all the amounts in a given month for each individual customer. E.g., customer BS:89481 has two spends in November, so we calculate the sum for that month (124 + 189).

customerId date amount
BS:89481 09/30/2012 784
BS:89481 11/30/2012 313
BS:89480 11/30/2012 1277

I was able to complete steps 1 and 2, but I am not sure how to implement step 3.

Schema schema = new Schema.Parser().parse(schemaFile);


Update:

Schema Transform:


CSV File

customerId date amount
BS:89481 11/14/2012 124
BS:89480 11/14/2012 234
BS:89481 11/10/2012 189
BS:89480 11/02/2012 987
BS:89481 09/14/2012 784
BS:89480 11/14/2012 056

Code:


Corrected Code:


The output that I am getting is

(screenshot of the actual output; image not available)

The Expected Output

customerId date amount
BS:89481 09/30/2012 784
BS:89481 11/30/2012 313
BS:89480 11/30/2012 1277


Answer

Since you already have a PCollection of Rows, you can use schema-aware PTransforms. For your case, “Grouping aggregations” with the Group transform is a good fit: group the rows by field name and aggregate the amount field with a sum.

Grouping by customer ID and date yields a total amount per customer per day. If you want a total per month instead, create a new column (or modify the existing date column) that holds only the month, and group by customer ID and that column.
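The month column can be derived with java.time. This is a minimal sketch, assuming the dates arrive as MM/dd/yyyy strings as in the sample CSV and that the month should be represented by its last day, as in the expected output. The names MonthEndKey and toMonthEnd are hypothetical; in a pipeline this logic would live inside whichever transform adds the column.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;

final class MonthEndKey {
    // Pattern assumed from the sample CSV (e.g. 11/14/2012).
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("MM/dd/yyyy");

    /** Maps a date to the last day of its month, so all dates in a month share one key. */
    static String toMonthEnd(String date) {
        LocalDate parsed = LocalDate.parse(date, FMT);
        return YearMonth.from(parsed).atEndOfMonth().format(FMT);
    }

    public static void main(String[] args) {
        System.out.println(toMonthEnd("11/14/2012")); // 11/30/2012
        System.out.println(toMonthEnd("09/14/2012")); // 09/30/2012
    }
}
```

Because every date in the same month maps to the same string, grouping by customer ID and this value gives one row per customer per month.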

Also, Select.flattenedSchema can be useful for flattening the output schema. The Beam Schema API lets you work with schema-aware PCollections in a simple and effective way.

Another option is to implement the GroupBy/Aggregate logic manually with KVs, but that is more complicated and error-prone because it requires much more boilerplate code.
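To show what the manual approach has to do, here is a plain-Java sketch of the key-then-sum logic, deliberately written without Beam so it runs on its own. It assumes each record is a {customerId, date, amount} triple with MM/dd/yyyy dates, as in the sample CSV; MonthlyTotals and sumByCustomerAndMonth are hypothetical names. In a pipeline, the composite key would become the key of a KV and the summing would be done by a per-key combine.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

final class MonthlyTotals {
    // Pattern assumed from the sample CSV (e.g. 11/14/2012).
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("MM/dd/yyyy");

    /** Sums amounts per "customerId,monthEnd" key; each row is {customerId, date, amount}. */
    static Map<String, Long> sumByCustomerAndMonth(String[][] rows) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String[] row : rows) {
            // Collapse the date to the last day of its month.
            String monthEnd = YearMonth.from(LocalDate.parse(row[1], FMT))
                    .atEndOfMonth()
                    .format(FMT);
            // Composite key plays the role of the KV key.
            String key = row[0] + "," + monthEnd;
            // Add this amount to the running total for the key.
            totals.merge(key, Long.parseLong(row[2]), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"BS:89481", "11/14/2012", "124"},
            {"BS:89480", "11/14/2012", "234"},
            {"BS:89481", "11/10/2012", "189"},
            {"BS:89480", "11/02/2012", "987"},
            {"BS:89481", "09/14/2012", "784"},
            {"BS:89480", "11/14/2012", "056"},
        };
        sumByCustomerAndMonth(rows).forEach((key, total) -> System.out.println(key + "," + total));
    }
}
```

On the sample data this yields 784 for BS:89481 in September, 313 for BS:89481 in November, and 1277 for BS:89480 in November, matching the expected output.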
