Skip to content
Advertisement

What are the other options to handle skew data in Flink?

I am studying data skew processing in Flink and how I can change the low-level control of physical partition in order to have an even processing of tuples. I have created synthetic skewed data sources and I aim to process (aggregate) them over a window. Here is the complete code.

JavaScript

According to the Flink dashboard I could not see too much difference among .shuffle(), .rescale(), and .rebalance(). Even though the documentation says rebalance() transformation is more suitable for data skew.

After that I tried to use .partitionCustom(partitioner, "someKey"). However, for my surprise, I could not use setParallelism(4) on the window operation. The documentation says

Note: This operation is inherently non-parallel since all elements have to pass through the same operator instance.

I did not understand why. If I am allowed to do partitionCustom, why can’t I use parallelism after that? Here is the complete code.

JavaScript

Thanks, Felipe

Advertisement

Answer

I got an answer from FLink-user-mail list. Basically using keyBy() after rebalance() is killing all effect that rebalance() is trying to do. The first (ad-hoc) solution that I found is to create a composite key that cares about the skewed key.

JavaScript

I use it on the map function before use keyBy().

JavaScript

here is my complete solution.

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement