
Apache Beam: Kafka consumer restarted over and over again

I have this very simple Beam pipeline that reads records from a Kafka topic and writes them to a Pulsar topic.

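A minimal sketch of such a pipeline, using KafkaIO for the read and a custom DoFn with the Pulsar client for the write (the broker address, topic names, Pulsar service URL, and the WriteToPulsarFn class are illustrative placeholders, not the exact original code):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class KafkaToPulsarPipeline {

  /** Writes each Kafka record's value to a Pulsar topic via the Pulsar client. */
  static class WriteToPulsarFn extends DoFn<KV<String, String>, Void> {
    private transient PulsarClient client;
    private transient Producer<String> producer;

    @Setup
    public void setup() throws Exception {
      // Placeholder service URL and topic name.
      client = PulsarClient.builder().serviceUrl("pulsar://localhost:6650").build();
      producer = client.newProducer(Schema.STRING).topic("output-topic").create();
    }

    @ProcessElement
    public void processElement(@Element KV<String, String> record) throws Exception {
      producer.send(record.getValue());
    }

    @Teardown
    public void teardown() throws Exception {
      if (producer != null) producer.close();
      if (client != null) client.close();
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        .apply("ReadFromKafka",
            KafkaIO.<String, String>read()
                .withBootstrapServers("localhost:9092")        // placeholder broker address
                .withTopic("input-topic")                      // placeholder topic name
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())                            // drop Kafka metadata, keep KV<key, value>
        .apply("WriteToPulsar", ParDo.of(new WriteToPulsarFn()));

    pipeline.run().waitUntilFinish();
  }
}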

From my understanding this should create exactly one Kafka consumer that pushes its values down the pipeline. For some reason, however, the pipeline seems to restart over and over again, creating multiple Kafka consumers and multiple Pulsar producers.

The logs show new Kafka consumers being created again and again.

Why are the Kafka consumers being restarted over and over again? Is this the expected behavior?


Answer

The primary purpose of the DirectRunner is local testing of pipelines. As such, it exhibits behavior that can be suboptimal performance-wise. For example, it purposefully serializes and deserializes data between operators even when that is not necessary, in order to validate that the application code does not modify the input objects, which is forbidden in Beam. The many consumers being created are another example of this: the DirectRunner performs checkpoints very often, and each checkpoint includes steps that may be unnecessary, such as re-creating the consumer (see here).

As such, the DirectRunner really should be used only in tests or under moderate conditions where performance is not a concern. When performance is a concern, use a different runner, either a distributed one or the local version of such a runner, e.g. the local FlinkRunner.
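As a sketch of what that switch might look like with the local Flink runner (the "[local]" Flink master value comes from Beam's FlinkPipelineOptions; the class name RunOnLocalFlink and the empty pipeline body are just illustrative):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunOnLocalFlink {
  public static void main(String[] args) {
    // Requires the Beam Flink runner artifact (versioned per Flink release) on the classpath.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(
            "--runner=FlinkRunner",    // select the Flink runner instead of the DirectRunner
            "--flinkMaster=[local]")   // run an embedded Flink cluster inside this JVM
        .create();

    Pipeline pipeline = Pipeline.create(options);
    // ... apply the same KafkaIO read and Pulsar write transforms here ...
    pipeline.run().waitUntilFinish();
  }
}

Because the original pipeline parses its options with PipelineOptionsFactory.fromArgs(args), the same effect can be achieved by simply passing --runner=FlinkRunner --flinkMaster=[local] on the command line, leaving the pipeline code unchanged.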

User contributions licensed under: CC BY-SA