The docs say this mode can cause bugs, but they do not state the rules for using it. In what cases will it cause bugs? Let's say I have a job:
- source: kafka (byte[] data),
- flat-map: parse byte[] to a Google Protobuf object ‘foo’, create a Tuple2<>(foo.id, foo), and return this Tuple2
- keyBy and process: for each id, put the first foo into ValueState, and update the ValueState if multiple objects arrive with the same id. Emit the first foo (updated) after 10 seconds.
In this case, is it OK to turn on ‘object reuse mode’?
Answer
For the pipeline you have described, yes, object reuse can be safely enabled.
Object reuse only applies to situations where data is forwarded between operator instances within the same task, so in your case, between the source and flatmap. The keyBy forces serialization/deserialization and a network shuffle, so object reuse cannot be used between the flatmap and process function. But object reuse would probably also apply between the process function and sink (which I assume is present).
With object reuse enabled, it is NOT safe to
- remember input object references across function calls or
- modify input objects
If you avoid those two points, you may safely
- modify an output object and emit it again
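To make the first rule concrete, here is a minimal plain-Java sketch of why remembering input references is risky. It does not use Flink at all: the `Foo` class, `emitWithReuse`, and both `remember...` methods are hypothetical names, and the upstream stage that recycles a single instance is a simplified model assumed for illustration, not Flink's actual internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ObjectReuseDemo {

    /** Hypothetical mutable record standing in for the parsed 'foo'. */
    public static final class Foo {
        String id;
        int count;

        Foo(String id, int count) {
            this.id = id;
            this.count = count;
        }

        Foo copy() {
            return new Foo(id, count);
        }
    }

    /**
     * Simplified model (an assumption, not Flink code) of an upstream stage
     * that hands the SAME instance to the downstream function for every record.
     */
    static void emitWithReuse(String[] ids, Consumer<Foo> downstream) {
        Foo reused = new Foo(null, 0);
        for (int i = 0; i < ids.length; i++) {
            reused.id = ids[i];      // overwrite the shared instance in place
            reused.count = i + 1;
            downstream.accept(reused);
        }
    }

    /** Unsafe: keeps references to the input object across calls. */
    public static List<String> rememberReferences(String[] ids) {
        List<Foo> buffer = new ArrayList<>();
        emitWithReuse(ids, buffer::add);
        return idsOf(buffer);
    }

    /** Safe: keeps a defensive copy of each input instead of the reference. */
    public static List<String> rememberCopies(String[] ids) {
        List<Foo> buffer = new ArrayList<>();
        emitWithReuse(ids, foo -> buffer.add(foo.copy()));
        return idsOf(buffer);
    }

    private static List<String> idsOf(List<Foo> foos) {
        List<String> result = new ArrayList<>();
        for (Foo foo : foos) {
            result.add(foo.id);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] ids = {"a", "b", "c"};
        System.out.println(rememberReferences(ids)); // prints [c, c, c]
        System.out.println(rememberCopies(ids));     // prints [a, b, c]
    }
}
```

In this model every buffered reference points at the same instance, so the buffer only ever shows the last record. Copying the input before storing it avoids the problem, at the cost of the allocation that object reuse was meant to save.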
By the way, it would be preferable to implement your deserialization in a DeserializationSchema or KafkaDeserializationSchema, rather than in a flatmap, in which case object reuse would be irrelevant for that part of your pipeline.