I am a newbie with Apache Flink. I have an unbounded data stream in my input (fed into Flink 0.10 via Kafka). I want to get the first occurrence of each primary key (the primary key is the contract_num plus the event_dt). These “duplicates” occur nearly immediately after each other. The source system cannot filter this for me, so Flink has to do it.
Here is my input data:
contract_num, event_dt, attr
A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:08, Y
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C
Here is the output data I want:
A1, 2016-02-24 10:25:08, X
A1, 2016-02-24 10:25:09, Z
A2, 2016-02-24 10:25:10, C
Note that the second row has been removed, as the key combination of A1 and ‘2016-02-24 10:25:08’ already occurred in the first row.
How can I do this with Flink 0.10? I was thinking about using keyBy(0, 1), but after that I don’t know what to do!
(I used joda-time and org.flinkspector to set up these tests.)
@Test
public void test() {
    DateTime threeSecondsAgo = (new DateTime()).minusSeconds(3);
    DateTime twoSecondsAgo = (new DateTime()).minusSeconds(2);
    DateTime oneSecondAgo = (new DateTime()).minusSeconds(1);

    DataStream<Tuple3<String, Date, String>> testStream =
        createTimedTestStreamWith(Tuple3.of("A1", threeSecondsAgo.toDate(), "X"))
            .emit(Tuple3.of("A1", threeSecondsAgo.toDate(), "Y"), after(0, TimeUnit.NANOSECONDS))
            .emit(Tuple3.of("A1", twoSecondsAgo.toDate(), "Z"), after(0, TimeUnit.NANOSECONDS))
            .emit(Tuple3.of("A2", oneSecondAgo.toDate(), "C"), after(0, TimeUnit.NANOSECONDS))
            .close();

    testStream.keyBy(0, 1);
}
Answer
Filtering duplicates over an infinite stream will eventually fail if your key space is larger than your available storage space. The reason is that you have to store the already seen keys somewhere to filter out the duplicates. Thus, it would be good to define a time window after which you can purge the current set of seen keys.
If you’re aware of this problem but want to try it anyway, you can do it by applying a stateful flatMap operation after the keyBy call. The stateful mapper uses Flink’s state abstraction to store whether it has already seen an element with this key or not. That way, you will also benefit from Flink’s fault tolerance mechanism, because your state will be automatically checkpointed.
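Note that checkpointing is not enabled by default. A minimal sketch of turning it on for the environment (the 5-second interval is just an illustrative value):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// trigger a checkpoint of the operator state every 5 seconds (illustrative interval)
env.enableCheckpointing(5000);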
A Flink program doing your job could look like this:
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<Tuple3<String, Date, String>> input = env.fromElements(
        Tuple3.of("foo", new Date(1000), "bar"),
        Tuple3.of("foo", new Date(1000), "foobar"));

    input.keyBy(0, 1).flatMap(new DuplicateFilter()).print();

    env.execute("Test");
}
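Since both sample elements share the key (foo, new Date(1000)), only the first one should pass the filter; the printed output would be a single line roughly like the following (the exact Date formatting depends on your locale and time zone):

(foo,Thu Jan 01 00:00:01 UTC 1970,bar)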
where the implementation of DuplicateFilter depends on the version of Flink.
Version >= 1.0 implementation
public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {

    static final ValueStateDescriptor<Boolean> descriptor =
            new ValueStateDescriptor<>("seen", Boolean.class, false);

    private ValueState<Boolean> operatorState;

    @Override
    public void open(Configuration configuration) {
        operatorState = this.getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
        if (!operatorState.value()) {
            // we haven't seen the element yet
            out.collect(value);
            // set operator state to true so that we don't emit elements with this key again
            operatorState.update(true);
        }
    }
}
Version 0.10 implementation
public static class DuplicateFilter extends RichFlatMapFunction<Tuple3<String, Date, String>, Tuple3<String, Date, String>> {

    private OperatorState<Boolean> operatorState;

    @Override
    public void open(Configuration configuration) {
        operatorState = this.getRuntimeContext().getKeyValueState("seen", Boolean.class, false);
    }

    @Override
    public void flatMap(Tuple3<String, Date, String> value, Collector<Tuple3<String, Date, String>> out) throws Exception {
        if (!operatorState.value()) {
            // we haven't seen the element yet
            out.collect(value);
            operatorState.update(true);
        }
    }
}
Update: Using a tumbling time window
input.keyBy(0, 1)
    .timeWindow(Time.seconds(1))
    .apply(new WindowFunction<Iterable<Tuple3<String, Date, String>>, Tuple3<String, Date, String>, Tuple, TimeWindow>() {
        @Override
        public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple3<String, Date, String>> input, Collector<Tuple3<String, Date, String>> out) throws Exception {
            // emit only the first element of each window
            out.collect(input.iterator().next());
        }
    });
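As a side note (not part of the original answer), the same per-window deduplication can also be expressed with a ReduceFunction that always keeps the earlier element, which lets Flink aggregate incrementally instead of buffering the whole window. A minimal sketch, assuming the same input stream as above:

input.keyBy(0, 1)
    .timeWindow(Time.seconds(1))
    .reduce(new ReduceFunction<Tuple3<String, Date, String>>() {
        @Override
        public Tuple3<String, Date, String> reduce(Tuple3<String, Date, String> first, Tuple3<String, Date, String> second) {
            // keep the first element for this key within the window, drop the later duplicate
            return first;
        }
    });

Either way, two duplicates that fall into different 1-second windows will both be emitted; given that your duplicates arrive nearly immediately after each other, this should rarely matter in practice.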