Environment: HDP 3.1.5 (Hadoop 3.1.1, Hive 3.1.0), Flink 1.12.2. Java code: Dependency: Error 1: Adding a dependency produced another error. I then tried to resolve the conflict between commons-cli 1.3.1 and 1.2: choosing 1.3.1 gives error 1; choosing 1.2 gives error 2; adding a dependency on commons-cli 1.4 gives error 1. Answer
Tag: apache-flink
Filtering duplicates out of an infinite DataStream with windows
I want to filter out duplicates in Flink from an infinite DataStream. I know the duplicates arise only within a small time window (max 10 seconds). I found a promising and fairly simple approach here, but it doesn’t work. It uses a keyed DataStream and returns only the first message of every window. This is my window code: MyKeySelector() is
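One possible way to drop duplicates that arrive within a bounded interval is to remember, per key, when that key was last emitted. The sketch below is plain Java rather than Flink code, and it is a different technique from the "first message of every window" approach described in the question. `TimeBoundedDeduplicator`, `isFirst`, and `ttlMillis` are hypothetical names; in a Flink job the map would typically be replaced by keyed state.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: a key is emitted only if it was not emitted during the last ttlMillis.
final class TimeBoundedDeduplicator {
    private final long ttlMillis;
    // Timestamp of the last emitted record per key. Entries are never evicted
    // here, so a real job would need some form of state cleanup.
    private final Map<String, Long> lastEmitted = new HashMap<>();

    TimeBoundedDeduplicator(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns true if the record should be emitted, false if it is a duplicate.
    boolean isFirst(String key, long timestampMillis) {
        Long previous = lastEmitted.get(key);
        if (previous != null && timestampMillis - previous < ttlMillis) {
            return false; // same key seen less than ttlMillis ago
        }
        lastEmitted.put(key, timestampMillis);
        return true;
    }
}
```

Unlike fixed tumbling windows, this check does not depend on where a window boundary falls, so two duplicates that arrive one second apart are caught even if they would land in different windows.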
javadoc of SingleOutputStreamOperator#returns(TypeHint typeHint) method
I am reading the source code of SingleOutputStreamOperator#returns. Its javadoc is: It mentions FunctionWithNonInferrableReturnType to showcase the necessity of the returns method, but I am unable to write such a class whose return type is non-inferrable. Could you please help write a simple one? Thanks! Answer When the docs say NonInferrableReturnType, it means that we can use the type variable <T>, or
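To illustrate why a return type can be non-inferrable, here is a sketch in plain Java reflection (it does not use Flink's type extraction, and `ErasureDemo`, `Identity`, `declaredOutputType`, and `isInferrable` are hypothetical names). A generic class keeps only the type variable `T` in its class metadata, while an anonymous subclass with concrete type arguments records them.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.function.Function;

final class ErasureDemo {
    // Generic function: T is chosen at the call site and erased at runtime.
    static final class Identity<T> implements Function<T, T> {
        @Override
        public T apply(T value) {
            return value;
        }
    }

    // Returns the second type argument (the output type) that the object's
    // class declares for the Function interface, or null if it declares none.
    static Type declaredOutputType(Function<?, ?> f) {
        for (Type t : f.getClass().getGenericInterfaces()) {
            if (t instanceof ParameterizedType) {
                ParameterizedType p = (ParameterizedType) t;
                if (p.getRawType() == Function.class) {
                    return p.getActualTypeArguments()[1];
                }
            }
        }
        return null;
    }

    // True only when the declared output type is a concrete class.
    static boolean isInferrable(Function<?, ?> f) {
        return declaredOutputType(f) instanceof Class;
    }
}
```

For `new ErasureDemo.Identity<String>()` the declared output type is the type variable `T`, so `isInferrable` returns false; for an anonymous `Function<String, Integer>` it is `Integer.class`. In the first situation a framework has nothing concrete to read, which is the kind of gap an explicit type hint fills.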
Object reuse – mutating same object – in Flink operators
I was reading the doc here, which gives a use case for reusing an object, as given below: So instead of creating a new Tuple every time, it seems possible to reuse the same Tuple, relying on its mutability to reduce the pressure on GC. Would this be applicable in all operators, where we can
Flink throws NullPointerException when adding salt for the key and window aggregation on some field
I have a program that does two-phase aggregation to handle the data skew in my job. I used a simple ThreadLocalRandom to generate a suffix for my original key, like: But Flink throws a NullPointerException when I add the salt to the key and do a window aggregation on some field. I found a similar post on the Flink mailing list, and got the reason
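The reason from the mailing list is not shown in the excerpt. One general rule that is relevant to salting is that a key selector should return the same key every time it is called for the same record. The sketch below is plain Java under that assumption: it draws the salt once, stores it on the record, and derives the key from the stored value. `SaltedKeyDemo`, `Event`, `withSalt`, `saltedKey`, `unstableKey`, and `SALT_BUCKETS` are hypothetical names.

```java
import java.util.concurrent.ThreadLocalRandom;

final class SaltedKeyDemo {
    static final int SALT_BUCKETS = 8; // hypothetical number of salt values

    // Record that carries its salt, so the key can be recomputed identically.
    static final class Event {
        final String key;
        final int salt;

        Event(String key, int salt) {
            this.key = key;
            this.salt = salt;
        }
    }

    // Step 1 (a map-like step before keying): draw the salt exactly once.
    static Event withSalt(String key) {
        return new Event(key, ThreadLocalRandom.current().nextInt(SALT_BUCKETS));
    }

    // Step 2 (a key-selector-like step): a pure function of the record.
    static String saltedKey(Event e) {
        return e.key + "-" + e.salt;
    }

    // Anti-pattern for comparison: may return a different key on every call.
    static String unstableKey(String key) {
        return key + "-" + ThreadLocalRandom.current().nextInt(SALT_BUCKETS);
    }
}
```

With `saltedKey`, repeated calls for the same `Event` always agree, whereas `unstableKey` can disagree with itself between calls.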
Flink ElasticsearchSinkFunction not serializable in non-static method, but serializable in static method
I have a piece of code that only works inside static methods. If I put the code in a static method and then call it from a non-static method, it works. I have never heard of anything like this and couldn’t find any information about it online. This works: This doesn’t work: (Full) stack trace: The implementation of the provided ElasticsearchSinkFunction is not serializable.
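The question's code is not shown, so the following is only a sketch of one common Java cause of this symptom, not a diagnosis. An anonymous class created in an instance method can hold a reference to its enclosing object, and if that object is not serializable, neither is the anonymous class. `SerializableAction` and `SinkBuilder` are hypothetical stand-ins and are not part of the Elasticsearch connector.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a serializable function interface.
interface SerializableAction extends Serializable {
    String apply(String element);
}

// Deliberately NOT Serializable, like a typical job or builder class.
final class SinkBuilder {
    private final String indexName;

    SinkBuilder(String indexName) {
        this.indexName = indexName;
    }

    // Created in an instance method: reading the field indexName makes the
    // anonymous class hold a reference to the enclosing SinkBuilder.
    SerializableAction fromInstanceMethod() {
        return new SerializableAction() {
            @Override
            public String apply(String element) {
                return indexName + ":" + element;
            }
        };
    }

    // Created in a static method: there is no enclosing instance, so only
    // the captured String parameter is serialized.
    static SerializableAction fromStaticMethod(final String indexName) {
        return new SerializableAction() {
            @Override
            public String apply(String element) {
                return indexName + ":" + element;
            }
        };
    }

    // Tries Java serialization and reports whether it succeeded.
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) { // NotSerializableException is an IOException
            return false;
        }
    }
}
```

Here `isSerializable(new SinkBuilder("idx").fromInstanceMethod())` returns false, while `isSerializable(SinkBuilder.fromStaticMethod("idx"))` returns true. If this is what happens in the question, copying the needed values into local variables or using a static nested class would be typical ways around it.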
Flink, rule of using ‘object reuse mode’
The docs say this mode can cause bugs, but they do not state the rules for using it. In what cases will it cause bugs? Let’s say I have a job: source: Kafka (byte[] data); flat-map: parse byte[] into a Google Protobuf object ‘foo’, create a Tuple2<>(foo.id, foo), and return this Tuple2; keyBy and process: for each id, put the
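As a rough illustration of the kind of bug that reusing a mutable object can produce, the plain Java sketch below buffers references to a single instance that is mutated for every record. It only mimics the aliasing problem and is not Flink's runtime; `ObjectReuseDemo`, `Pair`, and `bufferedValues` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class ObjectReuseDemo {
    // Hypothetical mutable record, standing in for a reused tuple.
    static final class Pair {
        String id;
        int value;

        Pair copy() {
            Pair p = new Pair();
            p.id = id;
            p.value = value;
            return p;
        }
    }

    // Simulates an upstream step that mutates one instance per record and a
    // downstream step that keeps the objects it receives in a buffer.
    static List<Integer> bufferedValues(boolean copyBeforeBuffering) {
        Pair reused = new Pair();
        List<Pair> buffer = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            reused.id = "a";
            reused.value = i; // the same instance is overwritten each time
            buffer.add(copyBeforeBuffering ? reused.copy() : reused);
        }
        List<Integer> values = new ArrayList<>();
        for (Pair p : buffer) {
            values.add(p.value);
        }
        return values;
    }
}
```

`bufferedValues(false)` yields `[3, 3, 3]` because every buffered entry points at the same object, while `bufferedValues(true)` yields `[1, 2, 3]`. A reasonable rule of thumb that follows from this is to avoid holding on to, or modifying, an input object after it has been handed on, unless a copy is made first.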
Compare 2 Data streams in Flink to retrieve missing data
I have two data streams, “dataStream1” and “dataStream2”, both of type String, and I have created two tables from them. I need to get all the data in T1 that doesn’t exist in T2 within a time frame of 5 seconds. For example, if I insert a record into T1, it should exist in T2 within 5 seconds, otherwise
Few kafka partitions are not getting assigned to any flink consumer
I have a Kafka topic with 15 partitions [0-14] and I’m running Flink with a parallelism of 5. So ideally each parallel Flink consumer should consume 3 partitions. But even after multiple restarts, a few of the Kafka partitions are not subscribed to by any Flink worker. From the above logs, it shows that partitions 10 and 13 have been subscribed by 2
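The expectation of 3 partitions per consumer can be checked with a plain round-robin calculation. This is a simplified sketch for the arithmetic only and is not claimed to be the assignment logic the Flink Kafka connector actually uses; `RoundRobinAssignment`, `subtaskFor`, and `assign` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class RoundRobinAssignment {
    // Maps a partition index to a subtask index by simple modulo.
    static int subtaskFor(int partition, int parallelism) {
        return partition % parallelism;
    }

    // Groups partitions [0, numPartitions) by the subtask that would own them.
    static List<List<Integer>> assign(int numPartitions, int parallelism) {
        List<List<Integer>> bySubtask = new ArrayList<>();
        for (int s = 0; s < parallelism; s++) {
            bySubtask.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            bySubtask.get(subtaskFor(p, parallelism)).add(p);
        }
        return bySubtask;
    }
}
```

With 15 partitions and a parallelism of 5, `assign(15, 5)` gives every subtask exactly 3 partitions, and each partition has exactly one owner, which is the baseline to compare the logged assignments against.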
How do I time the checkpointing in Apache Flink streaming?
I am running the Fraud Detector example of Apache Flink with RocksDB as my state backend. I want to know how long Apache Flink takes to checkpoint the state. My approach is to print the time before and after the checkpoint functions. I could not find the function, class, or any piece of code that checkpoints the state. I tried debugging