Tag: apache-flink

Remote Flink job querying Hive on yarn-cluster fails with NoClassDefFoundError: org/apache/hadoop/mapred/JobConf

Environment: HDP 3.1.5 (Hadoop 3.1.1, Hive 3.1.0), Flink 1.12.2. Java code: Dependency: Error 1: I tried adding a dependency and got another error. I then tried to fix the conflict between commons-cli 1.3.1 and 1.2: choosing 1.3.1 gives error 1; choosing 1.2 gives error 2; adding commons-cli 1.4 as a dependency gives error 1 again. Answer

Filtering duplicates out of an infinite DataStream with windows

apache-flink flink-streaming java

I want to filter out duplicates in Flink from an infinite DataStream. I know the duplicates arise only in a small time window (max 10 seconds). I found a promising, fairly simple approach here, but it doesn’t work. It uses a keyed DataStream and returns only the first message of every window. This is my window code: MyKeySelector() is
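
The idea described in this question (emit only the first message per key within a short window) can be sketched in plain Java, independent of Flink. This is a minimal illustration, not the asker's window code: the class name `WindowedDeduplicator`, its methods, and the use of a `HashMap` as per-key state are all assumptions, and timestamps are assumed to arrive roughly in order.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: passes only the first event per key within a time window.
class WindowedDeduplicator {
    private final long windowMillis;
    // key -> timestamp of the event that opened the current window for that key
    private final Map<String, Long> firstSeen = new HashMap<>();

    WindowedDeduplicator(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if the event should be emitted, false if it is a duplicate. */
    boolean accept(String key, long timestampMillis) {
        Long first = firstSeen.get(key);
        if (first != null && timestampMillis - first < windowMillis) {
            return false; // same key seen inside the window: drop it
        }
        firstSeen.put(key, timestampMillis); // open a new window for this key
        return true;
    }

    /** Drops expired entries so the state stays bounded on an infinite stream. */
    void evictOlderThan(long nowMillis) {
        firstSeen.values().removeIf(ts -> nowMillis - ts >= windowMillis);
    }

    public static void main(String[] args) {
        WindowedDeduplicator dedup = new WindowedDeduplicator(10_000L);
        System.out.println(dedup.accept("a", 0L));
        System.out.println(dedup.accept("a", 5_000L));
    }
}
```

In a Flink job the map would typically be replaced by keyed state with a time-to-live, so that eviction happens per key without a manual sweep.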

javadoc of SingleOutputStreamOperator#returns(TypeHint typeHint) method

apache-flink flink-streaming java

I am reading the source code of SingleOutputStreamOperator#returns. Its javadoc is: It mentions FunctionWithNonInferrableReturnType to showcase the necessity of the returns method, but I am unable to write such a class whose return type is non-inferrable. Could you please help write a simple one? Thanks! Answer: When the docs say NonInferrableReturnType, it means that we can use the type variable <T>, or
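
To see why a type variable `<T>` makes the return type non-inferrable, it helps to look at what reflection can recover from a class definition. The sketch below is only an illustration: `GenericIdentity`, `StringIdentity`, and `declaredReturnType` are hypothetical names, and it uses `java.util.function.Function` instead of Flink's `MapFunction` so that it runs without Flink on the classpath.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.lang.reflect.TypeVariable;
import java.util.function.Function;

class NonInferrableDemo {
    // Return type is the type variable T: nothing concrete survives in the class file.
    static class GenericIdentity<T> implements Function<T, T> {
        @Override
        public T apply(T value) {
            return value;
        }
    }

    // Return type is fixed to String in the class definition, so it can be read back.
    static class StringIdentity implements Function<String, String> {
        @Override
        public String apply(String value) {
            return value;
        }
    }

    /** Reads the second type argument (the result type) of the implemented Function. */
    static Type declaredReturnType(Class<?> functionClass) {
        ParameterizedType fn = (ParameterizedType) functionClass.getGenericInterfaces()[0];
        return fn.getActualTypeArguments()[1];
    }

    public static void main(String[] args) {
        // For GenericIdentity the result is a TypeVariable, not a concrete class.
        System.out.println(declaredReturnType(GenericIdentity.class) instanceof TypeVariable);
        System.out.println(declaredReturnType(StringIdentity.class));
    }
}
```

A class shaped like `GenericIdentity<T>` is the kind of function for which a caller has to supply the result type explicitly, which is the situation the `returns` method addresses.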

Object reuse – mutating same object – in Flink operators

apache-flink flink-streaming java

I was reading the doc here, which gives a use case for reusing an object, as shown below: So every time, instead of creating a new Tuple, it seems to be able to use the same Tuple by relying on its mutable nature in order to decrease the pressure on GC. Would it be applicable in all operators, where we can

Flink throws NullPointerException when adding salt for the key and window aggregation on some field

apache-flink java

I have a program doing two-phase aggregation to solve the data skew in my job. I used a simple ThreadLocalRandom to generate a suffix for my original key, like: But Flink throws a NullPointerException when I add the salt to the key and do a window aggregation on some field. I found a similar post on the Flink mailing list, and got the reason
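
One commonly cited cause of this kind of failure is a key selector that is not deterministic: if the random salt is generated inside the key selector, the same record can yield a different key each time the selector is invoked. The excerpt does not show the asker's code or the reason they found, so the following is only a sketch under that assumption. `SaltedEvent`, `assignSalt`, and `KEY_SELECTOR` are hypothetical names, and a plain `java.util.function.Function` stands in for a Flink `KeySelector`.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

class SaltedKeySketch {
    /** Record that carries its salt as data, so the key can be recomputed identically. */
    static final class SaltedEvent {
        final String key;
        final int salt;
        final long value;

        SaltedEvent(String key, int salt, long value) {
            this.key = key;
            this.salt = salt;
            this.value = value;
        }
    }

    /** Randomness happens exactly once per record, in a map-like step before keying. */
    static SaltedEvent assignSalt(String key, long value, int buckets) {
        int salt = ThreadLocalRandom.current().nextInt(buckets);
        return new SaltedEvent(key, salt, value);
    }

    /** Deterministic: only reads fields already stored on the record. */
    static final Function<SaltedEvent, String> KEY_SELECTOR =
            event -> event.key + "#" + event.salt;

    public static void main(String[] args) {
        SaltedEvent event = assignSalt("user-1", 42L, 8);
        // Calling the selector repeatedly on the same record yields the same key.
        System.out.println(KEY_SELECTOR.apply(event).equals(KEY_SELECTOR.apply(event)));
    }
}
```

The second aggregation phase would then key by `event.key` alone to merge the per-salt partial results.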

Flink ElasticsearchSinkFunction not serializable in non-static method, but serializable in static method

apache-flink flink-streaming java object static

I have a piece of code that only works inside static methods. If I put the code in a static method, then call it from a non-static method, it works. Never heard of anything like this and couldn’t find information online on it. This works: This doesn’t work: (Full) stack trace: The implementation of the provided ElasticsearchSinkFunction is not serializable.
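
A plausible explanation for this static versus non-static difference is that an anonymous class created in an instance method can capture the enclosing object, which then has to be serializable too. The question's code is not shown, so this is a plain-Java sketch of that mechanism rather than a diagnosis: `SinkFactory`, `SinkFunction`, and `isSerializable` are hypothetical, and the anonymous class deliberately reads an instance field so that the enclosing instance must be captured.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Deliberately NOT Serializable, like a typical job or service class.
class SinkFactory {
    private final String index = "events"; // hypothetical configuration value

    /** Stand-in for a serializable sink callback interface. */
    interface SinkFunction extends Serializable {
        String process(String element);
    }

    /** Anonymous class reads this.index, so it holds a reference to the SinkFactory. */
    SinkFunction fromInstanceMethod() {
        return new SinkFunction() {
            @Override
            public String process(String element) {
                return index + ":" + element;
            }
        };
    }

    /** No enclosing instance exists here; only the String argument is captured. */
    static SinkFunction fromStaticMethod(final String indexName) {
        return new SinkFunction() {
            @Override
            public String process(String element) {
                return indexName + ":" + element;
            }
        };
    }

    /** Attempts Java serialization and reports whether it succeeded. */
    static boolean isSerializable(Object candidate) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (IOException e) {
            return false; // includes NotSerializableException for the captured outer object
        }
    }

    public static void main(String[] args) {
        System.out.println(isSerializable(new SinkFactory().fromInstanceMethod()));
        System.out.println(isSerializable(fromStaticMethod("events")));
    }
}
```

Copying the needed values into local variables before creating the function, or using a static nested class, avoids the capture without moving the whole method to a static context.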

Flink, rule of using ‘object reuse mode’

apache-flink java

The doc says this mode can cause bugs, but it does not state the rules for using it: in which cases will it cause bugs? Let’s say I have a job: source: Kafka (byte[] data); flat-map: parse the byte[] into a Google Protobuf object ‘foo’, create a Tuple2<>(foo.id, foo), and return this Tuple2; keyBy and process: for each id, put the
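
One general hazard with any object reuse scheme is holding on to a reference to an object that the producer will later overwrite. The sketch below shows that aliasing effect in plain Java; it does not use Flink, and `Event`, `collectReferences`, and `collectCopies` are hypothetical names chosen for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ObjectReuseHazard {
    /** Small mutable record, standing in for a reused tuple. */
    static final class Event {
        String id;
        long value;
    }

    /** Buffers references to ONE shared instance that is overwritten for every element. */
    static List<Event> collectReferences(long[] values) {
        Event reused = new Event();
        List<Event> buffered = new ArrayList<>();
        for (long v : values) {
            reused.id = "k";
            reused.value = v;
            buffered.add(reused); // stores the same reference every time
        }
        return buffered;
    }

    /** Buffers a defensive copy per element, so later mutation cannot affect it. */
    static List<Event> collectCopies(long[] values) {
        Event reused = new Event();
        List<Event> buffered = new ArrayList<>();
        for (long v : values) {
            reused.id = "k";
            reused.value = v;
            Event copy = new Event();
            copy.id = reused.id;
            copy.value = reused.value;
            buffered.add(copy);
        }
        return buffered;
    }

    public static void main(String[] args) {
        long[] input = {1L, 2L, 3L};
        for (Event e : collectReferences(input)) {
            System.out.println("reference: " + e.value);
        }
        for (Event e : collectCopies(input)) {
            System.out.println("copy: " + e.value);
        }
    }
}
```

Under this model, code that only reads an object during a single call is unaffected, while code that buffers or caches received objects across calls needs to copy them first.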

Compare 2 Data streams in Flink to retrieve missing data

apache-flink flink-sql flink-streaming java

I have 2 data streams, “dataStream1” and “dataStream2”, both of type String, and I have created 2 tables out of them. I need to get all the data in T1 that doesn’t exist in T2 within a time frame of 5 seconds; for example, if I have inserted a record in T1, it should exist in T2 within 5 seconds, otherwise
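
The requirement stated here (a record in T1 counts as missing unless the same record appears in T2 within 5 seconds) can be written down as a small batch-style check. This is a plain-Java sketch of the matching rule only, not a streaming or Flink SQL solution: `MissingWithinWindow` and `missing` are hypothetical names, each table is modeled as a map from record to arrival time in milliseconds, and T2 arrivals are assumed to come at or after the T1 arrival.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class MissingWithinWindow {
    /**
     * Returns the T1 records that have no matching T2 record arriving
     * within windowMillis after the T1 arrival time, sorted for stable output.
     */
    static List<String> missing(Map<String, Long> t1, Map<String, Long> t2, long windowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : t1.entrySet()) {
            Long seenInT2 = t2.get(entry.getKey());
            boolean matched = seenInT2 != null
                    && seenInT2 >= entry.getValue()
                    && seenInT2 - entry.getValue() <= windowMillis;
            if (!matched) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

In a streaming job the same rule would need a timer per T1 record, because a record can only be declared missing once its 5-second deadline has passed.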

Few kafka partitions are not getting assigned to any flink consumer

apache-flink apache-kafka java

I have a Kafka topic with 15 partitions [0-14] and I’m running Flink with a parallelism of 5. So ideally each parallel Flink consumer should consume 3 partitions. But even after multiple restarts, a few of the Kafka partitions are not subscribed to by any Flink worker. The logs above show that partitions 10 and 13 have been subscribed by 2
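
The expectation of 3 partitions per consumer follows from spreading 15 partitions evenly over 5 subtasks. The sketch below checks that arithmetic with a simple round-robin rule that starts at a topic-dependent offset. The rule, the class `PartitionAssignmentSketch`, and the topic name are assumptions chosen for illustration, not a statement of what the Flink Kafka connector does.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionAssignmentSketch {
    /** Maps one partition to a subtask index in [0, parallelism). */
    static int assign(String topic, int partition, int parallelism) {
        // floorMod keeps the offset non-negative even when hashCode() is negative.
        int startIndex = Math.floorMod(topic.hashCode(), parallelism);
        return (startIndex + partition) % parallelism;
    }

    /** Groups partitions 0..partitions-1 by the subtask they are assigned to. */
    static List<List<Integer>> assignAll(String topic, int partitions, int parallelism) {
        List<List<Integer>> bySubtask = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            bySubtask.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            bySubtask.get(assign(topic, p, parallelism)).add(p);
        }
        return bySubtask;
    }

    public static void main(String[] args) {
        // "my-topic" is a placeholder topic name.
        List<List<Integer>> assignment = assignAll("my-topic", 15, 5);
        for (int subtask = 0; subtask < assignment.size(); subtask++) {
            System.out.println("subtask " + subtask + " -> " + assignment.get(subtask));
        }
    }
}
```

Under such a rule every partition has exactly one owner, so a partition that shows up on two subtasks or on none suggests that the subtasks disagree about the partition list or the parallelism they are computing with.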

How do I time the checkpointing in Apache Flink streaming?

apache-flink flink-streaming java

I am running the Fraud Detector example of Apache Flink with RocksDB as my state backend. I want to know how long Apache Flink takes to checkpoint the state. My approach is to print the time before and after the checkpoint functions. I could not find the function, class, or any piece of code that checkpoints the state. I tried debugging