I’m currently working through this tutorial on Stream processing in Apache Flink and am a little confused on how the TimeCharacteristics of a StreamEnvironment effect the order of the data values in the stream and in respect to which time an onTimer function of a ProcessFunction is called.
In the tutorial, they set the characteristics to EventTime
, since we want to compare the start & end events based on the time they store and not the time they are received in the stream.
Now in the reference solution they set a timerService to fire 2 hours after an events timestamp for each key.
What really confuses me is when this timer actually fires during runtime. Possible explanation I came up with:
Setting the TimeCharacteristics
to EventTime
makes the stream to process the entries ordered by their event timestamp and this way the timer can be fired for each rideId, when an event arrives with a timestamp > rideId.timeStamp + 2 hours
(2 hours coming from exercise context).
But with this explanation a startEvent of a Taxi ride would always be processed before an endEvent (I’m assuming that a ride can’t end before it started), and we wouldn’t have to check if a matching EndEvent has already arrived like they do in the processElement function.
In the documentation of ProcessFunction
they state that the timer is called
“When a timer’s particular time is reached”
but since we have a (potentially infinite) stream of data and we don’t care when the data point arrives but only when it happened, how can we be sure that there will not arrive a matching data point for a startEvent somewhere in the future that would trigger the criteria with 2 hours stated in the exercise?
If someone could link me an explanation of this or correct me where I’m wrong that would be highly appreciated.
Advertisement
Answer
An event-time timer fires when Flink is satisfied that all events with timestamps earlier than the time in the timer have already been processed. This is done by waiting for the current watermark to reach the time specified in the timer.
When working with event-time, events are usually processed out-of-order, and this is the case in the exercises you are working with. In general, watermarks are used to mark the passage of event-time — a watermark is characterized by a timestamp t, and indicates that the stream is now complete up through time t (meaning that all earlier events have already been processed). In the training exercises, the TaxiRideSource is parameterized according to how much out-of-orderness you want to have, and the TaxiRideSource takes care to emit appropriately delayed watermarks.
You can read more about event time and watermarks in the Flink documentation.