If you haven't already done so, you'll need to first setup your Flink development environment. See How to do the Labs for an overall introduction to these exercises.
The task of the "Taxi Ride Cleansing" exercise is to cleanse a stream of TaxiRide events by removing events that start or end outside of New York City.
The GeoUtils
utility class provides a static method isInNYC(float lon, float lat)
to check if a location is within the NYC area.
This exercise is based on a stream of TaxiRide
events, as described in Using the Taxi Data Streams.
The result of the exercise should be a DataStream<TaxiRide>
that only contains events of taxi rides which both start and end in the New York City area as defined by GeoUtils.isInNYC()
.
The resulting stream should be printed to standard out.
ℹ️ Rather than following the links to the sources in this section, you'll do better to find these classes in the flink-training project in your IDE. Both IntelliJ and Eclipse have ways to make it easy to search for and navigate to classes and files. For IntelliJ, see the help on searching, or simply press the Shift key twice and then continue typing something like
RideCleansing
and then select from the choices that popup.
This exercise uses these classes:
- Java:
org.apache.flink.training.exercises.ridecleansing.RideCleansingExercise
- Scala:
org.apache.flink.training.exercises.ridecleansing.scala.RideCleansingExercise
You will find the tests for this exercise in
- Java:
org.apache.flink.training.exercises.ridecleansing.RideCleansingIntegrationTest
- Java:
org.apache.flink.training.exercises.ridecleansing.RideCleansingUnitTest
- Scala:
org.apache.flink.training.exercises.ridecleansing.scala.RideCleansingIntegrationTest
- Scala:
org.apache.flink.training.exercises.ridecleansing.scala.RideCleansingUnitTest
Like most of these exercises, at some point the RideCleansingExercise
class throws an exception
throw new MissingSolutionException();
Once you remove this line, the test will fail until you provide a working solution. You might want to first try something clearly broken, such as
return false;
in order to verify that the test does indeed fail when you make a mistake, and then work on implementing a proper solution.
Filtering Events
Flink's DataStream API features a DataStream.filter(FilterFunction)
transformation to filter events from a data stream. The GeoUtils.isInNYC()
function can be called within a FilterFunction
to check if a location is in the New York City area. Your filter function should check both the starting and ending locations of each ride.
Reference solutions are available in this project:
- Java:
org.apache.flink.training.solutions.ridecleansing.RideCleansingSolution
- Scala:
org.apache.flink.training.solutions.ridecleansing.scala.RideCleansingSolution