The task of the “Mail Count” exercise is to count the number of emails in the archive of the Flink development mailing list for each unique combination of email address and month.
This exercise uses the Mail Data Set, which was extracted from the Apache Flink development mailing list archive. The Mail Data Set instructions show how to read the data set in a Flink program.
The task requires two fields, Timestamp and Sender. The input data can be read as a DataSet<Tuple2<String, String>>. When printed, the data set should look similar to this:
(2014-09-26-08:49:58,Fabian Hueske <email@example.com>)
(2014-09-12-14:50:38,Aljoscha Krettek <firstname.lastname@example.org>)
(2014-09-30-09:16:29,Stephan Ewen <email@example.com>)
The result of the task should be a DataSet<Tuple3<String, String, Integer>>. The first field specifies the month, the second field the email address, and the third field the number of emails sent to the mailing list by the given email address in the given month. When printed, the data set should look like this:
(2014-09,firstname.lastname@example.org,16)
(2014-09,email@example.com,13)
(2014-09,firstname.lastname@example.org,24)
(2014-10,email@example.com,14)
(2014-10,firstname.lastname@example.org,17)
The first line of the example result indicates that 16 emails were sent to the mailing list by firstname.lastname@example.org in September 2014.
This exercise is similar to the WordCount program, which is the standard example to introduce MapReduce. Like WordCount, this task requires two transformations, Map and Reduce. Unlike the WordCount program, this exercise requires grouping the data on two fields (month and email address) instead of a single field. Flink’s DataSet.groupBy() transformation accepts multiple grouping keys and treats them as one composite grouping key.
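The idea of a composite grouping key can be sketched without Flink. The following is a minimal plain-Java illustration, not the reference solution and not Flink API: the class name CompositeKeyCount and the sample addresses are hypothetical, and java.util.stream stands in for Flink's grouping. Each record is a {month, address} pair, and both fields together form the key that records are counted under.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: plain-Java stand-in for grouping on two fields.
final class CompositeKeyCount {

    // Counts records per (month, address) pair; each record is {month, address}.
    static Map<List<String>, Long> countByMonthAndAddress(List<String[]> records) {
        return records.stream().collect(Collectors.groupingBy(
                r -> Arrays.asList(r[0], r[1]), // both fields form one composite key
                Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample records, not taken from the Mail Data Set.
        List<String[]> records = Arrays.asList(
                new String[] {"2014-09", "a@example.org"},
                new String[] {"2014-09", "a@example.org"},
                new String[] {"2014-10", "a@example.org"},
                new String[] {"2014-09", "b@example.org"});
        countByMonthAndAddress(records)
                .forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}
```

In a Flink program on a tuple data set, the same effect would presumably come from passing both field positions to groupBy(), for example groupBy(0, 1), followed by a grouped aggregation.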
The Map transformation is used for record-at-a-time processing and should be used to extract the relevant information from the input data, i.e., the month from the timestamp field and the email address from the sender field.
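The per-record extraction could look roughly like the sketch below. It is plain Java rather than the reference solution; the class and method names (MailFields, month, emailAddress) are hypothetical, and it assumes that timestamps always start with a yyyy-MM prefix and that senders follow the "Name <address>" pattern shown in the sample input above.

```java
// Hypothetical helper: field extraction for one (timestamp, sender) record.
final class MailFields {

    // Timestamps look like "2014-09-26-08:49:58"; the first 7 characters are "yyyy-MM".
    static String month(String timestamp) {
        return timestamp.substring(0, 7);
    }

    // Senders look like "Name <address>"; return the text between the last '<'
    // and the '>' that follows it.
    static String emailAddress(String sender) {
        int open = sender.lastIndexOf('<');
        int close = sender.indexOf('>', open + 1);
        if (open < 0 || close < 0) {
            // Assumption: without angle brackets the field is already a bare address.
            return sender.trim();
        }
        return sender.substring(open + 1, close);
    }

    public static void main(String[] args) {
        String timestamp = "2014-09-26-08:49:58";
        String sender = "Fabian Hueske <email@example.com>";
        // prints "2014-09,email@example.com"
        System.out.println(month(timestamp) + "," + emailAddress(sender));
    }
}
```

In a Flink program, logic like this would typically sit inside the map() method of a MapFunction that turns each Tuple2<String, String> into a (month, email address) record.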
Reference solutions are available at GitHub: