Apache Flink provides the FlinkML API to support machine learning. The goal of FlinkML is to provide a scalable, distributed system that can handle data of any size, whether megabytes, terabytes, or more. A major challenge for developers is the glue code needed to connect the stages of a machine learning workflow, and FlinkML aims to keep that glue code to a minimum.

The following are the categories of algorithms and tools supported by Apache FlinkML.

  • Supervised Learning
  • Unsupervised Learning
  • Data Preprocessing
  • Recommendation
  • Outlier selection
  • Utilities

Let us look at an example of the K-Means clustering algorithm, which takes a set of data points and a set of K initial cluster centers (centroids) and assigns each point to a cluster. Apache Flink ships a JAR file named "KMeans.jar" in the "flink/examples/batch" directory that can be used to run K-Means clustering.
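
To make the algorithm itself concrete, here is a minimal plain-Python sketch of K-Means. It is not the code inside KMeans.jar and does not use Flink. The function names (assign, update, kmeans), the fixed iteration count, and the sample points are all assumptions chosen for illustration.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Assumption for this sketch: an empty cluster keeps its old centroid.
            new_centroids.append(old)
        else:
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members))
            )
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of assign/update rounds (no convergence check)."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        centroids = update(points, labels, centroids)
    return centroids, assign(points, centroids)


if __name__ == "__main__":
    # Hypothetical sample data: two loose groups of 2-D points.
    sample = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0)]
    final_centroids, final_labels = kmeans(sample, [(1.0, 1.0), (5.0, 7.0)])
    for point, label in zip(sample, final_labels):
        print(point, "-> cluster", label)
```

Each round has two steps: every point is assigned to its closest centroid, and then every centroid moves to the average of its assigned points. A distributed implementation performs the same two steps, with the points partitioned across workers.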

When no input files are specified, the program uses its built-in default point and centroid data set.
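
If you would rather supply your own data than rely on the defaults, the sketch below writes a point file and a centroid file in Python. It assumes the layout used by Flink's bundled batch K-Means example: one point per line as "x y", and one centroid per line as "id x y", with fields separated by a single space. The file names and helper functions are hypothetical, and the layout should be checked against the Flink release you are running.

```python
import tempfile
from pathlib import Path


def write_points(path, points):
    """Write one 2-D point per line as 'x y' (space-delimited)."""
    lines = [f"{x} {y}" for x, y in points]
    Path(path).write_text("\n".join(lines) + "\n")


def write_centroids(path, centroids):
    """Write one centroid per line as 'id x y' (space-delimited)."""
    lines = [f"{cid} {x} {y}" for cid, (x, y) in centroids]
    Path(path).write_text("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Hypothetical file names; a temporary directory keeps the demo self-contained.
    out_dir = Path(tempfile.mkdtemp())
    write_points(out_dir / "points.txt", [(1.2, 2.3), (5.3, 7.2)])
    write_centroids(out_dir / "centroids.txt", [(1, (6.2, 3.2)), (2, (2.9, 5.7))])
    print("Wrote input files to", out_dir)
```

In Flink releases that ship this example, such files are typically passed with the --points and --centroids arguments. The usage message that the program prints when it falls back to the default data is the authoritative reference for your version.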

To run the program, use the following command.

cloudduggu@ubuntu:~/flink$ ./bin/flink run examples/batch/KMeans.jar --output Print

[Screenshot: output of the FlinkML program]

We can also check the status of the FlinkML program in the Apache Flink web GUI.

[Screenshot: the FlinkML program in the Apache Flink GUI]