Tutorial

11 min read

Flink on Kubernetes - how and why?

flink-on-kubernetes-how-why-getindata-albert-big-data-blog-macrovector

Flink is an open-source stream processing framework that supports both batch processing and data streaming programs. Streaming happens as data flows through the system with no compulsory time limitations in output. Flink is used in a lot of projects, and we can see it in action in rule-based alerting, anomaly detection systems, web applications with personalized content or personalization of the offer in e-commerce. It can be integrated with the Hadoop ecosystem and Kubernetes, or it can run stand-alone.

Kubernetes has become the major player in the containerization field. It provides a lot of advantages and new challenges - we can call it the next step of IT evolution. Its maturity and main features allow more and more services to become available and to be deployed directly on Kubernetes. Apache Flink is a great example of such a service. We can find multiple ways to install and run it directly on Kubernetes which provide better scalability, easier implementation of Continuous Integration and Continuous Deployment, easier version upgrades and container reuse.

The most important difference between the solutions mentioned below is that they include different operators. Operators are software extensions to Kubernetes that make use of custom resources to manage applications and their components. Operators are used due to their flexibility (developers can create their own objects and simplify deployment) and offers the possibility to add custom logic, however operator preparation and validation is not simple - it requires a lot of work.

Note: Apache Flink by default exposes its own web UI with a description of the job, metrics, diagram of application, and information about TaskManagers.

Common aspects of Flink in Kubernetes

Let’s start with monitoring. We can set up the exporter, like the Prometheus exporter, add annotations to our pods and start scraping metrics from all the pods on which we run Flink jobs. It’s quite simple when using Prometheus and Kubernetes service discovery feature where we can define target annotations.

The second thing is High Availability. We need to start by connecting Flink jobs to object storage or HDFS in high availability in which we store data and Flink can save its checkpoints and savepoint. Moreover, Flink requires additional service to run two JobManagers to provide a high availability setup - here we can go with connecting them to Zookeeper or using the Kubernetes service to manage the active Flink instance.

Flink K8S operator

GitHub repository:
https://github.com/lyft/flinkk8soperator

lyft-flink-on-kubernetes-getindata-big-data

This service is developed by Lyft and is a great example of software that is constantly being improved by the community. The operator acts as a control plane to manage the complete deployment lifecycle of the application. It is worth mentioning that the last version of this operator was released at the end of April 2020 with release v0.5.0. There is also a dedicated Slack workspace for the users. There is an annotation in the README file that the project is still in beta phase.

Installation and the upgrade process is quite simple. We can go with Kubernetes manifests or Helm chart - if we want to change something then we can go directly to GitHub to the fork repository and do what we want.

The CICD process is quite simple. We can use Kubernetes manifests or Helm charts with all updates and configuration there. Here, we use custom definitions managed by the operator, like the JobManager Deployment and Service, or TaskManager. If you have experience with deploying applications to Kubernetes, you'll know it is a smooth process. We can run additional checks to validate that our job has started successfully by calling the Flink job directly. This operator delivers blue green deployment that is a valuable feature. Blue-green deployment is a technique that reduces downtime and risk by running two identical production environments called Blue and Green. At any time, only one of the environments is live, with the live environment serving all production traffic - in this case we replace the old job with the new one when both are running and only one of them processes data.

The speed of submitting new jobs is not the best and it depends mostly on the amount of available resources - blue green deployment can take several minutes while some solutions described here are much faster.

Any configuration related to the job is managed in the same way, like in the YARN cluster, and it can be achieved by using templates with ConfigMaps stored in the repository. There are no problems with connecting Flink to object storage, kerberized components like Kafka or HDFS.

Advantages:

Blue-Green deployment
nice working custom objects on Kubernetes
open-source code available on GitHub
lightweight solution
simple addition of sidecar containers

Disadvantages:

no central management for jobs
no available SQL editor
we can’t run session clusters
starting a job may take some time - still in beta

Kubernetes Operator for Apache Flink

GitHub repository:
https://github.com/GoogleCloudPlatform/flink-on-k8s-operator

gcp-flink-kubernetes-getindata-big-data-cloud

It is available in the Google Cloud repository but it’s worth mentioning that it’s not officially supported by the company from Mountain View. Talking about Operator, the installation of it is pretty straightforward and we can use Kubernetes manifests or Helm charts to deploy it in our environment.

It creates a custom resource definition FlinkCluster and runs a controller Pod to keep watching custom resources. Once a FlinkCluster custom resource is created and detected by the controller, the controller creates the underlying Kubernetes resource (e.g., JobManager Pod) based on the spec of the custom resource.

CICD pipelines are simple to build as we describe our Flink job within the Helm chart or Kubernetes manifest, so it's quite a typical setup for an application running in a containerized environment. Of course it’s important to check if a job is up by calling the Kubernetes API and getting the status of Flink’s pods and its object.

The latest release, version number 0.2.0, was added in September 2020 and it’s quite an early beta version. It provides the possibility to restart failed jobs from the last savepoint and we can also set up batch scheduling for JobManager and TaskManager pods.

Advantages:

open-source code available on GitHub
lightweight solution
simple adding of sidecar containers
job and session cluster available
support for GCP services (Google Cloud Storage connector, IAM service accounts)
support for Python Apache Beam

Disadvantages:

no central management for jobs
no SQL editor available
beta release 0.2.0 - not the most mature solution

Ververica Platform

Ververica is a platform created by the team behind Apache Flink, so we can be sure that we are dealing with a product from an experienced company that understands Flink and how it works under the hood. One of the most unique features of Ververica Platform is a web dashboard in which the user can create deployment targets (add information about the name of target group and assign it to the chosen namespace), default configuration of Flink job, session clusters on which we can run Flink SQL queries or jobs; managing the lifecycle of our application and, moreover, there is quite a powerful REST API to manage job configuration, creating new jobs. Ververica applies new enhancements based on user feedback.

ververica-platform-getindata-flink-on-kubernetes

It is also a useful platform for running Flink SQL queries directly from Ververica and taking a look into the results from the UI instead of the terminal.

The CICD process requires using the REST API to which we send requests with configuration in YAML or JSON format so it can be easily implemented into any CICD tool like Gitlab CI, Jenkins or GitHub actions.

Installation of the Ververica Platform comes with its own SQLite database (it can be replaced with PostgreSQL, MySQL in the case of any version except the Community Edition) in which it stores all its data. It also simplifies configuration of external blob storage that can be used as the artifactory for Flink JARs as well as the place to store save points and checkpoints of jobs. It works with S3-compatible object storage as well as with kerberized HDFS.

The Ververica Platform is available as a free Community Edition which comes with a limitation of one namespace for jobs and no authentication; Startup Edition and Enterprise Edition. It’s worth mentioning that the authors are constantly improving their product and we can see frequent updates.

Advantages:

simple installation process
powerful REST API that is useful for CICD pipelines
management of Flink jobs (configuration, state)
Flink SQL editor
Autorestart mechanism directly within the Ververica Platform (not only in Flink jobs

Disadvantages:

limitations in the Community Edition, i.e. limited number of active namespaces and no authentication
using custom Flink images requires multiple changes in the Dockerfile and utilized Flink shell scripts

Native Kubernetes from official Apache Flink

Documentation:
https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/standalone/kubernetes.html

We can’t forget about the possibilities that are already available directly within Apache Flink. Since the release of 1.12.0 Flink has improved all aspects of running it in Kubernetes. It doesn’t require the addition of any CustomObjects or additional applications for managing Flink.

We have the following deployments in this case:

A Flink Session cluster is executed as a long-running Kubernetes Deployment. You can run multiple Flink jobs in a Session cluster. Each job needs to be submitted to the cluster after the cluster has been deployed.
A Flink Application cluster is a dedicated cluster which runs a single application

Native Kubernetes doesn’t introduce any operator and the installation is quite simple. Moreover, we are sure that it will be developed with adding enhancements to the service constantly as it is supported by a wide group of contributors of Apache Flink. Currently it’s really limited but in some cases it may be enough.

CICD process is the same as in any other application running in Kubernetes - we deploy everything straight from the CICD tool and we need to verify if the application is up or if it has any issues.

Advantages:

no installation required
simple adding of sidecar containers

Disadvantages:

no custom objects
no support for per-job clusters
limited features with no restart mechanism excluding Flink job directly.

Several ways to run Flink on Kubernetes

getindata-flink-on-kubernetes-big-data

Based on our experience, we can recommend using Ververica Platform as the reliable solution for most use cases. It provides a fast way to run the Flink job - restarting the job with the new cluster is quite fast in comparison to, for example, Flink based on Lyft Operator and we can find some useful features like Flink SQL editor.

Surely, the choice of a perfect Flink operator depends on the exact use case, technical requirements and number of jobs. All the presented operators come from strong players in the Big Data market.

At GetInData you have access to 50+ distributed systems and cloud experts working with big data systems based on Apache Flink in multiple configurations. Do not hesitate to contact us, our team will be happy to discuss your real-time big data streaming project.

big data

kubernetes

apache flink

flink

getindata

CI/CD

Last updated: 23 February 2021

Written by

Albert Lewandowski

Big Data DevOps Engineer

Like this post?
Spread the word

Want more? Check our articles

Tutorial

Apache NiFi and Apache NiFi Registry on Kubernetes

Apache NiFi is a popular, big data processing engine with graphical Web UI that provides non-programmers the ability to swiftly and codelessly create…

Tutorial

Data online generation for event stream processing

In a lot of business cases that we solve at Getindata when working with our clients, we need to analyze sessions: a series of related events of actors…

Tutorial

A Step-by-Step Guide to Training a Machine Learning Model using BigQuery ML (BQML)

What is BigQuery ML? BQML empowers data analysts to create and execute ML models through existing SQL tools & skills. Thanks to that, data analysts…

modern data stack gcp workflowsobszar roboczy 1 4

Tutorial

GCP Workflows - how can you integrate a lightweight managed orchestrator in Modern Data Stack? GID Modern Data Platform

Modern Data Stack has been around for some time already. Both tools and integration patterns have become more mature and battle tested. We shared our…

Tutorial

Automated Machine Learning (AutoML) with BigQuery ML. Start Machine Learning easily and validate if ML is worth investing in or not.

Machine learning is becoming increasingly popular in many industries, from finance to marketing to healthcare. But let's face it, that doesn't mean ML…

5 main data-related trends to be covered at Big Data Tech Warsaw 2021 Part II

Trend 4. Larger clouds over the Big Data landscape A decade ago, only a few companies ran their Big Data infrastructure and pipelines in the public…

Check All

Contact us

Interested in our solutions?
Contact us!

Together, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.

What did you find most impressive about GetInData?

They did a very good job in finding people that fitted in Acast both technically as well as culturally.

Type the form or send a e-mail: hello@getindata.com

Flink on Kubernetes - how and why?

Common aspects of Flink in Kubernetes

Flink K8S operator

Kubernetes Operator for Apache Flink

Ververica Platform

Native Kubernetes from official Apache Flink

Several ways to run Flink on Kubernetes

Like this post?Spread the word

Want more? Check our articles

Apache NiFi and Apache NiFi Registry on Kubernetes

Data online generation for event stream processing

A Step-by-Step Guide to Training a Machine Learning Model using BigQuery ML (BQML)

GCP Workflows - how can you integrate a lightweight managed orchestrator in Modern Data Stack? GID Modern Data Platform

Automated Machine Learning (AutoML) with BigQuery ML. Start Machine Learning easily and validate if ML is worth investing in or not.

5 main data-related trends to be covered at Big Data Tech Warsaw 2021 Part II

Contact us

Interested in our solutions?Contact us!

Like this post?
Spread the word

Interested in our solutions?
Contact us!