28 May 2020
By Mohammed Abubakar
There is no one “right” path to Kubernetes success; instead, several good paths exist. In this series of blogs we dive into the core areas of Kubernetes: security, efficiency, and reliability. Our goal is to provide you with Kubernetes best practices for adoption and implementation so you can realise long-term value across your entire organisation.
In the first blog of the series, I discussed the security challenges of running Kubernetes at scale and ways to ensure the security of your clusters through best practices. In the second blog, I covered key best practices for maximising the health and efficiency of your Kubernetes clusters.
For companies that have not yet adopted Kubernetes, reliability becomes harder and harder to achieve as the business scales. For companies that have adopted Kubernetes but have yet to solicit expert help, achieving reliability is complex due to the skill it takes to optimise the capabilities Kubernetes offers.
Below, I highlight several key Kubernetes best practices related to reliability.
Avoid complexity by keeping it simple. Manually maintained DNS entries can be used to point to an application, and DNS hostnames can be hardcoded into application components so they can communicate. However, rather than relying on that kind of static traffic routing, use service discovery, which is a more streamlined, dynamic solution. Service discovery enables a user or another application to find instances, pods, or containers. It is necessary because your application scales in and out, with changes happening at a fast rate.
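In Kubernetes, service discovery is typically handled by a Service object, which gives a stable DNS name and virtual IP in front of an ever-changing set of pods. The sketch below is illustrative; the names (`web`, port 8080) are assumptions, not part of any particular application:

```yaml
# Hypothetical Service fronting pods labelled app: web.
# Other workloads in the cluster can reach it via the stable DNS name
# web.default.svc.cluster.local, no matter how pods come and go.
apiVersion: v1
kind: Service
metadata:
  name: web
  namespace: default
spec:
  selector:
    app: web            # endpoints are discovered from matching pods
  ports:
    - port: 80          # port the Service exposes
      targetPort: 8080  # port the container listens on
```

Because the Service tracks pod endpoints automatically, no DNS entries need to be maintained by hand as the application scales.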
Teams often use configuration management tools like Puppet to continuously correct state in a virtual machine (VM) that runs an application, which offers some auditability and security. For example, if someone connects to a VM that runs an application and changes a config file, Puppet will change it back. Containers, however, are more ephemeral. If you need to change something about how an application runs, CI/ CD best practices dictate that you should build and then deploy a new container image through your CI pipeline instead of attempting to modify an existing container.
Kubernetes helps improve reliability by making it possible to schedule containers across multiple nodes and multiple availability zones (AZs) in the cloud. Pod anti-affinity allows you to constrain which nodes your pods are eligible to be scheduled on based on the labels of pods already running on those nodes, rather than based on node labels. With node selection, a node must have each of the indicated key-value pairs as labels for a pod to be eligible to run on it. When you create a Kubernetes deployment, use anti-affinity or node selection to help spread your applications across the Kubernetes cluster for high availability.
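As a sketch, both techniques live in the pod template of a Deployment. The app name, image, and the `disktype: ssd` node label below are made up for illustration; the `topology.kubernetes.io/zone` key is the standard well-known zone label:

```yaml
# Illustrative Deployment spreading replicas across zones with pod
# anti-affinity and restricting them to SSD-labelled nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      nodeSelector:
        disktype: ssd          # only nodes labelled disktype=ssd qualify
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: web     # avoid co-locating with pods of the same app
              topologyKey: topology.kubernetes.io/zone  # at most one replica per zone
      containers:
        - name: web
          image: example/web:1.0
```

With this spec, losing one node or one zone leaves the other replicas serving traffic.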
Kubernetes HA means having no single point of failure in any Kubernetes component. Examples of such components are the Kubernetes API server and the etcd database, where Kubernetes stores cluster state. How do you help ensure these components are HA?
Let’s say you are using Kubernetes on premises and you have three master servers with a load balancer that runs on a single machine. While you have multiple masters, your one load balancer is a single point of failure for the Kubernetes API. You need to avoid this.
If a redundant component in your Kubernetes cluster is lost, the cluster keeps operating, because Kubernetes best practice is to deploy a number of redundant instances appropriate to each component (for example, an odd number of etcd members, at least three; at least two API servers; at least two kube-scheduler instances). But what happens if you then lose a second component? If you have three masters and you lose one, the two remaining masters could become overloaded, contributing to the degradation or potential loss of another master. It’s key to plan the resiliency of your cluster according to the risk your business can tolerate.
Resource requests and limits for CPU and memory are at the heart of what allows the Kubernetes scheduler to do its job well. If a single pod is allowed to consume all of the node’s CPU and memory, then other pods will be starved for resources. Setting limits on what a pod can consume increases reliability by keeping any one pod from monopolising the available resources on a node and starving its neighbours (the “noisy neighbour problem”).
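In practice, requests and limits are set per container in the pod spec. The figures below are illustrative placeholders, not recommendations; the right values depend on profiling your workload:

```yaml
# Container spec fragment: requests guide the scheduler,
# limits cap what the container may actually consume.
resources:
  requests:
    cpu: 250m          # scheduler reserves a quarter of a CPU core
    memory: 256Mi      # scheduler reserves 256 MiB on the chosen node
  limits:
    cpu: 500m          # container is throttled beyond half a core
    memory: 512Mi      # container is OOM-killed if it exceeds this
```

Requests determine where a pod can be placed; limits determine how badly a misbehaving pod can affect its neighbours.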
Autoscaling, in turn, can increase cluster reliability by allowing the cluster to respond to changes in load. The Horizontal Pod Autoscaler (HPA) and the Cluster Autoscaler work together to provide a stable cluster by scaling your application pods and your cluster nodes respectively.
Reliability first requires good resource requests and limits, and the Cluster Autoscaler will have a hard time doing its job if your resource requests are not set correctly. The Cluster Autoscaler relies on the scheduler to know that a pod won’t fit on the current nodes, and it also relies on the resource request to determine whether adding a new node will allow the pod to run.
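The dependency on requests is visible in how an HPA is defined: CPU utilisation targets are expressed relative to the pod’s CPU request. The sketch below assumes a Deployment named `web` and uses the `autoscaling/v2` API; the replica counts and 70% target are illustrative only:

```yaml
# Illustrative HorizontalPodAutoscaler scaling the hypothetical
# "web" Deployment on average CPU utilisation.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # add pods when average CPU exceeds 70% of the request
```

If the CPU request is missing or wrong, that 70% target is meaningless, which is why well-set requests come first.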
Another important facet of cluster reliability involves the concept of “self-healing.” The idea here is to automatically detect issues in the cluster and automatically fix those issues. This concept is built into Kubernetes in the form of liveness and readiness probes.
A liveness probe indicates whether a container is running and healthy, and it is fundamental to the proper functioning of a Kubernetes cluster. If this probe moves into a failing state, Kubernetes will automatically kill the container and restart it according to the pod’s restart policy. Conversely, if a container does not have a liveness probe, a faulty or non-functioning container will continue to run indefinitely, using up valuable resources and possibly causing application errors.
A readiness probe, on the other hand, is used to indicate when a container is ready to serve traffic. If the pod is behind a Kubernetes service, the pod will not be added to the list of available endpoints in that service until all of the containers in that pod are marked as ready. This procedure allows you to keep unhealthy pods from serving any traffic or accepting any requests, thus preventing your application from exposing errors.
Both probes are checks that the Kubernetes cluster performs on your containers at set intervals. Each probe has two states, pass and fail, along with a threshold for how many times the probe has to fail or succeed before the state changes. When configured correctly on all of your containers, these two probe types provide the cluster with the ability to “self-heal”: problems that arise in containers are automatically detected, and the affected pods are restarted or taken out of service automatically.
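The interplay of intervals and thresholds described above maps directly onto probe configuration in a container spec. The endpoints `/healthz` and `/ready` below are hypothetical; your application must actually serve them:

```yaml
# Container spec fragment with both probe types.
livenessProbe:
  httpGet:
    path: /healthz       # hypothetical liveness endpoint
    port: 8080
  initialDelaySeconds: 10  # give the app time to start before probing
  periodSeconds: 10        # probe interval
  failureThreshold: 3      # restart the container after three consecutive failures
readinessProbe:
  httpGet:
    path: /ready         # hypothetical readiness endpoint
    port: 8080
  periodSeconds: 5
  failureThreshold: 1      # remove the pod from Service endpoints after one failure
```

A common design choice is to make the readiness check stricter (for example, verifying downstream dependencies) while keeping the liveness check minimal, so that a slow dependency takes a pod out of rotation without triggering restarts.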
Reliability in a Kubernetes environment is synonymous with stability, streamlined development and operations, and a better user experience. In a Kubernetes environment, reliability becomes much easier to achieve with the right configuration; the flip side is that it is just as easy to configure things incorrectly. Many factors need to be considered when building a stable and reliable Kubernetes cluster, including the possible need for application changes and changes to cluster configuration. Steps include setting resource requests and limits, autoscaling pods using a metric that represents application load, and using liveness and readiness probes.
As you move into the world of IaC, containers, cloud-native apps, and Kubernetes, consider where your existing tools and processes fit. Also consider managed Kubernetes as a means of taking full advantage of all of the benefits containerised applications offer.
If you are considering Kubernetes to take full advantage of all of the benefits containerised applications offer, please contact us here.