Kubernetes Horizontal Pod Autoscaler

One of the basic Kubernetes features is the ability to manually scale pods horizontally, up or down, by increasing or decreasing the desired replica count field on the pod controller. Automatic scaling is built on top of that. Horizontal pod autoscaling is the automatic increase or decrease of the number of pod replicas managed by a higher-level controller that supports scaling, such as a Deployment, ReplicaSet or StatefulSet. As the number of pods increases, the average per-pod value of the metric being targeted should come down. Metrics appropriate for autoscaling are those whose average value decreases linearly as the pod replica count increases.

How it Works

The scaling is performed by the horizontal pod autoscaler controller, which is enabled and configured by a HorizontalPodAutoscaler Kubernetes API resource. For a horizontal pod autoscaler to work correctly, a source of metrics, in particular resource metrics, must be deployed. The simplest source of resource metrics is the metrics server.

The controller periodically reads the appropriate metrics API to obtain metrics for the pods it monitors. The set of pods to watch is provided by the higher-level pod controller the autoscaler is associated with. The autoscaler calculates the number of replicas required to meet the target metric configured on the HorizontalPodAutoscaler resource, as described in the Autoscaling Algorithm section below. If that number differs from the current replica count, the controller adjusts the "replicas" field of the scaled resource through the Scale sub-resource. The target pod controller is not aware of the autoscaler. As far as it is concerned, anybody, including the autoscaler, may update the replica count.
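One pass of the loop described above can be sketched as follows. This is a simplified illustration, not the controller's actual code: the Scale class, the reconcile function and its parameters are hypothetical stand-ins, and the metric reading and replica calculation are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scale:
    """Hypothetical stand-in for the Scale sub-resource of the scaled controller."""
    replicas: int


def reconcile(
    scale: Scale,
    read_pod_metrics: Callable[[], List[float]],
    compute_desired: Callable[[List[float]], int],
    min_replicas: int,
    max_replicas: int,
) -> bool:
    """Run one autoscaler pass; return True if the replica count was changed."""
    metrics = read_pod_metrics()            # one value per monitored pod
    desired = compute_desired(metrics)      # see the Autoscaling Algorithm section
    # Keep the result within the bounds configured on the autoscaler resource.
    desired = max(min_replicas, min(max_replicas, desired))
    if desired == scale.replicas:
        return False                        # no mismatch, nothing to do
    scale.replicas = desired                # the scaled controller only sees a new count
    return True


scale = Scale(replicas=3)
changed = reconcile(scale, lambda: [60, 90, 50], lambda m: 4, 1, 10)
print(changed, scale.replicas)
```

The scaled controller never calls into the autoscaler. It only observes that its replica count changed, which matches the decoupling described above.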

Each component gets metrics from its source periodically. The end effect is that it takes a while for metrics to be propagated and a rescaling action to be performed.

Figure: Metrics and Autoscaling

Horizontal Pod Autoscaler Controller

The horizontal pod autoscaler controller is part of the cluster's controller manager process.

Autoscaling Algorithm

Autoscaling can be performed for one metric or multiple metrics. If multiple metrics are involved, the controller computes the required number of pods for each metric and picks the largest value.

For a specific metric, the goal of the algorithm is to compute the number of replicas that will bring the average value of the metric as close to the target value as possible. The input is a set of metrics, one for each pod, and the output is an integer, which represents the target number of pod replicas.
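As a rough illustration, the per-metric calculation and the multi-metric selection can be sketched as below. The sketch assumes the result is rounded up, in line with the formula given in the Kubernetes documentation (desired replicas = ceil(current replicas × current average / target average), which equals the sum of the per-pod values divided by the target). The function names and sample numbers are hypothetical, and the tolerance band the real controller applies is left out.

```python
import math


def replicas_for_metric(pod_values, target_average):
    """Smallest replica count that keeps the per-pod average at or below the target."""
    if target_average <= 0:
        raise ValueError("target_average must be positive")
    # Assumption: the total load stays constant while it is spread over more pods.
    return max(1, math.ceil(sum(pod_values) / target_average))


def replicas_for_all_metrics(metrics):
    """metrics maps a metric name to (per-pod values, target average)."""
    return max(
        replicas_for_metric(values, target) for values, target in metrics.values()
    )


example = {
    "cpu_percent": ([60, 90, 50], 50),          # 200 / 50 = 4 replicas
    "requests_per_second": ([30, 30, 30], 20),  # 90 / 20 = 4.5, rounded up to 5
}
print(replicas_for_all_metrics(example))  # prints 5
```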

The algorithm also needs to make sure that the autoscaler does not thrash around when the metric value is unstable and changes rapidly.

The autoscaler will at most double the number of replicas in a single operation if more than two current replicas exist. If only one replica exists, there is no such limitation. The autoscaler also limits how soon a subsequent autoscale operation can occur after the previous one. A scale-up will occur only if no rescaling event occurred in the last three minutes. A scale-down is performed less frequently, every five minutes.
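The limits above can be sketched as a small function that filters a proposed replica count. This is an interpretation, not the controller's code. It assumes that the five-minute rule means no scale-down happens within five minutes of the previous rescaling, and that the doubling cap is skipped for one or two replicas (the text only states this for one replica). All names are hypothetical.

```python
SCALE_UP_COOLDOWN_S = 3 * 60    # no scale-up within three minutes of a rescale
SCALE_DOWN_COOLDOWN_S = 5 * 60  # assumed: no scale-down within five minutes


def limit_rescale(current, desired, seconds_since_last_rescale):
    """Return the replica count actually applied after the rate limits."""
    if desired > current:
        if seconds_since_last_rescale < SCALE_UP_COOLDOWN_S:
            return current
        if current > 2:
            # With more than two replicas, at most double in one operation.
            return min(desired, 2 * current)
        # Assumption: no step cap when only one or two replicas exist.
        return desired
    if desired < current:
        if seconds_since_last_rescale < SCALE_DOWN_COOLDOWN_S:
            return current
        return desired
    return current


print(limit_rescale(4, 20, 600))  # prints 8: capped at double
print(limit_rescale(8, 2, 240))   # prints 8: still inside the scale-down window
```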

Horizontal Pod Autoscaler Resource

The HorizontalPodAutoscaler Kubernetes API resource is deployed like any other Kubernetes resource, by posting a manifest to the API server.

The HorizontalPodAutoscaler minReplicas field cannot be set to zero, so the autoscaler never scales down to zero pods. Idling and un-idling are therefore not supported.

The resource can be changed after deployment. The changes will be detected and acted upon.

HorizontalPodAutoscaler Manifest
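A minimal manifest for CPU-based scaling might look like the sketch below. It is written as a Python dictionary and serialized to JSON, which the API server accepts alongside YAML. The field names follow the autoscaling/v2 API. The resource names "example-hpa" and "example-app" and the replica and utilization values are placeholders chosen for illustration.

```python
import json

hpa_manifest = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "example-hpa"},  # placeholder name
    "spec": {
        # The higher-level controller whose replica count is adjusted.
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "example-app",  # placeholder name
        },
        "minReplicas": 1,  # cannot be zero
        "maxReplicas": 5,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    # Target average CPU utilization, as a percentage of the request.
                    "target": {"type": "Utilization", "averageUtilization": 50},
                },
            }
        ],
    },
}

print(json.dumps(hpa_manifest, indent=2))
```

The printed JSON can be saved to a file and posted with kubectl apply -f, the same way as a YAML manifest.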

CPU-based Scaling

The CPU-based scaling algorithm uses the guaranteed CPU amount (the CPU requests) when determining the CPU utilization of the pod. This means the pod needs to have the CPU requests set, either directly or through a LimitRange object, to be eligible for autoscaling.

CPU average utilization is the container's actual CPU usage divided by its CPU request value.
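The ratio can be illustrated with a short calculation. The function name and the millicore figures are hypothetical. Because the request is a guaranteed amount rather than a cap, the resulting utilization can exceed 100%.

```python
def cpu_utilization_percent(usage_millicores, request_millicores):
    """CPU utilization relative to the requested (guaranteed) CPU amount."""
    if request_millicores <= 0:
        # Without a CPU request there is nothing to divide by, which is why
        # the pod must have requests set to be eligible for autoscaling.
        raise ValueError("a positive CPU request is required")
    return 100.0 * usage_millicores / request_millicores


print(cpu_utilization_percent(50, 200))   # prints 25.0
print(cpu_utilization_percent(150, 100))  # prints 150.0, usage above the request
```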

Scale Deployments instead of ReplicaSets. This way, the desired replica count is preserved across application updates, because a Deployment creates a new ReplicaSet for each version.

Memory-based Scaling

Memory-based autoscaling is much more problematic than CPU-based autoscaling. The main reason is that after scaling up, the old pods would somehow need to be forced to release memory. This has to be done by the application itself; the system cannot do it. All the system could do is kill and restart the application, hoping that it would use less memory than before.

Custom Metrics-based Scaling

