Kubernetes Resources, Scheduler and Autoscaler
- 1. Kubernetes Components
- 2. Resources for Containers
- 3. Pod Disruptions
- 4. Scheduling, Preemption and Eviction
- 5. Kubernetes Autoscaling
- 6. References
1. Kubernetes Components
A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. Every cluster has at least one worker node.
The worker node(s) host the Pods that are the components of the application workload. The control plane manages the worker nodes and the Pods in the cluster. In production environments, the control plane usually runs across multiple computers and a cluster usually runs multiple nodes, providing fault-tolerance and high availability.
2. Resources for Containers
When you specify a Pod, you can optionally specify how much of each resource a Container needs. The most common resources to specify are CPU and memory (RAM); there are others.
When you specify the resource request for Containers in a Pod, the scheduler uses this information to decide which node to place the Pod on. When you specify a resource limit for a Container, the kubelet enforces those limits so that the running container is not allowed to use more of that resource than the limit you set. The kubelet also reserves at least the request amount of that system resource specifically for that container to use.
2.1. Resource requests and limits
If the node where a Pod is running has enough of a resource available, it’s possible (and allowed) for a container to use more of that resource than its request specifies. However, a container is not allowed to use more than its resource limit.
CPU and memory are each a resource type. A resource type has a base unit. CPU represents compute processing and is specified in units of Kubernetes CPUs. Memory is specified in units of bytes.
Huge pages are a Linux-specific feature where the node kernel allocates blocks of memory that are much larger than the default page size.
Each Container of a Pod can specify one or more of the following:
spec.containers[].resources.limits.cpu
spec.containers[].resources.limits.memory
spec.containers[].resources.limits.hugepages-<size>
spec.containers[].resources.requests.cpu
spec.containers[].resources.requests.memory
spec.containers[].resources.requests.hugepages-<size>
Although requests and limits can only be specified on individual Containers, it is convenient to talk about Pod resource requests and limits. A Pod resource request/limit for a particular resource type is the sum of the resource requests/limits of that type for each Container in the Pod.
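As a sketch of the fields above, a Pod manifest with per-container requests and limits might look like the following (the Pod name, image, and amounts are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend            # hypothetical Pod name
spec:
  containers:
  - name: app
    image: nginx            # illustrative image
    resources:
      requests:
        cpu: 250m           # the scheduler places the Pod based on these
        memory: 64Mi
      limits:
        cpu: 500m           # the kubelet throttles CPU usage above this
        memory: 128Mi       # exceeding this can get the container OOM-killed
```

With a single container, the Pod-level request is 250m CPU / 64Mi memory and the Pod-level limit is 500m CPU / 128Mi memory, i.e. the sums over all containers.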
Meaning of CPU
Limits and requests for CPU resources are measured in cpu units. One cpu, in Kubernetes, is equivalent to 1 vCPU/Core for cloud providers and 1 hyperthread on bare-metal Intel processors.
Fractional requests are allowed. When you define a container with a CPU request of 0.5, you are requesting half as much CPU time as if you asked for 1.0 CPU. For CPU resource units, the expression 0.1 is equivalent to the expression 100m, which can be read as "one hundred millicpu". Some people say "one hundred millicores", and this is understood to mean the same thing. A request with a decimal point, like 0.1, is converted to 100m by the API, and precision finer than 1m is not allowed. For this reason, the form 100m might be preferred.
CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.
CPU is considered a “compressible” resource. If your app starts hitting your CPU limits, Kubernetes starts throttling your container. This means the CPU will be artificially restricted, giving your app potentially worse performance! However, it won’t be terminated or evicted. You can use a liveness health check to make sure performance has not been impacted.
Meaning of memory
Limits and requests for memory are measured in bytes. You can express memory as a plain integer or as a fixed-point number using one of these suffixes: E, P, T, G, M, k. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.
Unlike CPU resources, memory cannot be compressed. Because there is no way to throttle memory usage, if a container goes past its memory limit it will be terminated. If your pod is managed by a Deployment, StatefulSet, DaemonSet, or another type of controller, then the controller spins up a replacement.
2.2. Local ephemeral storage
FEATURE STATE: Kubernetes v1.10 [beta]
Nodes have local ephemeral storage, backed by locally-attached writeable devices or, sometimes, by RAM. "Ephemeral" means that there is no long-term guarantee about durability.
Pods use ephemeral local storage for scratch space, caching, and for logs. The kubelet can provide scratch space to Pods using local ephemeral storage to mount
emptyDir volumes into containers.
The kubelet also uses this kind of storage to hold node-level container logs, container images, and the writable layers of running containers.
You can use ephemeral-storage for managing local ephemeral storage. Each Container of a Pod can specify one or both of the following:
spec.containers[].resources.limits.ephemeral-storage
spec.containers[].resources.requests.ephemeral-storage
Limits and requests for ephemeral-storage are measured in bytes. You can express storage as a plain integer or as a fixed-point number using one of these suffixes: E, P, T, G, M, K. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.
If the kubelet is managing local ephemeral storage as a resource, then the kubelet measures storage use in:
emptyDir volumes, except tmpfs emptyDir volumes
directories holding node-level logs
writeable container layers
If a Pod is using more ephemeral storage than you allow it to, the kubelet sets an eviction signal that triggers Pod eviction.
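As an illustrative sketch (Pod name, image, and sizes are assumptions), a container that writes scratch data to an emptyDir volume could bound its local ephemeral storage like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: log-writer             # hypothetical name
spec:
  containers:
  - name: app
    image: busybox             # illustrative image
    command: ["sh", "-c", "while true; do date >> /scratch/log; sleep 1; done"]
    resources:
      requests:
        ephemeral-storage: 1Gi
      limits:
        ephemeral-storage: 2Gi # exceeding this triggers eviction of the Pod
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir: {}
```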
2.3. The lifecycle of a Kubernetes Pod
When you create a Pod, the Kubernetes scheduler looks for a Node to place it on. Kubernetes then checks to see if the Node has enough resources to fulfill the resource requests on the Pod’s containers. If it doesn’t, it moves on to the next node.
If none of the Nodes in the system have resources left to fill the requests, then Pods go into a pending state. By using GKE features such as the Node Autoscaler, Kubernetes Engine can automatically detect this state and create more Nodes automatically. If there is excess capacity, the autoscaler can also scale down and remove Nodes to save you money!
But what about limits? As you know, limits can be higher than the requests. What if you have a Node where the sum of all the container Limits is actually higher than the resources available on the machine?
At this point, Kubernetes goes into something called an overcommitted state. Here is where things get interesting. Because CPU can be compressed, Kubernetes will make sure your containers get the CPU they requested and will throttle the rest. Memory cannot be compressed, so Kubernetes needs to start making decisions on what containers to terminate if the Node runs out of memory.
2.4. Quality of Service for Pods
In an overcommitted system (where sum of limits > machine capacity) containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: Guaranteed, Burstable, and BestEffort, in decreasing order of priority.
For a Pod to be given a QoS class of Guaranteed:
Every Container in the Pod must have a memory limit and a memory request.
For every Container in the Pod, the memory limit must equal the memory request.
Every Container in the Pod must have a CPU limit and a CPU request.
For every Container in the Pod, the CPU limit must equal the CPU request.
These restrictions apply to init containers and app containers equally.
|If a Container specifies its own memory limit, but does not specify a memory request, Kubernetes automatically assigns a memory request that matches the limit. Similarly, if a Container specifies its own CPU limit, but does not specify a CPU request, Kubernetes automatically assigns a CPU request that matches the limit.|
A Pod is given a QoS class of Burstable if:
The Pod does not meet the criteria for QoS class Guaranteed.
At least one Container in the Pod has a memory or CPU request.
For a Pod to be given a QoS class of BestEffort, the Containers in the Pod must not have any memory or CPU limits or requests.
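To make the criteria concrete, here is a sketch of the container resources stanza that would land a single-container Pod in each class (the amounts are illustrative):

```yaml
# Guaranteed: every container has requests == limits for both CPU and memory
resources:
  requests: {cpu: 500m, memory: 256Mi}
  limits:   {cpu: 500m, memory: 256Mi}
---
# Burstable: at least one request is set, but requests != limits
resources:
  requests: {cpu: 100m, memory: 128Mi}
  limits:   {cpu: 500m, memory: 512Mi}
---
# BestEffort: no requests or limits at all
resources: {}
```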
Pods will not be killed if CPU guarantees cannot be met (for example if system tasks or daemons take up lots of CPU), they will be temporarily throttled.
Memory is an incompressible resource and so let’s discuss the semantics of memory management a bit.
BestEffort pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory. These containers can use any amount of free memory in the node though.
Guaranteed pods are considered top-priority and are guaranteed to not be killed until they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
Burstable pods have some form of minimal resource guarantee, but can use more resources when available. Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no BestEffort pods exist.
2.5. Node Allocatable
Kubernetes nodes can be scheduled to Capacity. Pods can consume all the available capacity on a node by default. This is an issue because nodes typically run quite a few system daemons that power the OS and Kubernetes itself. Unless resources are set aside for these system daemons, pods and system daemons compete for resources and lead to resource starvation issues on the node.
The kubelet exposes a feature named 'Node Allocatable' that helps to reserve compute resources for system daemons.
```
...
Capacity:
  cpu:                2
  ephemeral-storage:  102685624Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3993764Ki
  pods:               110
Allocatable:
  cpu:                1800m
  ephemeral-storage:  94425355722
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3686564Ki
  pods:               110
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                1255m (69%)   5110m (283%)
  memory             685Mi (19%)   5120400Mi (142227%)
  ephemeral-storage  1Mi (0%)      2Mi (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
```
2.5.1. Kube Reserved
kube-reserved is meant to capture resource reservation for kubernetes system daemons like the kubelet, container runtime, node problem detector, etc. It is not meant to reserve resources for system daemons that are run as pods. kube-reserved is typically a function of pod density on the nodes.
In addition to cpu, memory, and ephemeral-storage, pid may be specified to reserve the specified number of process IDs for kubernetes system daemons.
2.5.2. System Reserved
system-reserved is meant to capture resource reservation for OS system daemons like sshd, udev, etc. system-reserved should reserve memory for the kernel too since kernel memory is not accounted to pods in Kubernetes at this time. Reserving resources for user login sessions is also recommended (user.slice in the systemd world).
In addition to cpu, memory, and ephemeral-storage, pid may be specified to reserve the specified number of process IDs for OS system daemons.
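A hedged sketch of reserving resources for both groups of daemons via the kubelet configuration file (field names follow the KubeletConfiguration API; the amounts are illustrative and should be sized to your nodes):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:              # for Kubernetes system daemons (kubelet, runtime, ...)
  cpu: "200m"
  memory: "500Mi"
  ephemeral-storage: "1Gi"
systemReserved:            # for OS daemons (sshd, udev, kernel, ...)
  cpu: "100m"
  memory: "200Mi"
```

Node Allocatable is then roughly Capacity minus these reservations (and minus eviction thresholds), which is why the Allocatable values in the node description above are lower than Capacity.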
2.5.3. Eviction Thresholds
Memory pressure at the node level leads to System OOMs which affect the entire node and all pods running on it. Nodes can go offline temporarily until memory has been reclaimed. To avoid (or reduce the probability of) system OOMs, the kubelet provides out-of-resource management. Evictions are supported for memory and ephemeral-storage only.
3. Pod Disruptions
Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.
We call these unavoidable cases involuntary disruptions to an application. Examples are:
a hardware failure of the physical machine backing the node
cluster administrator deletes VM (instance) by mistake
cloud provider or hypervisor failure makes VM disappear
a kernel panic
the node disappears from the cluster due to cluster network partition
eviction of a pod due to the node being out-of-resources.
Except for the out-of-resources condition, all these conditions should be familiar to most users; they are not specific to Kubernetes.
We call other cases voluntary disruptions. These include both actions initiated by the application owner and those initiated by a Cluster Administrator. Typical application owner actions include:
deleting the deployment or other controller that manages the pod
updating a deployment’s pod template causing a restart
directly deleting a pod (e.g. by accident)
Cluster administrator actions include:
Draining a node for repair or upgrade.
Draining a node from a cluster to scale the cluster down
Removing a pod from a node to permit something else to fit on that node.
If no voluntary disruptions are enabled for your cluster, you can skip creating Pod Disruption Budgets.
3.1. Pod disruption budgets
Kubernetes offers features to help you run highly available applications even when you introduce frequent voluntary disruptions.
As an application owner, you can create a
PodDisruptionBudget (PDB) for each application. A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions.
Cluster managers and hosting providers should use tools which respect PodDisruptionBudgets by calling the Eviction API (e.g.
kubectl drain) instead of directly deleting pods or deployments.
PDBs cannot prevent involuntary disruptions from occurring, but they do count against the budget.
Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but workload resources (such as
StatefulSet) are not limited by PDBs when doing rolling upgrades. Instead, the handling of failures during application updates is configured in the spec for the specific workload resource.
When a pod is evicted using the eviction API, it is gracefully terminated, honoring the
terminationGracePeriodSeconds setting in its PodSpec.
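As a sketch, a PDB that keeps at least two replicas of a hypothetical application available during voluntary disruptions could look like this (the name and labels are assumptions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb             # hypothetical name
spec:
  minAvailable: 2           # alternatively use maxUnavailable, but not both
  selector:
    matchLabels:
      app: web              # must match the labels of the protected Pods
```

With this PDB in place, `kubectl drain` will evict `app: web` Pods only as long as at least two remain available.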
3.2. Think about how your application reacts to disruptions
Decide how many instances can be down at the same time for a short period due to a voluntary disruption.
Concern: don’t reduce serving capacity by more than 10%.
Solution: use PDB with minAvailable 90% for example.
Single-instance Stateful Application:
Concern: do not terminate this application without talking to me.
Possible Solution 1: Do not use a PDB and tolerate occasional downtime.
Possible Solution 2: Set PDB with maxUnavailable=0. Have an understanding (outside of Kubernetes) that the cluster operator needs to consult you before termination. When the cluster operator contacts you, prepare for downtime, and then delete the PDB to indicate readiness for disruption. Recreate afterwards.
Multiple-instance Stateful application such as Consul, ZooKeeper, or etcd:
Concern: Do not reduce number of instances below quorum, otherwise writes fail.
Possible Solution 1: set maxUnavailable to 1 (works with varying scale of application).
Possible Solution 2: set minAvailable to quorum-size (e.g. 3 when scale is 5). (Allows more disruptions at once).
Restartable Batch Job:
Concern: Job needs to complete in case of voluntary disruption.
Possible solution: Do not create a PDB. The Job controller will create a replacement pod.
4. Scheduling, Preemption and Eviction
In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that the kubelet can run them. Preemption is the process of terminating Pods with lower Priority so that Pods with higher Priority can schedule on Nodes. Eviction is the process of terminating one or more Pods on Nodes.
4.1. Taints and Tolerations
Node affinity is a property of Pods that attracts them to a set of nodes (either as a preference or a hard requirement). Taints are the opposite — they allow a node to repel a set of pods.
Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints.
You add a taint to a node using
kubectl taint. For example,
kubectl taint nodes node1 key1=value1:NoSchedule
places a taint on node node1. The taint has key key1, value value1, and taint effect NoSchedule. This means that no pod will be able to schedule onto node1 unless it has a matching toleration.
To remove the taint added by the command above, you can run:
kubectl taint nodes node1 key1=value1:NoSchedule-
You specify a toleration for a pod in the PodSpec. Both of the following tolerations "match" the taint created by the
kubectl taint line above, and thus a pod with either toleration would be able to schedule onto node1:
```yaml
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
```

```yaml
tolerations:
- key: "key1"
  operator: "Exists"
  effect: "NoSchedule"
```
The default value for operator is Equal.
A toleration "matches" a taint if the keys are the same and the effects are the same, and:
the operator is Exists (in which case no value should be specified), or
the operator is Equal and the values are equal.
There are two special cases:
An empty key with operator Exists matches all keys, values, and effects, which means the toleration tolerates everything.
An empty effect matches all effects with the given key.
The NoExecute taint effect affects pods that are already running on the node as follows:
pods that do not tolerate the taint are evicted immediately
pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever
pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time
The node controller automatically taints a Node when certain conditions are true. The following taints are built in:
node.kubernetes.io/not-ready: Node is not ready. This corresponds to the NodeCondition Ready being "False".
node.kubernetes.io/unreachable: Node is unreachable from the node controller. This corresponds to the NodeCondition Ready being "Unknown".
node.kubernetes.io/memory-pressure: Node has memory pressure.
node.kubernetes.io/disk-pressure: Node has disk pressure.
node.kubernetes.io/pid-pressure: Node has PID pressure.
node.kubernetes.io/network-unavailable: Node’s network is unavailable.
node.kubernetes.io/unschedulable: Node is unschedulable.
node.cloudprovider.kubernetes.io/uninitialized: When the kubelet is started with an "external" cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint.
In case a node is to be evicted, the node controller or the kubelet adds relevant taints with
NoExecute effect. If the fault condition returns to normal the kubelet or node controller can remove the relevant taint(s).
DaemonSet pods are created with NoExecute tolerations for the following taints with no tolerationSeconds:
node.kubernetes.io/unreachable
node.kubernetes.io/not-ready
This ensures that DaemonSet pods are never evicted due to these problems.
4.2. Pod Priority and Preemption
Pods can have priority. Priority indicates the importance of a Pod relative to other Pods. If a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower priority Pods to make scheduling of the pending Pod possible.
To use priority and preemption:
Add one or more PriorityClasses.
Create Pods with priorityClassName set to one of the added PriorityClasses.
A PriorityClass is a non-namespaced object that defines a mapping from a priority class name to the integer value of the priority. The name is specified in the name field of the PriorityClass object’s metadata. The value is specified in the required value field. The higher the value, the higher the priority. The name of a PriorityClass object must be a valid DNS subdomain name, and it cannot be prefixed with system-.
```
$ kubectl get pc
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            60d
system-node-critical      2000001000   false            60d

$ kubectl get pc system-cluster-critical -oyaml
apiVersion: scheduling.k8s.io/v1
description: Used for system critical pods that must run in the cluster, but can
  be moved to another node if necessary.
kind: PriorityClass
metadata:
  creationTimestamp: "2021-09-22T09:29:35Z"
  generation: 1
  name: system-cluster-critical
  resourceVersion: "84"
  uid: ff8cb5f8-d989-4a68-b902-d3b1ed891f9b
preemptionPolicy: PreemptLowerPriority
value: 2000000000
```
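As a sketch of the two steps above, a custom PriorityClass and a Pod that references it might look like this (the class name, value, and image are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority        # hypothetical; must not be prefixed with system-
value: 1000000               # higher value means higher priority
globalDefault: false
description: "For important application pods."
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app        # hypothetical name
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: nginx             # illustrative image
```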
kubelet node-pressure eviction does not evict Pods when their usage does not exceed their requests. If a Pod with lower priority is not exceeding its requests, it won’t be evicted. Another Pod with higher priority that exceeds its requests may be evicted.
4.3. Node-pressure Eviction
Node-pressure eviction is the process by which the kubelet proactively terminates pods to reclaim resources on nodes.
The kubelet monitors resources like CPU, memory, disk space, and filesystem inodes on your cluster’s nodes. When one or more of these resources reach specific consumption levels, the kubelet can proactively fail one or more pods on the node to reclaim resources and prevent starvation.
During a node-pressure eviction, the kubelet sets the
PodPhase for the selected pods to
Failed. This terminates the pods.
Node-pressure eviction is not the same as API-initiated eviction (for example, an eviction requested through the Eviction API by kubectl drain).
The kubelet does not respect your configured
PodDisruptionBudget or the pod’s
terminationGracePeriodSeconds. If you use soft eviction thresholds, the kubelet respects your configured
eviction-max-pod-grace-period. If you use hard eviction thresholds, it uses a
0s grace period for termination.
If the pods are managed by a workload resource (such as StatefulSet or Deployment) that replaces failed pods, the control plane or
kube-controller-manager creates new pods in place of the evicted pods.
|The kubelet attempts to reclaim node-level resources before it terminates end-user pods. For example, it removes unused container images when disk resources are starved.|
Eviction signals are the current state of a particular resource at a specific point in time. Kubelet uses eviction signals to make eviction decisions by comparing the signals to eviction thresholds, which are the minimum amount of the resource that should be available on the node.
Kubelet uses the following eviction signals:
memory.available := node.status.capacity[memory] - node.stats.memory.workingSet
nodefs.available := node.stats.fs.available
nodefs.inodesFree := node.stats.fs.inodesFree
imagefs.available := node.stats.runtime.imagefs.available
imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree
pid.available := node.stats.rlimit.maxpid - node.stats.rlimit.curproc
You can specify custom eviction thresholds for the kubelet to use when it makes eviction decisions.
Eviction thresholds have the form [eviction-signal][operator][quantity], where:
eviction-signal is the eviction signal to use.
operator is the relational operator you want, such as < (less than).
quantity is the eviction threshold amount, such as 1Gi. The value of quantity must match the quantity representation used by Kubernetes. You can use either literal values or percentages (%).
For example, if a node has 10Gi of total memory and you want to trigger eviction if the available memory falls below 1Gi, you can define the eviction threshold as either memory.available<1Gi or memory.available<10%. You cannot use both.
You can configure soft and hard eviction thresholds.
Soft eviction thresholds
A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The kubelet does not evict pods until the grace period is exceeded. The kubelet returns an error on startup if there is no specified grace period.
You can specify both a soft eviction threshold grace period and a maximum allowed pod termination grace period for kubelet to use during evictions. If you specify a maximum allowed grace period and the soft eviction threshold is met, the kubelet uses the lesser of the two grace periods. If you do not specify a maximum allowed grace period, the kubelet kills evicted pods immediately without graceful termination.
You can use the following flags to configure soft eviction thresholds:
eviction-soft: A set of eviction thresholds like memory.available<1.5Gi that can trigger pod eviction if held over the specified grace period.
eviction-soft-grace-period: A set of eviction grace periods like memory.available=1m30s that define how long a soft eviction threshold must hold before triggering a Pod eviction.
eviction-max-pod-grace-period: The maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
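In the kubelet configuration file, the same soft-eviction settings can be sketched as follows (the field names follow the KubeletConfiguration API; the values are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "1.5Gi"
evictionSoftGracePeriod:
  memory.available: "1m30s"   # the threshold must hold this long before eviction
evictionMaxPodGracePeriod: 60 # seconds; caps pod termination grace on soft eviction
```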
Hard eviction thresholds
A hard eviction threshold has no grace period. When a hard eviction threshold is met, the kubelet kills pods immediately without graceful termination to reclaim the starved resource.
You can use the eviction-hard flag to configure a set of hard eviction thresholds like memory.available<1Gi.
The kubelet has the following default hard eviction thresholds:
memory.available<100Mi
nodefs.available<10%
imagefs.available<15%
nodefs.inodesFree<5% (Linux nodes)
4.3.1. Pod selection for kubelet eviction
If the kubelet’s attempts to reclaim node-level resources don’t bring the eviction signal below the threshold, the kubelet begins to evict end-user pods.
The kubelet uses the following parameters to determine pod eviction order:
Whether the pod’s resource usage exceeds requests
Pod Priority
The pod’s resource usage relative to requests
As a result, kubelet ranks and evicts pods in the following order:
BestEffort or Burstable pods where the usage exceeds requests. These pods are evicted based on their Priority and then by how much their usage level exceeds the request.
Guaranteed pods and Burstable pods where the usage is less than requests are evicted last, based on their Priority.
The kubelet does not use the pod’s QoS class to determine the eviction order. You can use the QoS class to estimate the most likely pod eviction order when reclaiming resources like memory. QoS does not apply to ephemeral storage requests, so the above scenario does not apply if the node is, for example, under DiskPressure.
5. Kubernetes Autoscaling
The foundation of building cost-optimized applications is spreading the cost-saving culture across teams. Beyond moving cost discussions to the beginning of the development process, this approach forces you to better understand the environment that your applications are running in—in this context, the GKE environment.
In order to achieve low cost and application stability, you must correctly set or tune some features and configurations (such as autoscaling, machine types, and region selection). Another important consideration is your workload type because, depending on the workload type and your application’s requirements, you must apply different configurations in order to further lower your costs. Finally, you must monitor your spending and create guardrails so that you can enforce best practices early in your development cycle.
Kubernetes has three scalability tools. Two of these, the Horizontal pod autoscaler (HPA) and the Vertical pod autoscaler (VPA), function on the application abstraction layer. The cluster autoscaler (CA) works on the infrastructure layer.
5.1. Horizontal Pod Autoscaler
Horizontal Pod Autoscaler (HPA) is meant for scaling applications that are running in Pods based on metrics that express load. You can configure either CPU utilization or other custom metrics (for example, requests per second). In short, HPA adds and deletes Pods replicas, and it is best suited for stateless workers that can spin up quickly to react to usage spikes, and shut down gracefully to avoid workload instability.
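A minimal sketch of an HPA targeting average CPU utilization for a hypothetical Deployment (the names and numbers are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # buffer below 100% to absorb spikes
```

Setting the target below 100% reserves headroom, which matters because new Pods (and possibly new nodes) take time to become ready during a spike.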
Even if you guarantee that your application can start up in a matter of seconds, this extra time is required when Cluster Autoscaler adds new nodes to your cluster or when Pods are throttled due to lack of resources.
The following are best practices for enabling HPA in your application:
Size your application correctly by setting appropriate resource requests and limits.
Set your target utilization to reserve a buffer that can handle requests during a spike.
Make sure your application starts as quickly as possible and shuts down according to Kubernetes expectations.
Set meaningful readiness and liveness probes.
Make sure that your Metrics Server is always up and running.
Inform clients of your application that they must consider implementing exponential retries for handling transient issues.
Make sure your applications are shutting down according to Kubernetes expectations:
Don’t stop accepting new requests right after SIGTERM.
Your application must not stop immediately, but instead finish all requests that are in flight and still listen to incoming connections that arrive after the Pod termination begins. It might take a while for Kubernetes to update all kube-proxies and load balancers. If your application terminates before these are updated, some requests might cause errors on the client side.
If your application doesn’t follow the preceding practice, use the preStop hook.
Most programs don’t stop accepting requests right away. However, if you’re using third-party code or are managing a system that you don’t have control over, such as nginx, the preStop hook is a good option for triggering a graceful shutdown without modifying the application. One common strategy is to execute, in the preStop hook, a sleep of a few seconds to postpone the SIGTERM. This gives Kubernetes extra time to finish the Pod deletion process, and reduces connection errors on the client side.
If your application must clean up or has an in-memory state that must be persisted before the process terminates, now is the time to do it. Different programming languages have different ways to catch this signal, so find the right way in your language.
Configure terminationGracePeriodSeconds to fit your application needs.
Some applications need more than the default 30 seconds to finish. In this case, you must specify terminationGracePeriodSeconds. High values might increase time for node upgrades or rollouts, for example. Low values might not allow enough time for Kubernetes to finish the Pod termination process. Either way, we recommend that you set your application’s termination period to less than 10 minutes because Cluster Autoscaler honors it for 10 minutes only.
If your application uses container-native load balancing, start failing your readiness probe when you receive a SIGTERM.
This action directly signals load balancers to stop forwarding new requests to the backend Pod. Depending on the race between health check configuration and endpoint programming, the backend Pod might be taken out of traffic earlier.
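The sleep-in-preStop strategy described above can be sketched as follows (the Pod name, 10-second sleep, and grace period are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-graceful          # hypothetical name
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    image: nginx
    lifecycle:
      preStop:
        exec:
          # postpone SIGTERM so kube-proxies and load balancers catch up
          command: ["sh", "-c", "sleep 10"]
```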
5.2. Vertical Pod Autoscaler
Unlike HPA, which adds and deletes Pod replicas for rapidly reacting to usage spikes, Vertical Pod Autoscaler (VPA) observes Pods over time and gradually finds the optimal CPU and memory resources required by the Pods. Setting the right resources is important for stability and cost efficiency. If your Pod resources are too small, your application can either be throttled or it can fail due to out-of-memory errors. If your resources are too large, you have waste and, therefore, larger bills. VPA is meant for stateless and stateful workloads not handled by HPA or when you don’t know the proper Pod resource requests.
VPA can work in three different modes:
Off: In this mode, also known as recommendation mode, VPA does not apply any change to your Pod. The recommendations are calculated and can be inspected in the VPA object.
Initial: VPA assigns resource requests only at Pod creation and never changes them later.
Auto: VPA updates CPU and memory requests during the life of a Pod. That means the Pod is deleted, CPU and memory are adjusted, and then a new Pod is started.
If you plan to use VPA, the best practice is to start with the Off mode for pulling VPA recommendations. Make sure it’s running for 24 hours, ideally one week or more, before pulling recommendations. Then, only when you feel confident, consider switching to either Initial or Auto mode.
Follow these best practices for enabling VPA, either in Initial or Auto mode, in your application:
Don’t use VPA in either Initial or Auto mode if you need to handle sudden spikes in traffic. Use HPA instead.
Make sure your application can grow vertically. Set minimum and maximum container sizes in the VPA objects to avoid the autoscaler making significant changes when your application is not receiving traffic.
Don’t make abrupt changes, such as dropping the Pod’s replicas from 30 to 5 all at once. This kind of change requires a new deployment, new label set, and new VPA object.
Make sure your application starts as quickly as possible and shuts down according to Kubernetes expectations.
Set meaningful readiness and liveness probes.
Make sure that your Metrics Server is always up and running.
Inform clients of your application that they must consider implementing exponential retries for handling transient issues.
Consider using node auto-provisioning along with VPA so that if a Pod grows too large to fit into existing machine types, Cluster Autoscaler provisions larger machines to fit the new Pod.
If you are considering using Auto mode, make sure you also follow these practices:
Make sure your application can be restarted while receiving traffic.
Add Pod Disruption Budget (PDB) to control how many Pods can be taken down at the same time.
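The practices above can be sketched as a VPA object (using the `autoscaling.k8s.io/v1` API of the VPA add-on) that starts in recommendation mode and bounds container sizes; the workload name and the minimum and maximum values are illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa          # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # hypothetical workload
  updatePolicy:
    updateMode: "Off"       # recommendation mode; switch to Initial or Auto later
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:           # illustrative bounds to keep VPA from making
        cpu: 100m           # drastic changes while the app is idle
        memory: 128Mi
      maxAllowed:
        cpu: "2"
        memory: 2Gi
```

The calculated recommendations can then be inspected with `kubectl describe vpa my-app-vpa` before changing `updateMode`.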
5.3. Cluster Autoscaler
Cluster Autoscaler (CA) automatically resizes the underlying computer infrastructure. CA provides nodes for Pods that don’t have a place to run in the cluster and removes under-utilized nodes. CA is optimized for the cost of infrastructure. In other words, if there are two or more node types in the cluster, CA chooses the least expensive one that fits the given demand.
Unlike HPA and VPA, CA doesn’t depend on load metrics. Instead, it’s based on scheduling simulation and declared Pod requests. It’s a best practice to enable CA whenever you are using either HPA or VPA. This practice ensures that if your Pod autoscalers determine that you need more capacity, your underlying infrastructure grows accordingly.
As these diagrams show, CA automatically adds and removes compute capacity to handle traffic spikes and save you money when your customers are sleeping. It is a best practice to define Pod Disruption Budget (PDB) for all your applications. It is particularly important at the CA scale-down phase when PDB controls the number of replicas that can be taken down at one time.
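Such a PDB can be a minimal object; the name, label, and threshold below are illustrative:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb        # hypothetical name
spec:
  minAvailable: 80%       # keep at least 80% of replicas up during voluntary disruptions
  selector:
    matchLabels:
      app: my-app         # hypothetical label on the application's Pods
```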
Certain Pods cannot be restarted by any autoscaler because doing so would cause some temporary disruption, so the node they run on can’t be deleted. For example, system Pods (such as kube-dns) and Pods using local storage won’t be restarted. However, you can change this behavior by defining PDBs for these system Pods and by setting the "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" annotation for Pods using local storage that are safe for the autoscaler to restart. Moreover, consider running long-lived Pods that can’t be restarted on a separate node pool, so they don’t block scale-down of other nodes. Finally, learn how to analyze CA events in the logs to understand why a particular scaling activity didn’t happen as expected.
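For example, the annotation can be set in the Pod template of a workload that uses local (emptyDir) storage but is safe to restart; the workload name and image are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scratch-worker           # hypothetical workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scratch-worker
  template:
    metadata:
      labels:
        app: scratch-worker
      annotations:
        # Tell Cluster Autoscaler this Pod may be evicted during scale-down
        # even though it uses local storage.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
      - name: worker
        image: example.com/scratch-worker:1.0   # hypothetical image
        volumeMounts:
        - name: scratch
          mountPath: /tmp/scratch
      volumes:
      - name: scratch
        emptyDir: {}
```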
The following is a summary of the best practices for enabling Cluster Autoscaler in your cluster:
Use either HPA or VPA to autoscale your workloads.
Make sure you are following the best practices described in the chosen Pod autoscaler.
Size your application correctly by setting appropriate resource requests and limits or use VPA.
Define a PDB for your applications.
Define PDB for system Pods that might block your scale-down. For example, kube-dns. To avoid temporary disruption in your cluster, don’t set PDB for system Pods that have only 1 replica (such as metrics-server).
Run short-lived Pods and Pods that can be restarted in separate node pools, so that long-lived Pods don’t block their scale-down.
Avoid over-provisioning by configuring idle nodes in your cluster. For that, you must know your minimum capacity—for many companies it’s during the night—and set the minimum number of nodes in your node pools to support that capacity.
If you need extra capacity to handle requests during spikes, use pause Pods, which are discussed in (Autoscaler and over-provisioning).
However, as noted in the Horizontal Pod Autoscaler section, scale-ups might take some time due to infrastructure provisioning. To visualize this difference in time and possible scale-up scenarios, consider the following image.
When your cluster has enough room for deploying new Pods, one of the Workload scale-up scenarios is triggered. This means that if an existing node has never deployed your application, it must download the container images before starting the Pod (scenario 1). However, if the same node must start a new Pod replica of your application, the total scale-up time decreases because no image download is required (scenario 2).
When your cluster doesn’t have enough room for deploying new Pods, one of the Infrastructure and Workload scale-up scenarios is triggered. This means that Cluster Autoscaler must provision new nodes and start the required software on them before scheduling your application (scenario 1). If you use node auto-provisioning, depending on the workload scheduled, new node pools might be required. In this situation, the total scale-up time increases because Cluster Autoscaler has to provision both nodes and node pools (scenario 2).
For scenarios where new infrastructure is required, don’t squeeze your cluster too much. That is, you must over-provision, but only enough to reserve the buffer needed to handle the expected peak requests during scale-ups.
There are two main strategies for this kind of over-provisioning:
Fine-tune the HPA utilization target. The following equation is a simple and safe way to find a good CPU target:
(1 - buff)/(1 + perc)
buff is a safety buffer that you can set to avoid reaching 100% CPU. This variable is useful because reaching 100% CPU means that the latency of request processing is much higher than usual.
perc is the percentage of traffic growth you expect in two or three minutes.
For example, if you expect a growth of 30% in your requests and you want to avoid reaching 100% of CPU by defining a 10% safety buffer, your formula would look like this:
(1 - 0.1)/(1 + 0.3) = 0.69
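The computed target can then be applied directly in an HPA object; the workload name and replica bounds below are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-hpa        # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend          # hypothetical workload
  minReplicas: 3            # illustrative bounds
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 69   # (1 - 0.1)/(1 + 0.3), rounded to a percentage
```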
Configure pause Pods. There is no way to configure Cluster Autoscaler to spin up nodes upfront. Instead, you can set an HPA utilization target to provide a buffer to help handle spikes in load. However, if you expect large bursts, setting a small HPA utilization target might not be enough or might become too expensive.
An alternative solution for this problem is to use pause Pods. Pause Pods are low-priority deployments that do nothing but reserve room in your cluster. Whenever a high-priority Pod is scheduled, pause Pods get evicted and the high-priority Pod immediately takes their place. The evicted pause Pods are then rescheduled, and if there is no room in the cluster, Cluster Autoscaler spins up new nodes for fitting them. It’s a best practice to have only a single pause Pod per node. For example, if you are using 4 CPU nodes, configure the pause Pods' CPU request with around 3200m.
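The pause Pod pattern can be sketched as follows. The names, replica count, and memory request are illustrative; the negative priority value ensures any regular Pod can preempt the placeholders:

```yaml
# Low-priority class so that any regular Pod preempts the placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning         # hypothetical name
value: -10
globalDefault: false
description: "Priority class for pause Pods that only reserve capacity."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-pause   # hypothetical name
spec:
  replicas: 2                    # illustrative: one pause Pod per spare node
  selector:
    matchLabels:
      app: overprovisioning-pause
  template:
    metadata:
      labels:
        app: overprovisioning-pause
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 3200m           # sized so a single pause Pod fills a 4-CPU node
            memory: 500Mi        # illustrative value
```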