1. Kubernetes Components

A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. Every cluster has at least one worker node.

The worker node(s) host the Pods that are the components of the application workload. The control plane manages the worker nodes and the Pods in the cluster. In production environments, the control plane usually runs across multiple computers and a cluster usually runs multiple nodes, providing fault-tolerance and high availability.

Figure: The components of a Kubernetes cluster

2. Resources for Containers

When you specify a Pod, you can optionally specify how much of each resource a Container needs. The most common resources to specify are CPU and memory (RAM); there are others.

When you specify the resource request for Containers in a Pod, the scheduler uses this information to decide which node to place the Pod on. When you specify a resource limit for a Container, the kubelet enforces those limits so that the running container is not allowed to use more of that resource than the limit you set. The kubelet also reserves at least the request amount of that system resource specifically for that container to use.

2.1. Resource requests and limits

If the node where a Pod is running has enough of a resource available, it’s possible (and allowed) for a container to use more resource than its request for that resource specifies. However, a container is not allowed to use more than its resource limit.

CPU and memory are each a resource type. A resource type has a base unit. CPU represents compute processing and is specified in units of Kubernetes CPUs. Memory is specified in units of bytes. Huge pages are a Linux-specific feature where the node kernel allocates blocks of memory that are much larger than the default page size.

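Each Container of a Pod can specify one or more of the following:

  • spec.containers[].resources.limits.cpu

  • spec.containers[].resources.limits.memory

  • spec.containers[].resources.limits.hugepages-<size>

  • spec.containers[].resources.requests.cpu

  • spec.containers[].resources.requests.memory

  • spec.containers[].resources.requests.hugepages-<size>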

Although requests and limits can only be specified on individual Containers, it is convenient to talk about Pod resource requests and limits. A Pod resource request/limit for a particular resource type is the sum of the resource requests/limits of that type for each Container in the Pod.
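
The summation described above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the helper name pod_total are assumptions made for this example (they are not Kubernetes API objects), and CPU values are assumed to be already expressed in millicpu and memory in bytes.

```python
def pod_total(containers, kind, resource):
    """Sum one resource over all containers of a Pod.

    kind is "requests" or "limits"; a container that does not set the
    value contributes 0 in this simplified sketch.
    """
    return sum(c.get(kind, {}).get(resource, 0) for c in containers)


# Hypothetical two-container Pod; CPU in millicpu, memory in bytes.
containers = [
    {"requests": {"cpu": 250, "memory": 64 * 2**20},
     "limits": {"cpu": 500, "memory": 128 * 2**20}},
    {"requests": {"cpu": 250, "memory": 64 * 2**20},
     "limits": {"cpu": 500, "memory": 128 * 2**20}},
]

print(pod_total(containers, "requests", "cpu"))    # 500 (millicpu)
print(pod_total(containers, "limits", "memory"))   # 268435456 (256Mi)
```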

  • Meaning of CPU

    Limits and requests for CPU resources are measured in cpu units. One cpu, in Kubernetes, is equivalent to 1 vCPU/Core for cloud providers and 1 hyperthread on bare-metal Intel processors.

    Fractional requests are allowed. When you define a container with spec.containers[].resources.requests.cpu set to 0.5, you are requesting half as much CPU time compared to if you asked for 1.0 CPU. For CPU resource units, the expression 0.1 is equivalent to the expression 100m, which can be read as "one hundred millicpu". Some people say "one hundred millicores", and this is understood to mean the same thing. A request with a decimal point, like 0.1, is converted to 100m by the API, and precision finer than 1m is not allowed. For this reason, the form 100m might be preferred.

    CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.

    CPU is considered a “compressible” resource. If your app starts hitting your CPU limits, Kubernetes starts throttling your container. This means the CPU will be artificially restricted, giving your app potentially worse performance! However, it won’t be terminated or evicted. You can use a liveness health check to make sure performance has not been impacted.

  • Meaning of memory

    Limits and requests for memory are measured in bytes. You can express memory as a plain integer or as a fixed-point number using one of these suffixes: E, P, T, G, M, k. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.

    Unlike CPU resources, memory cannot be compressed. Because there is no way to throttle memory usage, if a container goes past its memory limit it will be terminated. If your pod is managed by a Deployment, StatefulSet, DaemonSet, or another type of controller, then the controller spins up a replacement.
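
To make the unit rules above concrete, here is a small Python sketch that converts CPU and memory quantity strings into integers. It is not the parser Kubernetes itself uses; the function names cpu_to_millicpu and memory_to_bytes are made up for this example, and only the suffixes listed above are handled.

```python
from decimal import Decimal

# Power-of-ten and power-of-two suffixes for memory quantities.
_SUFFIXES = {
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6, "G": Decimal(10) ** 9,
    "T": Decimal(10) ** 12, "P": Decimal(10) ** 15, "E": Decimal(10) ** 18,
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20, "Gi": Decimal(2) ** 30,
    "Ti": Decimal(2) ** 40, "Pi": Decimal(2) ** 50, "Ei": Decimal(2) ** 60,
}


def cpu_to_millicpu(quantity: str) -> int:
    """Convert a CPU quantity such as "0.5" or "100m" to whole millicpu."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    milli = Decimal(quantity) * 1000
    if milli != milli.to_integral_value():
        raise ValueError(f"precision finer than 1m is not allowed: {quantity}")
    return int(milli)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "128Mi" or "1G" to bytes."""
    # Check two-character suffixes first so "Mi" is not read as "M".
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(Decimal(quantity))


print(cpu_to_millicpu("0.1"), cpu_to_millicpu("100m"))   # 100 100
print(memory_to_bytes("128Mi"), memory_to_bytes("1G"))   # 134217728 1000000000
```

Decimal arithmetic is used instead of floats so that a value like 0.1 converts to exactly 100m.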

2.2. Local ephemeral storage

FEATURE STATE: Kubernetes v1.10 [beta]

Nodes have local ephemeral storage, backed by locally-attached writeable devices or, sometimes, by RAM. "Ephemeral" means that there is no long-term guarantee about durability.

Pods use ephemeral local storage for scratch space, caching, and for logs. The kubelet can provide scratch space to Pods using local ephemeral storage to mount emptyDir volumes into containers.

The kubelet also uses this kind of storage to hold node-level container logs, container images, and the writable layers of running containers.

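You can use ephemeral-storage for managing local ephemeral storage. Each Container of a Pod can specify one or more of the following:

  • spec.containers[].resources.limits.ephemeral-storage

  • spec.containers[].resources.requests.ephemeral-storage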

Limits and requests for ephemeral-storage are measured in bytes. You can express storage as a plain integer or as a fixed-point number using one of these suffixes: E, P, T, G, M, k. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.

If the kubelet is managing local ephemeral storage as a resource, then the kubelet measures storage use in:

  • emptyDir volumes, except tmpfs emptyDir volumes

  • directories holding node-level logs

  • writeable container layers

If a Pod is using more ephemeral storage than you allow it to, the kubelet sets an eviction signal that triggers Pod eviction.

2.3. The lifecycle of a Kubernetes Pod

When a Pod is scheduled, Kubernetes picks a candidate Node and checks whether it has enough resources to fulfill the resource requests of the Pod’s containers. If it doesn’t, Kubernetes moves on to the next Node.

If none of the Nodes in the system have resources left to fill the requests, then Pods go into a pending state. By using GKE features such as the Node Autoscaler, Kubernetes Engine can automatically detect this state and create more Nodes automatically. If there is excess capacity, the autoscaler can also scale down and remove Nodes to save you money!

But what about limits? As you know, limits can be higher than the requests. What if you have a Node where the sum of all the container Limits is actually higher than the resources available on the machine?

At this point, Kubernetes goes into something called an overcommitted state. Here is where things get interesting. Because CPU can be compressed, Kubernetes will make sure your containers get the CPU they requested and will throttle the rest. Memory cannot be compressed, so Kubernetes needs to start making decisions on what containers to terminate if the Node runs out of memory.
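
The following Python sketch mirrors the reasoning above: placement looks only at requests, while overcommitment is a statement about limits. It is a deliberately simplified model, not the real scheduler; the names fits, pick_node and is_overcommitted and the node figures are assumptions for illustration, with all values in millicpu.

```python
def fits(allocatable, requests_in_use, pod_request):
    """True if the Pod's request still fits next to the existing requests."""
    return requests_in_use + pod_request <= allocatable


def pick_node(nodes, pod_request):
    """Return the first node with room for the request, or None (Pending)."""
    for name, (allocatable, requests_in_use) in nodes.items():
        if fits(allocatable, requests_in_use, pod_request):
            return name
    return None


def is_overcommitted(capacity, limits_in_use):
    """A node is overcommitted when the sum of limits exceeds its capacity."""
    return limits_in_use > capacity


# Hypothetical nodes: name -> (allocatable CPU, CPU already requested).
nodes = {"node-a": (1800, 1700), "node-b": (1800, 1255)}

print(pick_node(nodes, 200))          # node-b
print(pick_node(nodes, 900))          # None, the Pod stays Pending
print(is_overcommitted(2000, 5110))   # True
```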

2.4. Quality of Service for Pods

In an overcommitted system (where sum of limits > machine capacity) containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: Guaranteed, Burstable, and BestEffort, in decreasing order of priority.

For a Pod to be given a QoS class of Guaranteed:

  • Every Container in the Pod must have a memory limit and a memory request.

  • For every Container in the Pod, the memory limit must equal the memory request.

  • Every Container in the Pod must have a CPU limit and a CPU request.

  • For every Container in the Pod, the CPU limit must equal the CPU request.

These restrictions apply to init containers and app containers equally.

If a Container specifies its own memory limit, but does not specify a memory request, Kubernetes automatically assigns a memory request that matches the limit. Similarly, if a Container specifies its own CPU limit, but does not specify a CPU request, Kubernetes automatically assigns a CPU request that matches the limit.

A Pod is given a QoS class of Burstable if:

  • The Pod does not meet the criteria for QoS class Guaranteed.

  • At least one Container in the Pod has a memory or CPU request.

For a Pod to be given a QoS class of BestEffort, the Containers in the Pod must not have any memory or CPU limits or requests.
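
The classification rules above can be written down as a short Python function. This is a sketch of the rules as stated here, not the kubelet's code: the container dictionaries and the name qos_class are assumptions for this example, and quantities are compared as plain strings, so "0.5" and "500m" would not be treated as equal.

```python
RESOURCES = ("cpu", "memory")


def _with_defaults(container):
    """Return (requests, limits) for CPU and memory with request defaulting."""
    limits = {r: v for r, v in container.get("limits", {}).items() if r in RESOURCES}
    requests = {r: v for r, v in container.get("requests", {}).items() if r in RESOURCES}
    for resource in RESOURCES:
        # A limit without a request implies a request equal to the limit.
        if resource in limits and resource not in requests:
            requests[resource] = limits[resource]
    return requests, limits


def qos_class(containers):
    """Classify a Pod as Guaranteed, Burstable, or BestEffort."""
    resolved = [_with_defaults(c) for c in containers]
    if all(not requests and not limits for requests, limits in resolved):
        return "BestEffort"
    guaranteed = all(
        resource in limits and requests.get(resource) == limits[resource]
        for requests, limits in resolved
        for resource in RESOURCES
    )
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"limits": {"cpu": "500m", "memory": "128Mi"}}]))   # Guaranteed
print(qos_class([{"requests": {"memory": "64Mi"}}]))                 # Burstable
print(qos_class([{}]))                                               # BestEffort
```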

Pods will not be killed if CPU guarantees cannot be met (for example, if system tasks or daemons take up lots of CPU); instead, they will be temporarily throttled.

Memory is an incompressible resource, so the semantics of memory management deserve a closer look:

  • BestEffort pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory. These containers can use any amount of free memory in the node though.

  • Guaranteed pods are considered top-priority and are guaranteed to not be killed until they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.

  • Burstable pods have some form of minimal resource guarantee, but can use more resources when available. Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no BestEffort pods exist.

2.5. Node Allocatable

Kubernetes nodes can be scheduled to Capacity. Pods can consume all the available capacity on a node by default. This is an issue because nodes typically run quite a few system daemons that power the OS and Kubernetes itself. Unless resources are set aside for these system daemons, pods and system daemons compete for resources, which leads to resource starvation on the node.

The kubelet exposes a feature named 'Node Allocatable' that helps to reserve compute resources for system daemons.

node capacity

Capacity:
  cpu:                2
  ephemeral-storage:  102685624Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3993764Ki
  pods:               110
Allocatable:
  cpu:                1800m
  ephemeral-storage:  94425355722
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3686564Ki
  pods:               110


Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                1255m (69%)  5110m (283%)
  memory             685Mi (19%)  5120400Mi (142227%)
  ephemeral-storage  1Mi (0%)     2Mi (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)

2.5.1. Kube Reserved

  • Kubelet Flag: --kube-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]

  • Kubelet Flag: --kube-reserved-cgroup=

kube-reserved is meant to capture resource reservation for Kubernetes system daemons like the kubelet, container runtime, node problem detector, etc. It is not meant to reserve resources for system daemons that are run as pods. kube-reserved is typically a function of pod density on the nodes.

In addition to cpu, memory, and ephemeral-storage, pid may be specified to reserve the specified number of process IDs for Kubernetes system daemons.

2.5.2. System Reserved

  • Kubelet Flag: --system-reserved=[cpu=100m][,][memory=100Mi][,][ephemeral-storage=1Gi][,][pid=1000]

  • Kubelet Flag: --system-reserved-cgroup=

system-reserved is meant to capture resource reservation for OS system daemons like sshd, udev, etc. system-reserved should reserve memory for the kernel too since kernel memory is not accounted to pods in Kubernetes at this time. Reserving resources for user login sessions is also recommended (user.slice in systemd world).

In addition to cpu, memory, and ephemeral-storage, pid may be specified to reserve the specified number of process IDs for OS system daemons.

2.5.3. Eviction Thresholds

  • Kubelet Flag: --eviction-hard=[memory.available<500Mi]

Memory pressure at the node level leads to system OOMs, which affect the entire node and all pods running on it. Nodes can go offline temporarily until memory has been reclaimed. To avoid (or reduce the probability of) system OOMs, the kubelet provides out-of-resource management. Evictions are supported for memory and ephemeral-storage only.
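
As a worked example of how these reservations add up, the Python sketch below subtracts kube-reserved, system-reserved and the hard eviction threshold from node capacity. The reservation values used (100m CPU and 100Mi memory for each of kube-reserved and system-reserved, plus a 100Mi memory eviction threshold) are assumptions: they are one combination that happens to reproduce the 1800m and 3686564Ki figures in the node output above, not settings confirmed for that node. The function name allocatable is also made up for this example.

```python
KI = 1024
MI = 1024 * KI


def allocatable(capacity, kube_reserved=0, system_reserved=0, eviction_hard=0):
    """Capacity left for Pods after reservations and the eviction threshold."""
    return capacity - kube_reserved - system_reserved - eviction_hard


# Memory in bytes: capacity 3993764Ki, three assumed 100Mi deductions.
memory = allocatable(3993764 * KI, 100 * MI, 100 * MI, 100 * MI)
print(memory // KI)   # 3686564 (Ki)

# CPU in millicpu: no eviction threshold applies to CPU.
cpu = allocatable(2000, 100, 100)
print(cpu)            # 1800 (millicpu)
```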

3. Pod Disruptions

Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.

We call these unavoidable cases involuntary disruptions to an application. Examples are:

  • a hardware failure of the physical machine backing the node

  • cluster administrator deletes VM (instance) by mistake

  • cloud provider or hypervisor failure makes VM disappear

  • a kernel panic

  • the node disappears from the cluster due to cluster network partition

  • eviction of a pod due to the node being out-of-resources.

Except for the out-of-resources condition, all these conditions should be familiar to most users; they are not specific to Kubernetes.

We call other cases voluntary disruptions. These include both actions initiated by the application owner and those initiated by a Cluster Administrator. Typical application owner actions include:

  • deleting the deployment or other controller that manages the pod

  • updating a deployment’s pod template causing a restart

  • directly deleting a pod (e.g. by accident)

Cluster administrator actions include:

  • draining a node for repair or upgrade

  • draining a node from a cluster to scale the cluster down

  • removing a pod from a node to permit something else to fit on that node

If no voluntary disruptions are enabled for your cluster, you can skip creating Pod Disruption Budgets.

3.1. Pod disruption budgets

Kubernetes offers features to help you run highly available applications even when you introduce frequent voluntary disruptions.

As an application owner, you can create a PodDisruptionBudget (PDB) for each application. A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions.

Cluster managers and hosting providers should use tools which respect PodDisruptionBudgets by calling the Eviction API (e.g. kubectl drain) instead of directly deleting pods or deployments.

PDBs cannot prevent involuntary disruptions from occurring, but involuntary disruptions do count against the budget.

Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but workload resources (such as Deployment and StatefulSet) are not limited by PDBs when doing rolling upgrades. Instead, the handling of failures during application updates is configured in the spec for the specific workload resource.

When a pod is evicted using the eviction API, it is gracefully terminated, honoring the terminationGracePeriodSeconds setting in its PodSpec.

3.2. Think about how your application reacts to disruptions

Decide how many instances can be down at the same time for a short period due to a voluntary disruption.

  • Stateless frontends:

    • Concern: don’t reduce serving capacity by more than 10%.

      • Solution: use PDB with minAvailable 90% for example.

  • Single-instance Stateful Application:

    • Concern: do not terminate this application without talking to me.

      • Possible Solution 1: Do not use a PDB and tolerate occasional downtime.

      • Possible Solution 2: Set PDB with maxUnavailable=0. Have an understanding (outside of Kubernetes) that the cluster operator needs to consult you before termination. When the cluster operator contacts you, prepare for downtime, and then delete the PDB to indicate readiness for disruption. Recreate afterwards.

  • Multiple-instance Stateful application such as Consul, ZooKeeper, or etcd:

    • Concern: Do not reduce number of instances below quorum, otherwise writes fail.

      • Possible Solution 1: set maxUnavailable to 1 (works with varying scale of application).

      • Possible Solution 2: set minAvailable to quorum-size (e.g. 3 when scale is 5). (Allows more disruptions at once).

  • Restartable Batch Job:

    • Concern: Job needs to complete in case of voluntary disruption.

      • Possible solution: Do not create a PDB. The Job controller will create a replacement pod.
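
The quorum example above comes down to simple arithmetic, sketched here in Python. The function names are hypothetical and only integer budgets are modelled (percentages are left out); this is an illustration of the bookkeeping, not the disruption controller's implementation.

```python
def quorum_size(replicas):
    """Smallest majority of a replicated group, e.g. 3 out of 5."""
    return replicas // 2 + 1


def allowed_with_min_available(healthy, min_available):
    """Voluntary disruptions permitted right now under minAvailable."""
    return max(0, healthy - min_available)


def allowed_with_max_unavailable(desired, healthy, max_unavailable):
    """Voluntary disruptions permitted right now under maxUnavailable."""
    already_unavailable = desired - healthy
    return max(0, max_unavailable - already_unavailable)


scale = 5
print(quorum_size(scale))                                     # 3
print(allowed_with_min_available(scale, quorum_size(scale)))  # 2
print(allowed_with_max_unavailable(scale, scale, 1))          # 1
print(allowed_with_max_unavailable(1, 1, 0))                  # 0
```

With five healthy replicas, a minAvailable equal to the quorum size permits two simultaneous disruptions, whereas maxUnavailable set to 1 permits only one; a maxUnavailable of 0 blocks voluntary evictions entirely.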

4. Scheduling, Preemption and Eviction

In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that the kubelet can run them. Preemption is the process of terminating Pods with lower Priority so that Pods with higher Priority can schedule on Nodes. Eviction is the process of terminating one or more Pods on Nodes.

4.1. Taints and Tolerations

Node affinity is a property of Pods that attracts them to a set of nodes (either as a preference or a hard requirement). Taints are the opposite — they allow a node to repel a set of pods.

Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.

Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints.

You add a taint to a node using kubectl taint. For example,

kubectl taint nodes node1 key1=value1:NoSchedule

places a taint on node node1. The taint has key key1, value value1, and taint effect NoSchedule. This means that no pod will be able to schedule onto node1 unless it has a matching toleration.

To remove the taint added by the command above, you can run:

kubectl taint nodes node1 key1=value1:NoSchedule-

You specify a toleration for a pod in the PodSpec. Both of the following tolerations "match" the taint created by the kubectl taint line above, and thus a pod with either toleration would be able to schedule onto node1:

- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Exists"
  effect: "NoSchedule"

The default value for operator is Equal.

A toleration "matches" a taint if the keys are the same and the effects are the same, and:

  • the operator is Exists (in which case no value should be specified), or

  • the operator is Equal and the values are equal.

There are two special cases:

  • An empty key with operator Exists matches all keys, values and effects which means this will tolerate everything.

  • An empty effect matches all effects with key key1.
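
The matching rules and the two special cases above can be expressed as a small Python predicate. This is a sketch that follows the rules as listed here, using plain dictionaries in place of API objects; the function name tolerates is an assumption for this example.

```python
def tolerates(toleration, taint):
    """Return True if the toleration matches the taint."""
    operator = toleration.get("operator", "Equal")  # Equal is the default
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")

    if key:
        if key != taint["key"]:
            return False
    elif operator != "Exists":
        # An empty key is only meaningful together with Exists.
        return False

    # An empty effect matches every effect.
    if effect and effect != taint["effect"]:
        return False

    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


taint = {"key": "key1", "value": "value1", "effect": "NoSchedule"}

print(tolerates({"key": "key1", "operator": "Equal",
                 "value": "value1", "effect": "NoSchedule"}, taint))  # True
print(tolerates({"key": "key1", "operator": "Exists",
                 "effect": "NoSchedule"}, taint))                     # True
print(tolerates({"operator": "Exists"}, taint))                       # True
print(tolerates({"key": "key1", "value": "other",
                 "effect": "NoSchedule"}, taint))                     # False
```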

The NoExecute taint effect affects pods that are already running on the node as follows:

  • pods that do not tolerate the taint are evicted immediately

  • pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever

  • pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time

The node controller automatically taints a Node when certain conditions are true. The following taints are built in:

  • node.kubernetes.io/not-ready:

    Node is not ready. This corresponds to the NodeCondition Ready being "False".

  • node.kubernetes.io/unreachable:

    Node is unreachable from the node controller. This corresponds to the NodeCondition Ready being "Unknown".

  • node.kubernetes.io/memory-pressure:

    Node has memory pressure.

  • node.kubernetes.io/disk-pressure:

    Node has disk pressure.

  • node.kubernetes.io/pid-pressure:

    Node has PID pressure.

  • node.kubernetes.io/network-unavailable:

    Node’s network is unavailable.

  • node.kubernetes.io/unschedulable:

    Node is unschedulable.

  • node.cloudprovider.kubernetes.io/uninitialized:

    When the kubelet is started with "external" cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint.

In case a node is to be drained, the node controller or the kubelet adds relevant taints with NoExecute effect. If the fault condition returns to normal, the kubelet or node controller can remove the relevant taint(s).

DaemonSet pods are created with NoExecute tolerations for the following taints with no tolerationSeconds:

  • node.kubernetes.io/unreachable

  • node.kubernetes.io/not-ready

This ensures that DaemonSet pods are never evicted due to these problems.

4.2. Pod Priority and Preemption

Pods can have priority. Priority indicates the importance of a Pod relative to other Pods. If a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower priority Pods to make scheduling of the pending Pod possible.

To use priority and preemption:

  • Add one or more PriorityClasses.

  • Create Pods with priorityClassName set to one of the added PriorityClasses.

A PriorityClass is a non-namespaced object that defines a mapping from a priority class name to the integer value of the priority. The name is specified in the name field of the PriorityClass object’s metadata. The value is specified in the required value field. The higher the value, the higher the priority. The name of a PriorityClass object must be a valid DNS subdomain name, and it cannot be prefixed with system-.

$ kubectl get pc
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            60d
system-node-critical      2000001000   false            60d

$ kubectl get pc system-cluster-critical -oyaml
apiVersion: scheduling.k8s.io/v1
description: Used for system critical pods that must run in the cluster, but can be
  moved to another node if necessary.
kind: PriorityClass
metadata:
  creationTimestamp: "2021-09-22T09:29:35Z"
  generation: 1
  name: system-cluster-critical
  resourceVersion: "84"
  uid: ff8cb5f8-d989-4a68-b902-d3b1ed891f9b
preemptionPolicy: PreemptLowerPriority
value: 2000000000

kubelet node-pressure eviction does not evict Pods when their usage does not exceed their requests. If a Pod with lower priority is not exceeding its requests, it won’t be evicted. Another Pod with higher priority that exceeds its requests may be evicted.

4.3. Node-pressure Eviction

Node-pressure eviction is the process by which the kubelet proactively terminates pods to reclaim resources on nodes.

The kubelet monitors resources like memory, disk space, and filesystem inodes on your cluster’s nodes. When one or more of these resources reach specific consumption levels, the kubelet can proactively fail one or more pods on the node to reclaim resources and prevent starvation.

During a node-pressure eviction, the kubelet sets the PodPhase for the selected pods to Failed. This terminates the pods.

Node-pressure eviction is not the same as API-initiated eviction (e.g. kubectl drain).

The kubelet does not respect your configured PodDisruptionBudget or the pod’s terminationGracePeriodSeconds. If you use soft eviction thresholds, the kubelet respects your configured eviction-max-pod-grace-period. If you use hard eviction thresholds, it uses a 0s grace period for termination.

If the pods are managed by a workload resource (such as StatefulSet or Deployment) that replaces failed pods, the control plane or kube-controller-manager creates new pods in place of the evicted pods.

The kubelet attempts to reclaim node-level resources before it terminates end-user pods. For example, it removes unused container images when disk resources are starved.

  • Eviction signals

    Eviction signals are the current state of a particular resource at a specific point in time. Kubelet uses eviction signals to make eviction decisions by comparing the signals to eviction thresholds, which are the minimum amount of the resource that should be available on the node.

    Kubelet uses the following eviction signals:

    Eviction Signal       Description
    ---------------       -----------
    memory.available      memory.available := node.status.capacity[memory] - node.stats.memory.workingSet
    nodefs.available      nodefs.available := node.stats.fs.available
    nodefs.inodesFree     nodefs.inodesFree := node.stats.fs.inodesFree
    imagefs.available     imagefs.available := node.stats.runtime.imagefs.available
    imagefs.inodesFree    imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree
    pid.available         pid.available := node.stats.rlimit.maxpid - node.stats.rlimit.curproc

  • Eviction thresholds

    You can specify custom eviction thresholds for the kubelet to use when it makes eviction decisions.

    Eviction thresholds have the form [eviction-signal][operator][quantity], where:

    • eviction-signal is the eviction signal to use.

    • operator is the relational operator you want, such as < (less than).

    • quantity is the eviction threshold amount, such as 1Gi. The value of quantity must match the quantity representation used by Kubernetes. You can use either literal values or percentages (%).

    For example, if a node has 10Gi of total memory and you want to trigger eviction if the available memory falls below 1Gi, you can define the eviction threshold as either memory.available<10% or memory.available<1Gi. You cannot use both.

    You can configure soft and hard eviction thresholds.

    • Soft eviction thresholds

      A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The kubelet does not evict pods until the grace period is exceeded. The kubelet returns an error on startup if there is no specified grace period.

      You can specify both a soft eviction threshold grace period and a maximum allowed pod termination grace period for kubelet to use during evictions. If you specify a maximum allowed grace period and the soft eviction threshold is met, the kubelet uses the lesser of the two grace periods. If you do not specify a maximum allowed grace period, the kubelet kills evicted pods immediately without graceful termination.

      You can use the following flags to configure soft eviction thresholds:

      • eviction-soft: A set of eviction thresholds like memory.available<1.5Gi that can trigger pod eviction if held over the specified grace period.

      • eviction-soft-grace-period: A set of eviction grace periods like memory.available=1m30s that define how long a soft eviction threshold must hold before triggering a Pod eviction.

      • eviction-max-pod-grace-period: The maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.

    • Hard eviction thresholds

      A hard eviction threshold has no grace period. When a hard eviction threshold is met, the kubelet kills pods immediately without graceful termination to reclaim the starved resource.

      You can use the eviction-hard flag to configure a set of hard eviction thresholds like memory.available<1Gi.

      The kubelet has the following default hard eviction thresholds:

      • memory.available<100Mi

      • nodefs.available<10%

      • imagefs.available<15%

      • nodefs.inodesFree<5% (Linux nodes)
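
To illustrate how a threshold of the form [eviction-signal][operator][quantity] is evaluated, here is a minimal Python sketch. It handles only the < operator and the Ki, Mi and Gi suffixes, and the function names are assumptions for this example rather than anything in the kubelet.

```python
GI = 2**30
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def threshold_bytes(quantity, capacity):
    """Resolve a literal ("1Gi") or percentage ("10%") quantity to bytes."""
    if quantity.endswith("%"):
        return capacity * float(quantity[:-1]) / 100
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity)


def threshold_met(expression, available, capacity):
    """True if the signal value has dropped below the threshold."""
    _signal, quantity = expression.split("<", 1)
    return available < threshold_bytes(quantity, capacity)


capacity = 10 * GI
# On a 10Gi node, 10% and 1Gi describe the same threshold.
print(threshold_bytes("10%", capacity) == threshold_bytes("1Gi", capacity))  # True
print(threshold_met("memory.available<1Gi", 2 * GI, capacity))               # False
print(threshold_met("memory.available<10%", GI // 2, capacity))              # True
```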

4.3.1. Pod selection for kubelet eviction

If the kubelet’s attempts to reclaim node-level resources don’t bring the eviction signal below the threshold, the kubelet begins to evict end-user pods.

The kubelet uses the following parameters to determine pod eviction order:

  • Whether the pod’s resource usage exceeds requests

  • Pod Priority

  • The pod’s resource usage relative to requests

As a result, kubelet ranks and evicts pods in the following order:

  • BestEffort or Burstable pods where the usage exceeds requests. These pods are evicted based on their Priority and then by how much their usage level exceeds the request.

  • Guaranteed pods and Burstable pods where the usage is less than requests are evicted last, based on their Priority.

The kubelet does not use the pod’s QoS class to determine the eviction order. You can use the QoS class to estimate the most likely pod eviction order when reclaiming resources like memory. QoS does not apply to EphemeralStorage requests, so the above scenario will not apply if the node is, for example, under DiskPressure.
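
The ranking described above can be modelled as a three-part sort key. The Python sketch below is an approximation for a single resource under the stated ordering, not the kubelet's implementation; the Pod dataclass, its fields and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    priority: int
    request: int   # requested amount of the starved resource
    usage: int     # current usage of the starved resource


def eviction_order(pods):
    """Rank pods from first evicted to last evicted."""
    return sorted(
        pods,
        key=lambda p: (
            p.usage <= p.request,      # pods exceeding their requests go first
            p.priority,                # then lower Priority first
            -(p.usage - p.request),    # then the largest excess first
        ),
    )


pods = [
    Pod("within-request", priority=0, request=100, usage=50),
    Pod("high-priority-hog", priority=1000, request=100, usage=500),
    Pod("small-excess", priority=0, request=100, usage=150),
    Pod("large-excess", priority=0, request=100, usage=300),
]

print([p.name for p in eviction_order(pods)])
# ['large-excess', 'small-excess', 'high-priority-hog', 'within-request']
```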

5. Kubernetes Autoscaling

The foundation of building cost-optimized applications is spreading the cost-saving culture across teams. Beyond moving cost discussions to the beginning of the development process, this approach forces you to better understand the environment that your applications are running in—in this context, the GKE environment.

In order to achieve low cost and application stability, you must correctly set or tune some features and configurations (such as autoscaling, machine types, and region selection). Another important consideration is your workload type because, depending on the workload type and your application’s requirements, you must apply different configurations in order to further lower your costs. Finally, you must monitor your spending and create guardrails so that you can enforce best practices early in your development cycle.

Kubernetes has three scalability tools. Two of these, the Horizontal pod autoscaler (HPA) and the Vertical pod autoscaler (VPA), function on the application abstraction layer. The cluster autoscaler (CA) works on the infrastructure layer.

5.1. Horizontal Pod Autoscaler

Horizontal Pod Autoscaler (HPA) is meant for scaling applications that are running in Pods based on metrics that express load. You can configure either CPU utilization or other custom metrics (for example, requests per second). In short, HPA adds and deletes Pod replicas, and it is best suited for stateless workers that can spin up quickly to react to usage spikes, and shut down gracefully to avoid workload instability.
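
The Kubernetes documentation describes the core HPA calculation as the current replica count scaled by the ratio of the observed metric to the target metric, rounded up. The Python sketch below shows that ratio only; it ignores the tolerance band, stabilization windows and min/max replica bounds that the real controller also applies, and the function name desired_replicas is an assumption for this example.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric):
    """Replica count suggested by the ratio of observed to target metric."""
    return math.ceil(current_replicas * current_metric / target_metric)


# 4 replicas at 90% average CPU utilization with a 60% target: scale up.
print(desired_replicas(4, 90, 60))   # 6
# 5 replicas at 30% average CPU utilization with a 60% target: scale down.
print(desired_replicas(5, 30, 60))   # 3
```

A lower target utilization leaves more headroom per Pod, at the cost of running more replicas for the same load.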

Even if your application starts in a matter of seconds, leave headroom in your target utilization: extra time is still needed when Cluster Autoscaler adds new nodes to your cluster or when Pods are throttled due to lack of resources.

The following are best practices for enabling HPA in your application:

  • Size your application correctly by setting appropriate resource requests and limits.

  • Set your target utilization to reserve a buffer that can handle requests during a spike.

  • Make sure your application starts as quickly as possible and shuts down according to Kubernetes expectations.

  • Set meaningful readiness and liveness probes.

  • Make sure that your Metrics Server is always up and running.

  • Inform clients of your application that they must consider implementing exponential retries for handling transient issues.

Make sure your applications are shutting down according to Kubernetes expectations:

  • Don’t stop accepting new requests right after SIGTERM.

    Your application must not stop immediately, but instead finish all requests that are in flight and still listen to incoming connections that arrive after the Pod termination begins. It might take a while for Kubernetes to update all kube-proxies and load balancers. If your application terminates before these are updated, some requests might cause errors on the client side.

  • If your application doesn’t follow the preceding practice, use the preStop hook.

    Most programs don’t stop accepting requests right away. However, if you’re using third-party code or are managing a system that you don’t have control over, such as nginx, the preStop hook is a good option for triggering a graceful shutdown without modifying the application. One common strategy is to execute, in the preStop hook, a sleep of a few seconds to postpone the SIGTERM. This gives Kubernetes extra time to finish the Pod deletion process, and reduces connection errors on the client side.

  • Handle SIGTERM for cleanups.

    If your application must clean up or has an in-memory state that must be persisted before the process terminates, now is the time to do it. Different programming languages have different ways to catch this signal, so find the right way in your language.

  • Configure terminationGracePeriodSeconds to fit your application needs.

    Some applications need more than the default 30 seconds to finish. In this case, you must specify terminationGracePeriodSeconds. High values might increase time for node upgrades or rollouts, for example. Low values might not allow enough time for Kubernetes to finish the Pod termination process. Either way, we recommend that you set your application’s termination period to less than 10 minutes because Cluster Autoscaler honors it for 10 minutes only.

  • If your application uses container-native load balancing, start failing your readiness probe when you receive a SIGTERM.

    This action directly signals load balancers to stop forwarding new requests to the backend Pod. Depending on the race between health check configuration and endpoint programming, the backend Pod might be taken out of traffic earlier.
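
As a minimal sketch of the preStop and terminationGracePeriodSeconds practices above, the following Pod manifest postpones SIGTERM with a short sleep. The Pod name, image tag, and durations are illustrative assumptions, not values prescribed by this guide; tune them for your workload.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-web                   # hypothetical name
spec:
  # Budget for the preStop hook plus SIGTERM handling before SIGKILL is sent.
  terminationGracePeriodSeconds: 45    # assumed value; the default is 30
  containers:
  - name: web
    image: nginx:stable                # assumed image
    ports:
    - containerPort: 80
    lifecycle:
      preStop:
        exec:
          # Delay SIGTERM so kube-proxies and load balancers can be updated first.
          command: ["sh", "-c", "sleep 10"]   # assumed delay
    readinessProbe:
      httpGet:
        path: /
        port: 80
      periodSeconds: 5
```

The sleep counts against terminationGracePeriodSeconds, so the grace period should cover the preStop delay plus the time the application needs to finish in-flight requests.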

5.2. Vertical Pod Autoscaler

Unlike HPA, which adds and deletes Pod replicas for rapidly reacting to usage spikes, Vertical Pod Autoscaler (VPA) observes Pods over time and gradually finds the optimal CPU and memory resources required by the Pods. Setting the right resources is important for stability and cost efficiency. If your Pod resources are too small, your application can either be throttled or it can fail due to out-of-memory errors. If your resources are too large, you have waste and, therefore, larger bills. VPA is meant for stateless and stateful workloads not handled by HPA or when you don’t know the proper Pod resource requests.

[Figure: Vertical Pod Autoscaler (VPA)]

VPA can work in three different modes:

  • Off:

    In this mode, also known as recommendation mode, VPA does not apply any change to your Pod. The recommendations are calculated and can be inspected in the VPA object.

  • Initial:

    VPA assigns resource requests only at Pod creation and never changes them later.

  • Auto:

    VPA updates CPU and memory requests during the life of a Pod. To apply an update, the Pod is deleted, CPU and memory are adjusted, and then a new Pod is started.

If you plan to use VPA, the best practice is to start with the Off mode for pulling VPA recommendations. Make sure it has been running for at least 24 hours, ideally one week or more, before pulling recommendations. Then, only when you feel confident, consider switching to either Initial or Auto mode.
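
A VPA object in Off mode might look like the following sketch. The object name and the target Deployment (my-app) are hypothetical.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa            # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app              # hypothetical Deployment to observe
  updatePolicy:
    updateMode: "Off"         # recommendation mode: Pods are never changed
```

After the observation period, the recommendations can be read from the status of the VPA object, for example with kubectl describe vpa my-app-vpa.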

Follow these best practices for enabling VPA, either in Initial or Auto mode, in your application:

  • Don’t use VPA in either Initial or Auto mode if you need to handle sudden spikes in traffic. Use HPA instead.

  • Make sure your application can grow vertically. Set minimum and maximum container sizes in the VPA objects to avoid the autoscaler making significant changes when your application is not receiving traffic.

  • Don’t make abrupt changes, such as dropping the Pod’s replicas from 30 to 5 all at once. This kind of change requires a new deployment, new label set, and new VPA object.

  • Make sure your application starts as quickly as possible and shuts down according to Kubernetes expectations.

  • Set meaningful readiness and liveness probes.

  • Make sure that your Metrics Server is always up and running.

  • Inform clients of your application that they must consider implementing exponential retries for handling transient issues.

  • Consider using node auto-provisioning along with VPA so that if a Pod gets too large to fit into existing machine types, Cluster Autoscaler provisions larger machines to fit the new Pod.
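
The minimum and maximum container sizes mentioned in the list above are set in the resourcePolicy of the VPA object. The following is a sketch only: the names and the CPU and memory bounds are assumptions and should be derived from your own recommendations.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa            # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app              # hypothetical Deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"      # apply to every container in the Pod
      controlledResources: ["cpu", "memory"]
      minAllowed:             # assumed lower bound
        cpu: 100m
        memory: 128Mi
      maxAllowed:             # assumed upper bound
        cpu: "2"
        memory: 2Gi
```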

If you are considering using Auto mode, make sure you also follow these practices:

  • Make sure your application can be restarted while receiving traffic.

  • Add a Pod Disruption Budget (PDB) to control how many Pods can be taken down at the same time.
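
A PDB for an application could be sketched as follows. The name, the app: my-app label, and the maxUnavailable value are assumptions chosen for illustration.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb            # hypothetical name
spec:
  maxUnavailable: 1           # assumed: at most one Pod taken down at a time
  selector:
    matchLabels:
      app: my-app             # hypothetical label; must match the Pods of your workload
```

With this budget in place, voluntary disruptions such as VPA evictions and node drains take down no more than one matching Pod at a time.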

5.3. Cluster Autoscaler

Cluster Autoscaler (CA) automatically resizes the underlying compute infrastructure. CA provides nodes for Pods that don’t have a place to run in the cluster and removes under-utilized nodes. CA is optimized for the cost of infrastructure. In other words, if there are two or more node types in the cluster, CA chooses the least expensive one that fits the given demand.

Unlike HPA and VPA, CA doesn’t depend on load metrics. Instead, it’s based on scheduling simulation and declared Pod requests. It’s a best practice to enable CA whenever you are using either HPA or VPA. This practice ensures that if your Pod autoscalers determine that you need more capacity, your underlying infrastructure grows accordingly.

[Figure: Cluster Autoscaler (CA)]

As these diagrams show, CA automatically adds and removes compute capacity to handle traffic spikes and save you money when your customers are sleeping. It is a best practice to define Pod Disruption Budget (PDB) for all your applications. It is particularly important at the CA scale-down phase when PDB controls the number of replicas that can be taken down at one time.

Certain Pods cannot be restarted by any autoscaler because restarting them would cause temporary disruption, so the node they run on can’t be deleted. For example, system Pods (such as metrics-server and kube-dns) and Pods using local storage won’t be restarted. However, you can change this behavior by defining PDBs for these system Pods and by setting the "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" annotation on Pods using local storage that are safe for the autoscaler to restart. Moreover, consider running long-lived Pods that can’t be restarted on a separate node pool, so they don’t block scale-down of other nodes. Finally, learn how to analyze CA events in the logs to understand why a particular scaling activity didn’t happen as expected.
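
As an illustration of the safe-to-evict annotation, the sketch below marks a Pod that uses local (emptyDir) storage as safe for the autoscaler to restart. The Pod name, image, and command are hypothetical; for workloads managed by a controller, the annotation belongs in the Pod template metadata.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache-worker          # hypothetical name
  annotations:
    # Tells Cluster Autoscaler that this Pod may be evicted during scale-down.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
spec:
  containers:
  - name: worker
    image: busybox:1.36       # assumed image
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir: {}              # local storage that would otherwise block scale-down
```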

The following is a summary of the best practices for enabling Cluster Autoscaler in your cluster:

  • Use either HPA or VPA to autoscale your workloads.

  • Make sure you are following the best practices described in the chosen Pod autoscaler.

  • Size your application correctly by setting appropriate resource requests and limits or use VPA.

  • Define a PDB for your applications.

  • Define a PDB for system Pods that might block your scale-down, such as kube-dns. To avoid temporary disruption in your cluster, don’t set a PDB for system Pods that have only 1 replica (such as metrics-server).

  • Run short-lived Pods and Pods that can be restarted in separate node pools, so that long-lived Pods don’t block their scale-down.

  • Avoid over-provisioning by configuring idle nodes in your cluster. For that, you must know your minimum capacity—for many companies it’s during the night—and set the minimum number of nodes in your node pools to support that capacity.

  • If you need extra capacity to handle requests during spikes, use pause Pods, which are discussed in Autoscaler and over-provisioning.

However, as noted in the Horizontal Pod Autoscaler section, scale-ups might take some time due to infrastructure provisioning. To visualize this difference in time and possible scale-up scenarios, consider the following image.

[Figure: scale-up scenarios]

When your cluster has enough room for deploying new Pods, one of the Workload scale-up scenarios is triggered. If an existing node has never run your application, it must download the container images before starting the Pod (scenario 1). However, if the node has already run your application and must start another Pod replica, the total scale-up time decreases because no image download is required (scenario 2).

When your cluster doesn’t have enough room for deploying new Pods, one of the Infrastructure and Workload scale-up scenarios is triggered. This means that Cluster Autoscaler must provision new nodes and start the required software before starting your application (scenario 1). If you use node auto-provisioning, depending on the workload scheduled, new node pools might be required. In this situation, the total scale-up time increases because Cluster Autoscaler has to provision both nodes and node pools (scenario 2).

For scenarios where new infrastructure is required, don’t squeeze your cluster too much. You must over-provision, but only enough to reserve the buffer needed to handle the expected peak requests during scale-ups.

There are two main strategies for this kind of over-provisioning:

  • Fine-tune the HPA utilization target. The following equation is a simple and safe way to find a good CPU target:

    (1 - buff)/(1 + perc)

    • buff is a safety buffer that you can set to avoid reaching 100% CPU. This variable is useful because reaching 100% CPU means that the latency of request processing is much higher than usual.

    • perc is the percentage of traffic growth you expect in two or three minutes.

    For example, if you expect a growth of 30% in your requests and you want to avoid reaching 100% of CPU by defining a 10% safety buffer, your formula would look like this:

    (1 - 0.1)/(1 + 0.3) ≈ 0.69

  • Configure pause Pods. There is no way to configure Cluster Autoscaler to spin up nodes upfront. Instead, you can set an HPA utilization target to provide a buffer to help handle spikes in load. However, if you expect large bursts, setting a small HPA utilization target might not be enough or might become too expensive.

    An alternative solution for this problem is to use pause Pods. Pause Pods are low-priority deployments that do nothing but reserve room in your cluster. Whenever a high-priority Pod is scheduled, pause Pods get evicted and the high-priority Pod immediately takes their place. The evicted pause Pods are then rescheduled, and if there is no room in the cluster, Cluster Autoscaler spins up new nodes to fit them. It’s a best practice to have only a single pause Pod per node. For example, if you are using 4-CPU nodes, configure the pause Pods’ CPU request with around 3200m.
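
Both strategies can be sketched as manifests. For the first strategy, the 0.69 result from the example corresponds to an HPA CPU utilization target of 69%. The object names and replica bounds below are assumptions.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app              # hypothetical Deployment
  minReplicas: 3              # assumed bounds
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 69   # (1 - 0.1)/(1 + 0.3), rounded
```

For the second strategy, pause Pods are commonly built from a negative-priority PriorityClass and a Deployment of the pause image. The names, priority value, replica count, and image tag are assumptions; the 3200m request follows the 4-CPU example above.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning      # hypothetical name
value: -1                     # lower than the default priority (0) of regular Pods
globalDefault: false
description: "Low priority for pause Pods that only reserve capacity."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 1                 # assumed: one pause Pod per node of headroom
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      terminationGracePeriodSeconds: 0   # evict immediately when room is needed
      containers:
      - name: reserve-resources
        image: registry.k8s.io/pause:3.9   # assumed tag
        resources:
          requests:
            cpu: 3200m        # roughly one 4-CPU node, minus system overhead
```

The number of replicas determines how many nodes of headroom are kept warm, so it trades scale-up latency against the cost of idle capacity.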