Kubernetes Networking
1. The Kubernetes network model
Networking is a central part of Kubernetes, but it can be challenging to understand exactly how it is expected to work. There are 4 distinct networking problems to address: [1]
- Highly-coupled container-to-container communications
- Pod-to-Pod communications
- Pod-to-Service communications
- External-to-Service communications
Kubernetes clusters require the allocation of non-overlapping IP addresses for Pods, Services and Nodes, from a range of available addresses configured in the following components: [1]
- The network plugin is configured to assign IP addresses to Pods.
- The kube-apiserver is configured to assign IP addresses to Services.
- The kubelet or the cloud-controller-manager is configured to assign IP addresses to Nodes.
The Kubernetes network model is built out of several pieces:
- Each pod in a cluster gets its own unique cluster-wide IP address.
  - A pod has its own private network namespace which is shared by all of the containers within the pod. Processes running in different containers in the same pod can communicate with each other over localhost.
- The pod network (also called a cluster network) handles communication between pods; a quick connectivity check appears after this list. It ensures that (barring intentional network segmentation):
  - All pods can communicate with all other pods, whether they are on the same node or on different nodes.
  - Pods can communicate with each other directly, without the use of proxies or address translation (NAT). On Windows, this rule does not apply to host-network pods.
  - Agents on a node (such as system daemons, or the kubelet) can communicate with all pods on that node.
- The Service API provides a stable (long-lived) IP address or hostname for a service implemented by one or more backend pods, where the individual pods making up the service can change over time.
  - Kubernetes automatically manages EndpointSlice objects to provide information about the pods currently backing a Service.
  - A service proxy implementation monitors the set of Service and EndpointSlice objects, and programs the data plane to route service traffic to its backends, by using operating system or cloud provider APIs to intercept or rewrite packets.
- The Gateway API (or its predecessor, Ingress) allows making Services accessible to clients that are outside the cluster.
  - A simpler, but less-configurable, mechanism for cluster ingress is available via the Service API's type: LoadBalancer, when using a supported cloud provider.
- NetworkPolicy is a built-in Kubernetes API that allows controlling traffic between pods, or between pods and the outside world.
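A quick way to observe the pod-to-pod guarantee in practice is to start two throwaway pods and ping one from the other. This is a minimal sketch, assuming a running multi-node cluster; the pod names and the busybox image tag are illustrative:
$ kubectl run pod-a --image=busybox:1.36 --restart=Never -- sleep 3600
$ kubectl run pod-b --image=busybox:1.36 --restart=Never -- sleep 3600
$ kubectl get pod -owide                      # note each pod's cluster-wide IP and node
$ kubectl exec pod-a -- ping -c 1 <pod-b-ip>  # direct pod-to-pod reachability, no NAT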
The network model is implemented by the container runtime on each node. The most common container runtimes use Container Network Interface (CNI) plugins to manage their network and security capabilities. [1]
Flannel is a simple and easy way to configure a layer 3 network fabric designed for Kubernetes.
- Flannel runs a small, single-binary agent called flanneld on each host, which is responsible for allocating a subnet lease to each host out of a larger, preconfigured address space.
- Packets are forwarded using one of several backend mechanisms, including VXLAN and various cloud integrations.
Virtual eXtensible LAN (VXLAN) is a network virtualization technology that uses a VLAN-like encapsulation technique to encapsulate OSI layer 2 Ethernet frames within layer 4 UDP datagrams, using 4789 as the default IANA-assigned destination UDP port number, although many implementations that predate the IANA assignment use port 8472 (Wikipedia).
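One hedged way to see this encapsulation on the wire is to capture traffic on a node's physical interface while pod traffic crosses nodes. Flannel's VXLAN backend defaults to the Linux kernel's legacy UDP port 8472 rather than the IANA-assigned 4789; the interface name below comes from the example nodes later in this section:
$ sudo tcpdump -ni enp0s3 udp port 8472   # VXLAN-encapsulated pod-to-pod traffic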
$ kubectl get cm -n kube-flannel kube-flannel-cfg -ojsonpath='{.data.net-conf\.json}'
{
  "Network": "10.244.0.0/16", (1)
  "EnableNFTables": false,
  "Backend": {
    "Type": "vxlan" (2)
  }
}
$ kubectl get cm -n kube-flannel kube-flannel-cfg -ojsonpath='{.data.cni-conf\.json}'
{
  "name": "cbr0", (3)
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}
$ cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.0.1/24 (4)
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
$ ip a s cni0 (5)
103: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 9e:cd:7d:2d:fc:58 brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.1/24 brd 10.244.0.255 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::9ccd:7dff:fe2d:fc58/64 scope link
       valid_lft forever preferred_lft forever
1. The network CIDR for Flannel.
2. Specifies that the VXLAN backend is used for Flannel.
3. The cbr0 interface is typically created by CNI plugins that use a bridge network, such as the bridge plugin or older versions of Flannel when used in bridge mode.
4. The subnet assigned to the node.
5. cni0 is a bridge interface that is often created by CNI (Container Network Interface) plugins to connect pods to the host network.
$ kubectl get nodes -owide
NAME     STATUS   ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION   CONTAINER-RUNTIME
node-0   Ready    control-plane   76d   v1.31.3   192.168.71.40   <none>        Debian GNU/Linux 12 (bookworm)   6.1.0-28-amd64   containerd://1.7.25
node-2   Ready    control-plane   55m   v1.31.6   192.168.71.42   <none>        Debian GNU/Linux 12 (bookworm)   6.1.0-27-amd64   containerd://1.7.25
$ ip r # node-0
default via 192.168.71.1 dev enp0s3
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1 (1)
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink (2)
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
$ ip a show cni0
103: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether 9e:cd:7d:2d:fc:58 brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.1/24 brd 10.244.0.255 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::9ccd:7dff:fe2d:fc58/64 scope link
       valid_lft forever preferred_lft forever
$ ip a show flannel.1
102: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether e2:ae:60:46:68:e9 brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::e0ae:60ff:fe46:68e9/64 scope link
       valid_lft forever preferred_lft forever
$ ip r # node-2
default via 192.168.71.1 dev enp0s3 onlink
10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1 (1)
10.244.0.0/24 via 10.244.0.0 dev flannel.1 onlink (2)
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.71.0/24 dev enp0s3 proto kernel scope link src 192.168.71.42
1. proto kernel: indicates that this route was added by the kernel. src 10.244.0.1: specifies the source IP address for traffic on this route.
2. dev flannel.1: specifies that the flannel.1 interface (a VXLAN interface used by Flannel) should be used.
- Flannel does not control how containers are networked to the host, only how the traffic is transported between hosts.
- Flannel does provide a CNI plugin for Kubernetes.
2. Network Utilities for Debugging Containers & Kubernetes
A simple and stupid set of network utilities for debugging containers & Kubernetes.
kubectl create -n default deployment net-tools \
--image docker.io/qqbuby/net-tools:2.3 \
-- tail -f /dev/null
$ kubectl get po -l app=net-tools -w
NAME READY STATUS RESTARTS AGE
net-tools-8569ddf9fd-dn6wf 1/1 Running 0 8m37s
$ kubectl exec net-tools-8569ddf9fd-dn6wf -- ip r
default via 10.244.1.1 dev eth0
10.244.0.0/16 via 10.244.1.1 dev eth0
10.244.1.0/24 dev eth0 proto kernel scope link src 10.244.1.4
$ kubectl exec net-tools-8569ddf9fd-dn6wf -- dig +search kubernetes
; <<>> DiG 9.18.24-1-Debian <<>> +search kubernetes
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53501
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: ca5d92ec874c299a (echoed)
;; QUESTION SECTION:
;kubernetes.default.svc.cluster.local. IN A
;; ANSWER SECTION:
kubernetes.default.svc.cluster.local. 30 IN A 10.96.0.1
;; Query time: 1 msec
;; SERVER: 10.96.0.10#53(10.96.0.10) (UDP)
;; WHEN: Thu Feb 29 05:18:31 UTC 2024
;; MSG SIZE rcvd: 129
3. Services in Kubernetes
In Kubernetes, a Service is a method for exposing a network application that is running as one or more Pods in your cluster. [3]
Kubernetes Service types allow you to specify what kind of Service you want.
- ClusterIP
  Exposes the Service on a cluster-internal IP. Choosing this value makes the Service only reachable from within the cluster. This is the default that is used if you don't explicitly specify a type for a Service. You can expose the Service to the public internet using an Ingress or a Gateway.
- NodePort
  Exposes the Service on each Node's IP at a static port (the NodePort). To make the node port available, Kubernetes sets up a cluster IP address, the same as if you had requested a Service of type: ClusterIP. (A quick reachability check appears after this list.)
$ cat << EOF | kubectl apply -f -
> apiVersion: v1
> kind: Service
> metadata:
>   name: my-service
> spec:
>   ports:
>   - port: 80
>     targetPort: 9376
>   type: NodePort
> EOF
service/my-service configured
$ kubectl get svc my-service
NAME         TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
my-service   NodePort   10.98.44.45   <none>        80:31868/TCP   7s
- LoadBalancer
  Exposes the Service externally using an external load balancer. Kubernetes does not directly offer a load balancing component; you must provide one, or you can integrate your Kubernetes cluster with a cloud provider.
- ExternalName
  Maps the Service to the contents of the externalName field (for example, to the hostname api.foo.bar.example). The mapping configures your cluster's DNS server to return a CNAME record with that external hostname value. No proxying of any kind is set up.
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: httpbin-org
spec:
  type: ExternalName
  externalName: httpbin.org
EOF
service/httpbin-org created
$ kubectl get svc httpbin-org
NAME          TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
httpbin-org   ExternalName   <none>       httpbin.org   <none>    12s
$ kubectl exec net-tools-8569ddf9fd-dn6wf -- dig +search +short httpbin-org CNAME
httpbin.org.
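As the reachability check promised above: because my-service is type: NodePort, it should answer on any node's IP at the allocated port. A hedged sketch, using the node IP and node port from the example output (the Service also needs a selector and running backends before anything will actually respond):
$ curl http://192.168.71.40:31868/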
If your workload speaks HTTP, you might choose to use an Ingress to control how web traffic reaches that workload. Ingress is not a Service type, but it acts as the entry point for your cluster. The Gateway API for Kubernetes provides extra capabilities beyond Ingress and Service.
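For illustration, a minimal Ingress sketch; it assumes an installed ingress controller (such as the ingress-nginx that appears in the iptables output later), and the host name and resource name are hypothetical:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress                # hypothetical name
spec:
  ingressClassName: nginx         # assumes an ingress-nginx controller is installed
  rules:
  - host: my-app.example.com      # hypothetical host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-service      # route HTTP traffic to the my-service Service
            port:
              number: 80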
3.1. Services without selectors
Services most commonly abstract access to Kubernetes Pods thanks to the selector, but when used with a corresponding set of EndpointSlice objects and without a selector, the Service can abstract other kinds of backends, including ones that run outside the cluster.
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  # Because this Service has no selector, the corresponding EndpointSlice (and
  # legacy Endpoints) objects are not created automatically.
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
---
# You can map the Service to the network address and port where it's
# running, by adding an EndpointSlice object manually.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: my-service-1 # by convention, use the name of the Service
                     # as a prefix for the name of the EndpointSlice
  labels:
    # You should set the "kubernetes.io/service-name" label.
    # Set its value to match the name of the Service
    kubernetes.io/service-name: my-service
addressType: IPv4
ports:
- name: '' # empty because port 9376 is not assigned as a well-known
           # port (by IANA)
  appProtocol: http
  protocol: TCP
  port: 9376
endpoints:
- addresses:
  - "10.4.5.6"
- addresses:
  - "10.1.2.3"
3.2. Headless Services
A Service is headless when you explicitly specify "None" for its cluster IP address (.spec.clusterIP). For headless Services, a cluster IP is not allocated, kube-proxy does not handle them, and there is no load balancing or proxying done by the platform for them. How DNS is automatically configured depends on whether the Service has selectors defined:
- With selectors
  For headless Services that define selectors, the endpoints controller creates EndpointSlices in the Kubernetes API, and modifies the DNS configuration to return A or AAAA records (IPv4 or IPv6 addresses) that point directly to the Pods backing the Service.
- Without selectors
  For headless Services that do not define selectors, the control plane does not create EndpointSlice objects. However, the DNS system looks for and configures either:
  - DNS CNAME records for type: ExternalName Services.
  - DNS A / AAAA records for all IP addresses of the Service's ready endpoints, for all Service types other than ExternalName.
    - For IPv4 endpoints, the DNS system creates A records.
    - For IPv6 endpoints, the DNS system creates AAAA records.
  - When you define a headless Service without a selector, the port must match the targetPort.
3.3. Virtual IPs and Service Proxies
Every node in a Kubernetes cluster runs a kube-proxy (unless you have deployed your own alternative component in place of kube-proxy
). [4]
The kube-proxy
component is responsible for implementing a virtual IP
mechanism for Services of type
other than ExternalName
.
- Each instance of kube-proxy watches the Kubernetes control plane for the addition and removal of Service and EndpointSlice objects.
- For each Service, kube-proxy calls appropriate APIs (depending on the kube-proxy mode) to configure the node to capture traffic to the Service's clusterIP and port, and redirect that traffic to one of the Service's endpoints (usually a Pod, but possibly an arbitrary user-provided IP address).
- A control loop ensures that the rules on each node are reliably synchronized with the Service and EndpointSlice state as indicated by the API server.
The kube-proxy starts up in different modes, which are determined by its configuration; a configuration sketch follows the lists below.
On Linux nodes, the available modes for kube-proxy are:
- iptables
  A mode where kube-proxy configures packet forwarding rules using iptables.
- ipvs
  A mode where kube-proxy configures packet forwarding rules using IPVS.
- nftables
  A mode where kube-proxy configures packet forwarding rules using nftables.
There is only one mode available for kube-proxy on Windows:
- kernelspace
  A mode where kube-proxy configures packet forwarding rules in the Windows kernel.
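As the configuration sketch mentioned above: the mode is typically selected through the kube-proxy configuration file (a KubeProxyConfiguration object). A minimal, hedged sketch showing only the relevant field; other required fields are omitted:
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables"   # or "ipvs" / "nftables" on Linux, "kernelspace" on Windows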
3.3.1. iptables proxy mode
In iptables mode, kube-proxy configures packet forwarding rules using the iptables API of the kernel netfilter subsystem.
- When kube-proxy on a node sees a new Service, it installs a series of iptables rules which redirect from the virtual IP address to more iptables rules, defined per Service.
- The per-Service rules link to further rules for each backend endpoint, and the per-endpoint rules redirect traffic (using destination NAT) to the backends.
- When a client connects to the Service's virtual IP address, the iptables rule kicks in. A backend is chosen (either based on session affinity or randomly) and packets are redirected to the backend without rewriting the client IP address.
Check the kube-proxy mode with the /proxyMode endpoint:
$ curl localhost:10249/proxyMode
iptables
$ sudo iptables -t nat -n -L KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-SVC-ERIFXISQEP7F7OF4 6 -- 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
KUBE-SVC-JD5MR3NA4I4DYORP 6 -- 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
KUBE-SVC-Z4ANX4WAEWEBLCTM 6 -- 0.0.0.0/0 10.109.25.21 /* kube-system/metrics-server:https cluster IP */ tcp dpt:443
KUBE-SVC-CG5I4G2RS3ZVWGLK 6 -- 0.0.0.0/0 10.107.96.185 /* ingress-nginx/ingress-nginx-controller:http cluster IP */ tcp dpt:80
KUBE-SVC-EDNDUDH2C75GIR6O 6 -- 0.0.0.0/0 10.107.96.185 /* ingress-nginx/ingress-nginx-controller:https cluster IP */ tcp dpt:443
KUBE-SVC-NPX46M4PTMTKRN6Y 6 -- 0.0.0.0/0 10.96.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-SVC-EZYNCFY2F7N6OQA2 6 -- 0.0.0.0/0 10.103.76.154 /* ingress-nginx/ingress-nginx-controller-admission:https-webhook cluster IP */ tcp dpt:443
KUBE-SVC-LWGIUP67CTAM2576 6 -- 0.0.0.0/0 10.107.96.185 /* ingress-nginx/ingress-nginx-controller:prometheus cluster IP */ tcp dpt:10254
KUBE-SVC-TCOU7JCQXEZGVUNU 17 -- 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
KUBE-NODEPORTS 0 -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
$ sudo iptables -t nat -n -L KUBE-SVC-ERIFXISQEP7F7OF4
Chain KUBE-SVC-ERIFXISQEP7F7OF4 (1 references)
target prot opt source destination
KUBE-MARK-MASQ 6 -- !10.244.0.0/16 10.96.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
KUBE-SEP-YXU7ECKUN6RQCSDC 0 -- 0.0.0.0/0 0.0.0.0/0 /* kube-system/kube-dns:dns-tcp -> 10.244.1.99:53 */ statistic mode random probability 0.50000000000
KUBE-SEP-C4WJXZ3GDNSPOCVX 0 -- 0.0.0.0/0 0.0.0.0/0 /* kube-system/kube-dns:dns-tcp -> 10.244.2.142:53 */
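Following one of the per-endpoint (KUBE-SEP) chains shows the final destination-NAT rule toward a backend pod. The output below is abbreviated and illustrative of the rule's shape, using the endpoint address from the chain comments above:
$ sudo iptables -t nat -n -L KUBE-SEP-YXU7ECKUN6RQCSDC
Chain KUBE-SEP-YXU7ECKUN6RQCSDC (1 references)
target          prot opt source        destination
KUBE-MARK-MASQ  0    --  10.244.1.99   0.0.0.0/0   /* kube-system/kube-dns:dns-tcp */
DNAT            6    --  0.0.0.0/0     0.0.0.0/0   /* kube-system/kube-dns:dns-tcp */ tcp to:10.244.1.99:53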
3.4. Traffic policies
You can set the .spec.internalTrafficPolicy and .spec.externalTrafficPolicy fields to control how Kubernetes routes traffic to healthy ("ready") backends.
- Internal traffic policy
  FEATURE STATE: Kubernetes v1.26 [stable]
  You can set the .spec.internalTrafficPolicy field to control how traffic from internal sources is routed. Valid values are Cluster and Local. Set the field to Cluster to route internal traffic to all ready endpoints, and Local to only route to ready node-local endpoints. If the traffic policy is Local and there are no node-local endpoints, traffic is dropped by kube-proxy.
  Service Internal Traffic Policy enables internal traffic restrictions to only route internal traffic to endpoints within the node the traffic originated from. The "internal" traffic here refers to traffic originated from Pods in the current cluster. [5]
- External traffic policy
  You can set the .spec.externalTrafficPolicy field to control how traffic from external sources is routed. Valid values are Cluster and Local. Set the field to Cluster to route external traffic to all ready endpoints, and Local to only route to ready node-local endpoints. If the traffic policy is Local and there are no node-local endpoints, kube-proxy does not forward any traffic for the relevant Service.
- Traffic to terminating endpoints
  FEATURE STATE: Kubernetes v1.28 [stable]
  If the ProxyTerminatingEndpoints feature gate is enabled in kube-proxy and the traffic policy is Local, that node's kube-proxy uses a more complicated algorithm to select endpoints for a Service. With the feature enabled, kube-proxy checks if the node has local endpoints and whether or not all the local endpoints are marked as terminating. If there are local endpoints and all of them are terminating, then kube-proxy will forward traffic to those terminating endpoints. Otherwise, kube-proxy will always prefer forwarding traffic to endpoints that are not terminating.
4. DNS for Services and Pods
Kubernetes creates DNS records for Services and Pods. You can contact Services with consistent DNS names instead of IP addresses. [6]
Kubernetes publishes information about Pods and Services which is used to program DNS. The kubelet configures each Pod's DNS so that running containers can look up Services by name rather than IP.
Services defined in the cluster are assigned DNS names. By default, a client Pod’s DNS search list includes the Pod’s own namespace and the cluster’s default domain.
A DNS query may return different results based on the namespace of the Pod making it. DNS queries that don't specify a namespace are limited to the Pod's namespace. Access Services in other namespaces by specifying the namespace in the DNS query.
DNS queries may be expanded using the Pod's /etc/resolv.conf, which the kubelet configures for each Pod:
nameserver 10.32.0.10
search <namespace>.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
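With this search list, a short name such as kube-dns.kube-system is expanded until it matches kube-dns.kube-system.svc.cluster.local. For example, from the net-tools pod in the default namespace (the kube-dns cluster IP matches the earlier examples):
$ kubectl exec net-tools-8569ddf9fd-dn6wf -- dig +search +short kube-dns.kube-system
10.96.0.10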
Services
- A/AAAA records
  "Normal" (not headless) Services are assigned DNS A and/or AAAA records, depending on the IP family or families of the Service, with a name of the form my-svc.my-namespace.svc.cluster-domain.example.
  Headless Services (without a cluster IP) are also assigned DNS A and/or AAAA records, with a name of the form my-svc.my-namespace.svc.cluster-domain.example. Unlike normal Services, this resolves to the set of IPs of all of the Pods selected by the Service.
- SRV records
  SRV records are created for named ports that are part of normal or headless Services. For each named port, the SRV record has the form _port-name._port-protocol.my-svc.my-namespace.svc.cluster-domain.example.
$ kubectl exec net-tools-8569ddf9fd-dn6wf -- nslookup _https._tcp.kubernetes
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   _https._tcp.kubernetes.default.svc.cluster.local
Address: 10.96.0.1
Pods
- A/AAAA records
  Kube-DNS versions prior to the implementation of the DNS specification had the following DNS resolution: pod-ipv4-address.my-namespace.pod.cluster-domain.example.
  Any Pods exposed by a Service have the following DNS resolution available: pod-ipv4-address.service-name.my-namespace.svc.cluster-domain.example.
$ kubectl get pod -l app=net-tools -owide
NAME                         READY   STATUS    RESTARTS   AGE   IP             NODE     NOMINATED NODE   READINESS GATES
net-tools-8569ddf9fd-dn6wf   1/1     Running   0          20m   10.244.1.101   node-2   <none>           <none>
$ kubectl exec net-tools-8569ddf9fd-dn6wf -- nslookup 10-244-1-101.default.pod
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   10-244-1-101.default.pod.cluster.local
Address: 10.244.1.101
References
- [1] https://kubernetes.io/docs/concepts/cluster-administration/networking/
- [2] https://kubernetes.io/docs/concepts/services-networking/
- [3] https://kubernetes.io/docs/concepts/services-networking/service/
- [4] https://kubernetes.io/docs/reference/networking/virtual-ips/
- [5] https://kubernetes.io/docs/concepts/services-networking/service-traffic-policy/
- [6] https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
- https://www.kernel.org/doc/html/v5.12/networking/tuntap.html
- https://developers.redhat.com/blog/2019/05/17/an-introduction-to-linux-virtual-interfaces-tunnels
- https://www.stackrox.io/blog/kubernetes-networking-demystified/
- https://sookocheff.com/post/kubernetes/understanding-kubernetes-networking-model/
- https://www.vmware.com/topics/glossary/content/kubernetes-networking