IPVS

iptables was the first implementation of kube-proxy’s dataplane, but over time its limitations have become more pronounced, especially when operating at scale. Implementing a proxy on top of something that was designed to be a firewall has several side-effects, the main one being a limited set of data structures. This manifests itself in every ClusterIP Service requiring its own unique entry; these entries can’t be grouped and have to be processed sequentially as chains of rules. As a result, any dataplane lookup or create/update/delete operation needs to traverse a chain until a match is found, which at a large enough scale can result in minutes of added processing time.
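
To make the sequential processing more concrete, below is a rough, illustrative sketch of the kind of ruleset kube-proxy builds in the iptables mode; the chain names, addresses and probabilities are made up for brevity, but the structure (one KUBE-SVC chain per Service, one KUBE-SEP chain per Endpoint, each matched one rule at a time) is what grows linearly with the cluster:

# Illustrative fragment only -- names and values are not from a real cluster
-A KUBE-SERVICES -d 10.96.0.100/32 -p tcp --dport 80 -j KUBE-SVC-WEB
-A KUBE-SERVICES -d 10.96.0.10/32 -p udp --dport 53 -j KUBE-SVC-DNS
# ... one rule per Service, evaluated in order ...
-A KUBE-SVC-WEB -m statistic --mode random --probability 0.5 -j KUBE-SEP-POD1
-A KUBE-SVC-WEB -j KUBE-SEP-POD2
-A KUBE-SEP-POD1 -p tcp -j DNAT --to-destination 10.244.1.10:80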

Detailed performance analysis and measurement results of running iptables at scale can be found in the Additional Reading section at the bottom of the page.

All this led to ipvs being added as an enhancement proposal and eventually graduating to GA in Kubernetes version 1.11. The new dataplane implementation offers a number of improvements over the existing iptables mode:

  • All Service load-balancing is migrated to IPVS which can perform in-kernel lookups and masquerading in constant time, regardless of the number of configured Services or Endpoints.

  • The remaining iptables rules have been re-engineered to make use of ipset, making the lookups more efficient.

  • Multiple additional load-balancer scheduling modes are now available, the default being a simple round-robin (see the configuration sketch below).
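
The scheduler can be selected in kube-proxy’s configuration. The snippet below is a minimal sketch of the relevant KubeProxyConfiguration fields, assuming the least-connection scheduler is wanted instead of the default:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "lc"    # empty or "rr" means round-robin; other options include wrr, lc, sh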

On the surface, this makes the decision to use ipvs an obvious one; however, since iptables has been the default mode for so long, some of its quirks and undocumented side-effects have become the de facto standard. One of the fortunate side-effects of the iptables mode is that a ClusterIP is never bound to any kernel interface and remains completely virtual (as a NAT rule). When ipvs changed this behaviour by introducing the dummy kube-ipvs0 interface, it became possible for processes inside Pods to access any host-local service bound to 0.0.0.0 simply by targeting any existing ClusterIP. Although this does make ipvs less safe by default, these risks can still be mitigated (e.g. by not binding to 0.0.0.0).
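
As a rough illustration of this behaviour, consider a process on the node listening on 0.0.0.0; the port number, the use of python3’s built-in web server and the way the client Pod is reached are all assumptions made purely for the sake of the example:

# On the node: start a listener bound to 0.0.0.0 (i.e. all local addresses)
docker exec -d k8s-guide-worker python3 -m http.server 8080 --bind 0.0.0.0

# From a Pod scheduled on that same node, the listener is reachable via any
# existing ClusterIP (e.g. the "kubernetes" Service IP) on the unused port
curl http://10.96.0.1:8080/

# Mitigation: bind host-local services to a specific address instead
docker exec -d k8s-guide-worker python3 -m http.server 8080 --bind 127.0.0.1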

The diagram below is a high-level and simplified view of two distinct datapaths for the same ClusterIP virtual service – one from a remote Pod and one from a host-local interface.

Lab Setup

Assuming that the lab environment is already set up, ipvs can be enabled with the following command:

make ipvs

Under the covers, the above command updates the proxier mode in kube-proxy’s ConfigMap, so in order for this change to get picked up, we need to restart all of the agents and flush out any existing iptables rules:

make flush-nat
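
To double-check what has changed, the proxier mode can be viewed directly in kube-proxy’s ConfigMap (the grep is just one way of narrowing down the output; the value should now read ipvs):

kubectl -n kube-system get configmap kube-proxy -o yaml | grep "mode:"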

Check the logs to make sure kube-proxy has loaded all of the required kernel modules. In case of a failure, the following error will be present in the logs and kube-proxy will fall back to the iptables mode:

$ make kube-proxy-logs | grep -i ipvs
E0626 17:19:43.491383       1 server_others.go:127] Can't use the IPVS proxier: IPVS proxier will not be used because the following required kernel modules are not loaded: [ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh]
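
The same kernel modules can also be checked for directly on the node; this is only a sanity check, since recent kube-proxy versions will normally load them automatically:

docker exec k8s-guide-worker lsmod | grep ip_vs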

Another way to confirm that the change has succeeded is to check that Nodes now have a new dummy ipvs device:

$ docker exec -it k8s-guide-worker ip link show kube-ipvs0 
7: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default
    link/ether 22:76:01:f0:71:9f brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 0
    dummy addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

One thing to remember when migrating from iptables to ipvs on an existing cluster (as opposed to rebuilding it from scratch) is that all of the KUBE-SVC/KUBE-SEP chains will still be there, at least until they are cleaned up manually or the node is rebooted.
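
One way to do this cleanup, sketched below under the assumption that a brief dataplane disruption is acceptable, is to use kube-proxy’s built-in --cleanup flag (which flushes the rules it created and exits) and then restart the kube-proxy Pods so that the ipvs rules get re-created:

# kube-proxy-xxxxx is a placeholder -- look up the real Pod name with:
# kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
kubectl -n kube-system exec kube-proxy-xxxxx -- kube-proxy --cleanup
kubectl -n kube-system rollout restart daemonset kube-proxy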

Spin up a test deployment and expose it as a ClusterIP Service:

kubectl create deploy web --image=nginx --replicas=2
kubectl expose deploy web --port 80

Check that all Pods are up and note the IP allocated to our Service:

$ kubectl get pod -owide -l app=web
NAME                  READY   STATUS    RESTARTS   AGE    IP           NODE                NOMINATED NODE   READINESS GATES
web-96d5df5c8-6bgpr   1/1     Running   0          111s   10.244.1.6   k8s-guide-worker    <none>           <none>
web-96d5df5c8-wkfrb   1/1     Running   0          111s   10.244.2.4   k8s-guide-worker2   <none>           <none>
$ kubectl get svc web
NAME   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
web    ClusterIP   10.96.119.228   <none>        80/TCP    92s

Before we move forward, there are a couple of dependencies we need to satisfy:

  1. Pick one of the Nodes hosting a test deployment and install the following packages:

docker exec k8s-guide-worker apt update
docker exec k8s-guide-worker apt install ipset ipvsadm -y

  2. On the same Node set up the following aliases to simplify access to iptables, ipvs and ipset:

alias ipt="docker exec k8s-guide-worker iptables -t nat -nvL"
alias ipv="docker exec k8s-guide-worker ipvsadm -ln"
alias ips="docker exec k8s-guide-worker ipset list"

Use case #1: Pod-to-Service communication

Any packet leaving a Pod will first pass through the PREROUTING chain, which is where kube-proxy intercepts all Service-bound traffic:

$  ipt PREROUTING
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
  128 12020 KUBE-SERVICES  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */
    0     0 DOCKER_OUTPUT  all  --  *      *       0.0.0.0/0            192.168.224.1
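
To see these counters increase, generate some traffic towards the Service with a throwaway client Pod (the image choice is arbitrary), keeping in mind that the counters will only move on the node where that Pod happens to get scheduled:

kubectl run tmp --rm -it --image=curlimages/curl --restart=Never -- curl -s -o /dev/null http://10.96.119.228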

The size of the KUBE-SERVICES chain is reduced compared to the iptables mode and the lookup stops once the destination IP is matched against the KUBE-CLUSTER-IP ipset:

$ ipt KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-MARK-MASQ  all  --  *      *      !10.244.0.0/16        0.0.0.0/0            /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst
    0     0 KUBE-NODE-PORT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst

This ipset contains all existing ClusterIPs and the lookup is performed in O(1) time:

$ ips KUBE-CLUSTER-IP
Name: KUBE-CLUSTER-IP
Type: hash:ip,port
Revision: 5
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 768
References: 2
Number of entries: 9
Members:
10.96.0.10,udp:53
10.96.0.1,tcp:443
10.96.0.10,tcp:53
10.96.148.225,tcp:80
10.96.68.46,tcp:3030
10.96.10.207,tcp:3030
10.96.0.10,tcp:9153
10.96.159.35,tcp:11211
10.96.119.228,tcp:80
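
The set can also be queried directly to confirm that our Service’s ClusterIP and port are a member; ipset test returns a match/no-match verdict for a single entry:

docker exec k8s-guide-worker ipset test KUBE-CLUSTER-IP 10.96.119.228,tcp:80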

Following the lookup in the PREROUTING chain, our packet reaches the routing decision stage. Since its destination ClusterIP is assigned to the local kube-ipvs0 dummy interface, the packet is treated as local traffic and gets intercepted by IPVS at Netfilter’s NF_INET_LOCAL_IN hook:

$ ipv
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.224.4:31730 rr
  -> 10.244.1.6:80                Masq    1      0          0
  -> 10.244.2.4:80                Masq    1      0          0
TCP  10.96.0.1:443 rr
  -> 192.168.224.3:6443           Masq    1      0          0
TCP  10.96.0.10:53 rr
  -> 10.244.0.3:53                Masq    1      0          0
  -> 10.244.0.4:53                Masq    1      0          0
TCP  10.96.0.10:9153 rr
  -> 10.244.0.3:9153              Masq    1      0          0
  -> 10.244.0.4:9153              Masq    1      0          0
TCP  10.96.10.207:3030 rr
  -> 10.244.1.4:3030              Masq    1      0          0
TCP  10.96.68.46:3030 rr
  -> 10.244.2.2:3030              Masq    1      0          0
TCP  10.96.119.228:80 rr
  -> 10.244.1.6:80                Masq    1      0          0
  -> 10.244.2.4:80                Masq    1      0          0
TCP  10.96.148.225:80 rr
  -> 10.244.1.6:80                Masq    1      0          0
  -> 10.244.2.4:80                Masq    1      0          0
TCP  10.96.159.35:11211 rr
  -> 10.244.1.3:11211             Masq    1      0          0
TCP  10.244.2.1:31730 rr
  -> 10.244.1.6:80                Masq    1      0          0
  -> 10.244.2.4:80                Masq    1      0          0
TCP  127.0.0.1:31730 rr
  -> 10.244.1.6:80                Masq    1      0          0
  -> 10.244.2.4:80                Masq    1      0          0
UDP  10.96.0.10:53 rr
  -> 10.244.0.3:53                Masq    1      0          8
  -> 10.244.0.4:53                Masq    1      0          8

This is where the packet gets DNAT’ed to the IP of the selected backend Pod (10.244.1.6 in our case) and continues on towards its destination, following the forwarding path built by the CNI plugin.
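
The result of the backend selection can also be observed in the IPVS connection table; the listing below is a sketch that assumes a connection to our Service is either active or still expiring:

docker exec k8s-guide-worker ipvsadm -lnc | grep 10.96.119.228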

Use case #2: Any-to-Service communication

Any host-local process trying to communicate with a ClusterIP will first have its packets traverse the OUTPUT and KUBE-SERVICES chains:

$ ipt OUTPUT
Chain OUTPUT (policy ACCEPT 5 packets, 300 bytes)
 pkts bytes target     prot opt in     out     source               destination
 1062 68221 KUBE-SERVICES  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */
  287 19636 DOCKER_OUTPUT  all  --  *      *       0.0.0.0/0            192.168.224.1
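
The easiest way to exercise this path is to connect to the ClusterIP straight from the node’s network namespace, assuming curl (or any other client) is available there; it can be installed with apt just like ipset and ipvsadm above:

docker exec k8s-guide-worker curl -s -o /dev/null -w "%{http_code}\n" http://10.96.119.228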

Since the source IP does not belong to the PodCIDR range, our packet takes a detour via the KUBE-MARK-MASQ chain:

$ ipt KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-MARK-MASQ  all  --  *      *      !10.244.0.0/16        0.0.0.0/0            /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst
    0     0 KUBE-NODE-PORT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst

Here the packet gets marked for future SNAT, to make sure it will have a return path from the Pod:

$ ipt KUBE-MARK-MASQ
Chain KUBE-MARK-MASQ (13 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 MARK       all  --  *      *       0.0.0.0/0            0.0.0.0/0            MARK or 0x4000

The following few steps are exactly the same as described for the previous use case:

  • The packet reaches the end of the KUBE-SERVICES chain.
  • The routing lookup returns a local dummy ipvs interface.
  • IPVS intercepts the packet and performs the backend selection and NATs the destination IP address.

The modified packet then continues along the forwarding path until it reaches the egress veth interface, where it gets picked up by the POSTROUTING chain:

$ ipt POSTROUTING
Chain POSTROUTING (policy ACCEPT 5 packets, 300 bytes)
 pkts bytes target     prot opt in     out     source               destination
 1199 80799 KUBE-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
    0     0 DOCKER_POSTROUTING  all  --  *      *       0.0.0.0/0            192.168.224.1
  920 61751 KIND-MASQ-AGENT  all  --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type !LOCAL /* kind-masq-agent: ensure nat POSTROUTING directs all non-LOCAL destination traffic to our custom KIND-MASQ-AGENT chain */

This is where the source IP of the packet gets modified to match that of the egress interface, so that the destination Pod knows where to send its reply:

$ ipt KUBE-POSTROUTING
Chain KUBE-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* Kubernetes endpoints dst ip:port, source ip for solving hairpin purpose */ match-set KUBE-LOOP-BACK dst,dst,src
    1    60 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
    0     0 MARK       all  --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
    0     0 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully

The final masquerading action is performed if the destination IP and port match one of the local Endpoints, which are stored in the KUBE-LOOP-BACK ipset:

$ ips KUBE-LOOP-BACK
Name: KUBE-LOOP-BACK
Type: hash:ip,port,ip
Revision: 5
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 360
References: 1
Number of entries: 2
Members:
10.244.1.2,tcp:3030,10.244.1.2
10.244.1.6,tcp:80,10.244.1.6
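
As before, membership can be verified with an explicit query; the triplet below is the destination IP, destination port and source IP of the web Pod that is local to this node:

docker exec k8s-guide-worker ipset test KUBE-LOOP-BACK 10.244.1.6,tcp:80,10.244.1.6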

It should be noted that, similar to the iptables mode, all of the above lookups are only performed for the first packet of a session; all subsequent packets follow a much shorter path in the conntrack subsystem.
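
If conntrack-tools are installed on the node (apt install conntrack), the cached session state can be inspected directly; the entry for our test session records both the original ClusterIP destination and the translated Pod IP:

docker exec k8s-guide-worker conntrack -L -d 10.96.119.228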

Additional reading

Scaling Kubernetes to Support 50,000 Services

Comparing kube-proxy modes: iptables or IPVS?

IPVS-Based In-Cluster Load Balancing Deep Dive