Calico is another example of a full-blown Kubernetes “networking solution”, with functionality that includes a network policy controller, a kube-proxy replacement and network traffic observability. CNI functionality is still the core element of Calico, and this chapter focuses on how it satisfies the Kubernetes network model requirements.

  • Connectivity is set up by creating a veth link and moving one side of that link into a Pod’s namespace. The other side of the link is left dangling in the node’s root namespace. For each local Pod, Calico sets up a PodIP host-route pointing over the veth link.

One oddity of Calico CNI is that the node end of the veth link does not have an IP address. To provide Pod-to-Node egress connectivity, each veth link is set up with proxy_arp, which makes the root namespace respond to any ARP request coming from the Pod (assuming that the node itself has a default route).

  • Reachability can be established in two different ways:

    1. Static routes and overlays – Calico supports IPIP and VXLAN and has an option to only set up tunnels for traffic crossing the L3 subnet boundary.

    2. BGP – the most popular choice for on-prem deployments. It works by configuring a BIRD BGP speaker on every node and setting up peerings to ensure that reachability information gets propagated to every node. There are several options for how to set up this peering, including a full mesh between nodes, a dedicated route-reflector node and external peering with the physical network.

The above two modes are not mutually exclusive; for example, BGP can be used together with IPIP in public cloud environments. For a complete list of networking options for both on-prem and public cloud environments, refer to this guide.
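The connectivity steps described above can be approximated by hand with iproute2. The following is only a rough sketch and not Calico's actual implementation: it prints the commands instead of running them, and the namespace name, interface name and Pod IP are hypothetical. The next hop is assumed to be 169.254.1.1, the link-local address Calico documents as the Pod gateway; the ee:ee:ee:ee:ee:ee MAC is the one that shows up in the lab output later in this chapter.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print iproute2 commands that roughly mirror the veth,
# proxy_arp and host-route wiring described above. Nothing is executed.
print_pod_wiring() {
  local ns="$1" host_if="$2" pod_ip="$3"
  local gw="169.254.1.1"   # assumed fake next hop, not taken from the lab output

  cat <<EOF
ip netns add ${ns}
ip link add ${host_if} type veth peer name eth0 netns ${ns}
ip link set ${host_if} address ee:ee:ee:ee:ee:ee
ip link set ${host_if} up
sysctl -w net.ipv4.conf.${host_if}.proxy_arp=1
ip -n ${ns} addr add ${pod_ip}/32 dev eth0
ip -n ${ns} link set eth0 up
ip -n ${ns} route add ${gw} dev eth0 scope link
ip -n ${ns} route add default via ${gw} dev eth0
ip route add ${pod_ip}/32 dev ${host_if} scope link
EOF
}

# Hypothetical namespace, interface name and Pod IP.
print_pod_wiring demo-pod cali-demo0 10.244.1.5
```

Note that the node side of the link never receives an IP address; the only node-side state is the proxy_arp flag and the /32 host-route.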

For demonstration purposes, we’ll use a BGP-based configuration option with an external off-cluster route reflector. The fully converged and populated IP and MAC tables will look like this:


Assuming that the lab environment is already set up, Calico can be enabled with the following command:

make calico 

Check that the calico-node daemonset has all pods in READY state:

$ kubectl -n calico-system get daemonset
calico-node   3         3         3       3            3    61s
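
The same readiness check can be scripted. This is a minimal sketch that assumes the input looks like `kubectl get daemonset --no-headers` output, where the second column is DESIRED and the fourth is READY; the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Succeeds only if every DaemonSet line on stdin has READY equal to DESIRED.
# Expected columns: NAME DESIRED CURRENT READY ...
ds_all_ready() {
  awk 'NF >= 4 { seen = 1; if ($2 != $4) bad = 1 }
       END { exit (seen && !bad) ? 0 : 1 }'
}

# Example using the output line shown above (no cluster needed).
if printf 'calico-node   3         3         3       3            3    61s\n' | ds_all_ready; then
  echo "calico-node is ready"
fi
```

Against a live cluster the function could be fed with `kubectl -n calico-system get daemonset --no-headers | ds_all_ready`.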

Now we need to “kick” all Pods to restart and pick up the new CNI plugin:

make nuke-all-pods

To make sure kube-proxy and Calico set up the right set of NAT rules, existing NAT tables need to be flushed and re-populated:

make flush-nat && make calico-restart

Build and start a GoBGP-based route reflector:

make gobgp-build && make gobgp-rr

Finally, reconfigure Calico’s BGP daemonset to peer with the GoBGP route reflector:

make gobgp-calico-patch 

Here’s how the information from the diagram can be validated (using worker2 as an example):

  1. Pod IP and default route
$ NODE=k8s-guide-worker2 make tshoot
bash-5.0# ip -4 -br addr show dev eth0
[email protected]         UP    

bash-5.0# ip route
default via dev eth0
dev eth0 scope link 

Note how the default route points to a fake next-hop address. This address is the same for all Pods and resolves to the same MAC address, which is configured on the node side of every veth link:

bash-5.0# ip neigh
dev eth0 lladdr ee:ee:ee:ee:ee:ee REACHABLE
  2. Node’s routing table
$ docker exec k8s-guide-worker2 ip route
default via dev eth0
via dev eth0 proto bird
dev calid7f7f4e15dd scope link
blackhole proto bird
dev calid599cd3d268 scope link
dev cali82aeec08a68 scope link
dev calid2e34ad38c6 scope link
dev cali4a822ce5458 scope link
dev cali0ad20b06c15 scope link
via dev eth0 proto bird
dev eth0 proto kernel scope link src 

A few interesting things to note in the above output:

  • The 2 x /24 routes programmed by bird are the PodCIDR ranges of the other two nodes.
  • The blackhole /24 route is the PodCIDR of the local node.
  • Inside the local PodCIDR there’s a /32 host-route configured for each running Pod.
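
The three observations above can be turned into a quick sanity check. The snippet below is an illustrative sketch only: the function name is made up, and the sample routing table uses invented addresses rather than the ones from the lab.

```shell
#!/usr/bin/env bash
# Summarise an `ip route` table read from stdin:
#   remote    = routes learned from bird via another node
#   blackhole = blackhole routes (the local PodCIDR)
#   host      = host-routes pointing at cali* veth interfaces
summarise_calico_routes() {
  awk '
    $1 == "blackhole"              { blackhole++; next }
    / proto bird/ && $2 == "via"   { remote++;    next }
    $2 == "dev" && $3 ~ /^cali/    { host++;      next }
    END { printf "remote=%d blackhole=%d host=%d\n", remote, blackhole, host }
  '
}

# Hypothetical sample; all addresses and interface names are made up.
summarise_calico_routes <<'EOF'
default via 172.18.0.1 dev eth0
10.244.0.0/24 via 172.18.0.3 dev eth0 proto bird
10.244.1.2 dev cali1111111111a scope link
blackhole 10.244.1.0/24 proto bird
10.244.1.3 dev cali2222222222b scope link
10.244.2.0/24 via 172.18.0.4 dev eth0 proto bird
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.2
EOF
```

On a three-node cluster like the one in this lab, a healthy node would be expected to report two remote routes, one blackhole route and one host-route per local Pod.
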
  3. BGP RIB of the GoBGP route reflector
$ docker exec gobgp gobgp global rib

   Network              Next Hop             AS_PATH              Age        Attrs
*>                                00:05:04   [{Origin: i} {LocalPref: 100}]
*>                                00:05:04   [{Origin: i} {LocalPref: 100}]
*>                                00:05:03   [{Origin: i} {LocalPref: 100}]

A day in the life of a Packet

Let’s track what happens when Pod-1 (actual name is net-tshoot-rg2lp) tries to talk to Pod-3 (net-tshoot-6wszq).

We’ll assume that the ARP and MAC tables are converged and fully populated. One way to get there is to ping Pod-3’s IP from Pod-1.

  1. Check the peer interface index of the veth link of Pod-1:
$ kubectl -n default exec net-tshoot-rg2lp -- ip link show dev eth0
3: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP mode DEFAULT group default 
    link/ether b2:24:13:ec:77:42 brd ff:ff:ff:ff:ff:ff link-netnsid 0

This information (if14) will be used below to identify the node side of the veth link.
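
The peer index can also be extracted programmatically. The helpers below are a small sketch with made-up function names; they assume the interface line has the usual `name@ifN` form printed by `ip link show` and `ip -o link`, and the sample lines are illustrative.

```shell
#!/usr/bin/env bash
# Print the peer interface index N from a "<idx>: <name>@ifN: ..." line on stdin.
peer_ifindex() {
  sed -n 's/^[0-9][0-9]*: [^@ ]*@if\([0-9][0-9]*\):.*/\1/p' | head -n 1
}

# Print the name of the interface with index $1, given `ip -o link` output on stdin.
ifname_by_index() {
  awk -F': ' -v idx="$1" '$1 == idx { sub(/@.*/, "", $2); print $2 }'
}

# Pod side: which index does eth0 point at?
printf '%s\n' '3: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410' | peer_ifindex

# Node side: which interface owns that index? (sample line, for illustration)
printf '%s\n' '14: calic8441ae7134@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410' | ifname_by_index 14
```

In the lab, the first helper would read from `kubectl exec ... -- ip link show dev eth0` and the second from `docker exec <node> ip -o link`.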

  2. Pod-1 wants to send a packet to Pod-3’s IP. Its network stack performs a route lookup:
$ kubectl -n default exec net-tshoot-rg2lp -- ip route get
via dev eth0 src uid 0 
  3. The next-hop IP is on eth0, so an ARP table lookup is needed to get the destination MAC:
$ kubectl -n default exec net-tshoot-rg2lp -- ip neigh show dev eth0 lladdr ee:ee:ee:ee:ee:ee STALE

As mentioned above, the node side of the veth link doesn’t have any IP configured:

$ docker exec k8s-guide-worker ip addr show dev if14       
14: calic8441ae7134@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP group default 
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-262ff521-1b00-b1c9-f0d5-0943a48a2ddc

So, in order to respond to an ARP request for the next-hop IP, all veth links have proxy ARP enabled:

$ docker exec k8s-guide-worker cat /proc/sys/net/ipv4/conf/calic8441ae7134/proxy_arp
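
To verify this for every Calico interface on a node rather than one at a time, a loop over the sysctl tree is enough. This is a sketch with a hypothetical function name; it assumes Calico interfaces start with `cali`, as they do in this lab, and the demo runs against a throwaway directory instead of the live /proc path.

```shell
#!/usr/bin/env bash
# Print every cali* interface whose proxy_arp flag is not 1.
# $1: sysctl conf directory; defaults to the live /proc path.
ifaces_without_proxy_arp() {
  local root="${1:-/proc/sys/net/ipv4/conf}" f
  for f in "$root"/cali*/proxy_arp; do
    [ -r "$f" ] || continue                      # glob matched nothing
    [ "$(cat "$f")" = "1" ] || basename "$(dirname "$f")"
  done
  return 0
}

# Self-contained demo with made-up interface names.
demo=$(mktemp -d)
mkdir -p "$demo/cali0000000000a" "$demo/cali0000000000b" "$demo/eth0"
echo 1 > "$demo/cali0000000000a/proxy_arp"
echo 0 > "$demo/cali0000000000b/proxy_arp"
echo 0 > "$demo/eth0/proxy_arp"
ifaces_without_proxy_arp "$demo"   # lists only the cali* entry set to 0
rm -rf "$demo"
```

Empty output means that every cali* interface found has proxy ARP enabled.
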
  4. The packet reaches the root namespace of the ingress node, where another L3 lookup takes place:
$ docker exec k8s-guide-worker ip route get fibmatch
via dev eth0 proto bird 
  5. The packet is sent to the target node where another FIB lookup is performed:
$ docker exec k8s-guide-control-plane ip route get fibmatch
dev cali0ec6986a945 scope link

The target IP is reachable over the veth link so ARP is used to determine the destination MAC address:

docker exec k8s-guide-control-plane ip neigh show dev cali0ec6986a945 lladdr de:85:25:60:86:5b STALE
  6. Finally, the packet gets delivered to the eth0 interface of the target pod:
kubectl exec net-tshoot-6wszq -- ip -br addr show dev eth0
[email protected]         UP    fe80::dc85:25ff:fe60:865b/64 

SNAT functionality

SNAT for traffic egressing the cluster is implemented in two stages:

  1. cali-POSTROUTING chain is inserted at the top of the POSTROUTING chain.

  2. Inside that chain, cali-nat-outgoing masquerades all traffic whose source matches the cali40masq-ipam-pools ipset and whose destination is outside the cali40all-ipam-pools ipset.

iptables -t nat -vnL
Chain POSTROUTING (policy ACCEPT 5315 packets, 319K bytes)
 pkts bytes target     prot opt in     out     source               destination         
 7844  529K cali-POSTROUTING  all  --  *      *              /* cali:O3lYWMrLQYEMJtB5 */ 
Chain cali-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination         
 7844  529K cali-fip-snat  all  --  *      *              /* cali:Z-c7XtVd2Bq7s_hA */
 7844  529K cali-nat-outgoing  all  --  *      *              /* cali:nYKhEzDlr11Jccal */
Chain cali-nat-outgoing (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    1    84 MASQUERADE  all  --  *      *              /* cali:flqWnvo8yq4ULQLa */ match-set cali40masq-ipam-pools src ! match-set cali40all-ipam-pools dst random-fully

Calico configures all IPAM pools as ipsets for more efficient matching within iptables. These pools can be viewed on each individual node:

$ docker exec k8s-guide-control-plane ipset -L cali40masq-ipam-pools
Name: cali40masq-ipam-pools
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 512
References: 1
Number of entries: 1
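
The match logic of the MASQUERADE rule above (source inside the masquerade pools, destination outside all pools) can be modelled in a few lines of bash. This is a sketch for reasoning about the rule and not how ipset or iptables evaluate it internally; the pool contents and addresses are hypothetical stand-ins, since the real set members are not shown here.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<<"$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeeds if IP $1 falls inside CIDR $2.
in_cidr() {
  local ip net len mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  len="${2#*/}"
  mask=$(( len == 0 ? 0 : (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Succeeds if IP $1 is inside any of the CIDRs that follow it.
in_any() {
  local ip="$1" cidr
  shift
  for cidr in "$@"; do
    in_cidr "$ip" "$cidr" && return 0
  done
  return 1
}

MASQ_POOLS=(10.244.0.0/16)   # hypothetical stand-in for cali40masq-ipam-pools
ALL_POOLS=(10.244.0.0/16)    # hypothetical stand-in for cali40all-ipam-pools

# Mirrors the rule: source in masq pools AND destination not in all pools.
needs_snat() {
  in_any "$1" "${MASQ_POOLS[@]}" && ! in_any "$2" "${ALL_POOLS[@]}"
}

needs_snat 10.244.1.5 8.8.8.8    && echo "pod -> external: SNAT"
needs_snat 10.244.1.5 10.244.2.7 || echo "pod -> pod: no SNAT"
needs_snat 172.18.0.2 8.8.8.8    || echo "node -> external: not matched by this rule"
```

The second case is the important one: Pod-to-Pod traffic keeps its original source IP because the destination is itself inside an IPAM pool.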

Caveats and Gotchas

  • Calico supports GoBGP-based routing, but only as an experimental feature.
  • BGP configs are generated from templates based on the contents of the Calico datastore. This makes the customization of the generated BGP config very problematic.