Feb 14, 2019 — https://sf.gl/1680

Solving Kubernetes DNS issues on systemd servers

For a while, I had been seeing messages like these in the KubeDNS logs of a Kubernetes cluster. At the same time, I was experiencing various DNS issues from the services running on the cluster.

dnsmasq[20]: Maximum number of concurrent DNS queries reached (max: 150)

At first, I tried to solve this by scaling up the number of KubeDNS pods:

kubectl scale deploy kube-dns --replicas=20 -nkube-system

This didn't change the number of errors, no matter how high I set the replica count.

systemd interferes with resolv.conf

The issue was caused by systemd shipping with its own DNS server, systemd-resolved, which interferes with the way KubeDNS expects name resolution on your server to work.

On a default Ubuntu installation (18.04 or higher, I believe), systemd-resolved's internal stub resolver (127.0.0.53) is the only name server added to the /etc/resolv.conf file; queries sent to that address are forwarded by systemd-resolved to the real upstream name servers.
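On an affected node, the file typically contains little more than the stub address, for example (the exact options line may differ between releases):

nameserver 127.0.0.53
options edns0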

However, KubeDNS depends on the node's resolv.conf file to resolve names outside of Kubernetes.

This causes KubeDNS to go into a loop, querying itself for every upstream request: 127.0.0.53 is a loopback IP, so inside the KubeDNS pod it points at the pod itself rather than at systemd-resolved on the host.
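You can check whether a node is affected from a shell on the node itself; on Ubuntu 18.04, /etc/resolv.conf is usually a symlink into systemd's runtime directory:

readlink -f /etc/resolv.conf

If this prints /run/systemd/resolve/stub-resolv.conf and the file lists 127.0.0.53 as its only nameserver, the node is affected.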

We only saw the problem intermittently because some machines had been configured with additional name servers besides systemd's local stub in resolv.conf. That mitigated the issue most of the time, so the error condition was only triggered sporadically.

Solution: Use an alternate resolv.conf file

Luckily, systemd maintains an alternate resolv.conf file, /run/systemd/resolve/resolv.conf, that contains the actual upstream name servers and doesn't reference localhost. This file is very useful for scenarios like these, where you do not want to go through systemd's DNS server.
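Unlike the stub file, /run/systemd/resolve/resolv.conf lists the upstream servers systemd-resolved learned from DHCP or your static network configuration. It looks something like this (the addresses below are placeholders; yours will differ):

nameserver 192.0.2.1
nameserver 192.0.2.2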

To use it with Kubernetes, specify the alternate resolv.conf path via the
--resolv-conf argument in your kubelet systemd service configuration.

Apply these steps on all Kubernetes nodes:

  1. As root, edit /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
    • If KUBELET_EXTRA_ARGS does not appear in the file, add the following line:
      Environment="KUBELET_EXTRA_ARGS=--resolv-conf=/run/systemd/resolve/resolv.conf"
    • Otherwise, add
      --resolv-conf=/run/systemd/resolve/resolv.conf

      to the end of the existing KUBELET_EXTRA_ARGS value. (See the example drop-in after this list.)

  2. Execute systemctl daemon-reload to apply the changes to the service configuration.
  3. Restart kubelet to apply the changes: service kubelet restart
  4. Finally, remove or restart all the KubeDNS pods on your system with this oneliner:
    • kubectl get po -nkube-system | grep 'kube-dns' | cut -d " " -f1 | xargs -n1 -P1 kubectl delete po -nkube-system

      (Fair warning: this restarts all kube-dns pods, but only 1 at a time. It shouldn't cause disruptions to your cluster, but use with care.)
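For reference, the relevant part of 10-kubeadm.conf might look like this after the edit (a sketch; the rest of the file depends on how the node was provisioned, and kubeadm's ExecStart line references $KUBELET_EXTRA_ARGS):

[Service]
Environment="KUBELET_EXTRA_ARGS=--resolv-conf=/run/systemd/resolve/resolv.conf"

After the restart, you can verify that the flag was picked up by checking the kubelet command line, e.g. with ps aux | grep kubelet.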

Note: This was fixed in Kubernetes 1.11, which automatically detects this failure scenario. If you use 1.11 or above, you should not need to use this fix.

