Introduction
I am an avid user of Tailscale, a mesh VPN based on WireGuard that makes secure, remote access to your homelab services incredibly easy. One of my favorite features is MagicDNS, which essentially allows you to connect directly to any of your devices over Tailscale via their hostnames rather than IP addresses. So, if you have an nginx web server running on port 8080 in Tailscale, you can reach it at http://nginx:8080 rather than having to use its Tailscale IP address – something like http://100.40.30.20:8080
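For a concrete sense of what that buys you (using the same example names from above), the two requests below should hit the same service once MagicDNS is working:
# The first request is resolved via MagicDNS; the second uses the node's
# Tailscale IP directly and involves no DNS at all.
curl http://nginx:8080
curl http://100.40.30.20:8080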
I run, quite literally, almost everything in my Homelab over Tailscale. I go into more detail about how I do this in this GitHub repo.
Recently, quite literally, everything in my Homelab stopped working.
Guess what?
It was DNS.
It’s always DNS.
What Happened?
One day, sometime last month, I noticed that my Ubuntu VMs were running a severely outdated kernel – version 5.15. I’m not sure why this was, as some of the installs were only a couple months old and Linux kernel version 6.0 was released years ago now. I think I’m using an outdated cloud-init image for them. Regardless, I updated the kernel on one of my machines and noticed that, suddenly, all of the services I was running in Docker (with Tailscale ‘sidecar’ containers) on that machine stopped working. I noticed the following in the logs of their Tailscale containers:
2024/12/26 12:07:20 Linux kernel version: 6.8.0-51-generic
2024/12/26 12:07:20 is CONFIG_TUN enabled in your kernel? `modprobe tun` failed with: modprobe: can't change directory to '/lib/modules': No such file or directory
2024/12/26 12:07:20 wgengine.NewUserspaceEngine(tun "tailscale0") error: tstun.New("tailscale0"): operation not permitted
2024/12/26 12:07:20 flushing log.
2024/12/26 12:07:20 logger closing down
2024/12/26 12:07:20 getLocalBackend error: createEngine: tstun.New("tailscale0"): operation not permitted
boot: 2024/12/26 12:08:20 Timed out waiting for tailscaled socket
So it appeared that, since the kernel update, the Tailscale containers could no longer manage kernel-level networking (creating the tailscale0 TUN device fails with “operation not permitted”). The curious thing is that Tailscale running on the host continued to work just fine.
At the time, my Ansible roles would provision Tailscale containers with userspace networking set to false. This means (to my understanding) that the container modifies the kernel-level routing table in the same way that a VPN running on the host would. So, I swapped that default over to true and voilà! Everything started working again.
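For reference, here's a minimal sketch of what one of those sidecars looks like with the toggle flipped. The container name, auth key, and volume are placeholders rather than my actual Ansible output, but the TS_* environment variables are the ones the official tailscale/tailscale image documents:
# Tailscale sidecar with userspace networking enabled (TS_USERSPACE=true).
# The auth key, container name, and volume below are placeholders.
docker run -d --name tailscale-nginx \
  -e TS_AUTHKEY=tskey-auth-XXXXXXXXXXXX \
  -e TS_HOSTNAME=nginx \
  -e TS_USERSPACE=true \
  -e TS_STATE_DIR=/var/lib/tailscale \
  -v tailscale-nginx-state:/var/lib/tailscale \
  tailscale/tailscale:latest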
Everything continued to work in userspace networking mode for several weeks. Then, one morning, I ran OS updates and a docker system prune, and suddenly I could not connect to any of my homelab services.
“This is why you should never do updates… if it ain’t broke don’t fix it!” whispered the little devil on my shoulder as my brow started to sweat.
I quickly determined that the outage was caused by the Tailscale containers’ inability to communicate with each other over MagicDNS. Here are a few reasons why that’s a problem:
- Services talk to their databases over MagicDNS
- Traefik-KOP talks to redis over MagicDNS
- Authentication redirects are sent to Authentik over MagicDNS
- Notifications are sent to NTFY via MagicDNS
- My heart medication is sent over MagicDNS
Okay that last one was an exaggeration; I’m not even on heart medication (yet). But you get the point. When DNS breaks, everything breaks.
The “Problem”
The silver lining about this whole debacle was that I learned more about how DNS works in Linux than I ever signed up for. Here’s what I concluded as the ‘cause’ of the issues:
Inside a Tailscale container, if Tailscale’s MagicDNS server (100.100.100.100) is not present as a nameserver entry in /etc/resolv.conf, then MagicDNS doesn’t work.
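A quick way to confirm that from the host (the container name here is hypothetical):
# If this prints nothing, MagicDNS lookups from inside that container will fail.
docker exec tailscale-nginx grep 100.100.100.100 /etc/resolv.conf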
Now, according to Docker’s documentation, a container attached to the default bridge network (which my Tailscale containers are) will copy the host’s /etc/resolv.conf file into the container. Meaning that whatever the host uses for DNS, the container will use that too.
My Tailscale containers, however, appear to be copying /run/systemd/resolve/resolv.conf rather than /etc/resolv.conf. Here’s an example of /etc/resolv.conf from inside one of my working Tailscale containers:
# Generated by Docker Engine.
# This file can be edited; Docker Engine will not make further changes once it
# has been modified.
nameserver 100.100.100.100
nameserver 192.168.0.50
search my-tailnet.ts.net
# Based on host file: '/run/systemd/resolve/resolv.conf' (legacy)
# Overrides: []
Now, /run/systemd/resolve/resolv.conf contains information about all known DNS servers across the system, with one major caveat: it is not aware of per-interface DNS settings and therefore only contains system-wide DNS settings. See this link to the Arch wiki for more info.
Why is this a problem? Well, because on the host, Tailscale’s MagicDNS server is configured via a per-interface setting. This sets 100.100.100.100 as a resolver for only the tailscale0 interface and resolves queries for the my-tailnet.ts.net domain. Here is an excerpt of the output of resolvectl status on one of my hosts:
Link 3 (tailscale0)
Current Scopes: DNS
Protocols: -DefaultRoute -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 100.100.100.100
DNS Servers: 100.100.100.100
DNS Domain: my-tailnet.ts.net ~0.e.1.a.c.5.1.1.a.7.d.f.ip6.arpa ~100.100.in-addr.arpa ~101.100.in-addr.arpa ~102.100.in-addr.arpa
~103.100.in-addr.arpa ~104.100.in-addr.arpa ~105.100.in-addr.arpa ~106.100.in-addr.arpa ~107.100.in-addr.arpa ~108.100.in-addr.arpa
~109.100.in-addr.arpa ~110.100.in-addr.arpa ~111.100.in-addr.arpa ~112.100.in-addr.arpa ~113.100.in-addr.arpa ~114.100.in-addr.arpa
~115.100.in-addr.arpa ~116.100.in-addr.arpa ~117.100.in-addr.arpa ~118.100.in-addr.arpa ~119.100.in-addr.arpa ~120.100.in-addr.arpa
~121.100.in-addr.arpa ~122.100.in-addr.arpa ~123.100.in-addr.arpa ~124.100.in-addr.arpa ~125.100.in-addr.arpa ~126.100.in-addr.arpa
~127.100.in-addr.arpa ~64.100.in-addr.arpa ~65.100.in-addr.arpa ~66.100.in-addr.arpa ~67.100.in-addr.arpa ~68.100.in-addr.arpa
~69.100.in-addr.arpa ~70.100.in-addr.arpa ~71.100.in-addr.arpa ~72.100.in-addr.arpa ~73.100.in-addr.arpa ~74.100.in-addr.arpa
~75.100.in-addr.arpa ~76.100.in-addr.arpa ~77.100.in-addr.arpa ~78.100.in-addr.arpa ~79.100.in-addr.arpa ~80.100.in-addr.arpa
~81.100.in-addr.arpa ~82.100.in-addr.arpa ~83.100.in-addr.arpa ~84.100.in-addr.arpa ~85.100.in-addr.arpa ~86.100.in-addr.arpa
~87.100.in-addr.arpa ~88.100.in-addr.arpa ~89.100.in-addr.arpa ~90.100.in-addr.arpa ~91.100.in-addr.arpa ~92.100.in-addr.arpa
~93.100.in-addr.arpa ~94.100.in-addr.arpa ~95.100.in-addr.arpa ~96.100.in-addr.arpa ~97.100.in-addr.arpa ~98.100.in-addr.arpa
~99.100.in-addr.arpa ~ts.net
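For contrast, the host’s /run/systemd/resolve/resolv.conf at the time looked roughly like the sketch below (illustrative, not a verbatim capture): the system-wide LAN resolver is there, but Tailscale’s per-interface 100.100.100.100 is not.
# Illustrative sketch of the host's /run/systemd/resolve/resolv.conf before the fix:
# only the system-wide nameserver is listed; 100.100.100.100 is absent.
nameserver 192.168.0.50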
Even still, if the Tailscale container copies the host’s /run/systemd/resolve/resolv.conf file into its own /etc/resolv.conf, then, according to their documentation, it should either interoperate with a DNS manager (like resolvectl, which isn’t present inside the container) or it will overwrite /etc/resolv.conf. So, the container should be adding 100.100.100.100 as a nameserver entry into its /etc/resolv.conf file, but it isn’t. It’s dependent on the host already having that entry in /run/systemd/resolve/resolv.conf.
Well, technically it is overwriting it… The first thing I tried was adding the nameserver entry to /etc/resolv.conf manually by execing into the container. This works temporarily but the container quickly overwrites the file back to its original state.
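That attempt looked something like this (container name hypothetical); it works until the container rewrites the file:
# A temporary hand-edit; the container soon reverts /etc/resolv.conf, so this doesn't stick.
docker exec tailscale-nginx sh -c 'echo "nameserver 100.100.100.100" >> /etc/resolv.conf'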
The “Solution”
This is where the “learning way more about DNS in Linux than I signed up for” part of the story comes in. How do I get 100.100.100.100 added to the host’s /run/systemd/resolve/resolv.conf?
Now, describing how DNS works on a Linux system deserves its own dedicated post, so I’ll spare the details. Essentially, nameservers are either configured explicitly in /etc/resolv.conf, or the stub-listener is used.
If the stub-listener is used, there will be a local address – either 127.0.0.53 or 127.0.0.54 – configured as the only nameserver in /etc/resolv.conf. This stub-listener acts like an API, directing DNS requests to your configured nameservers which are defined elsewhere.
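On a stock Ubuntu host, for example, that stub setup usually looks something like this in /etc/resolv.conf:
# Typical /etc/resolv.conf when systemd-resolved's stub-listener is in use:
# every query goes to the local stub at 127.0.0.53, which forwards it to the
# real nameservers configured elsewhere.
nameserver 127.0.0.53
options edns0 trust-ad
search .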
Now, trying to figure out where those nameservers are configured requires looking in (what feels like) about 50 different places. There are a number of config files that nameservers could be defined in, and a number of software utilities that could be managing them (a DHCP client, for example). In my host’s case, they are defined via the resolvectl utility. Chances are there’s a comment in your /etc/resolv.conf that will give you a hint. In my case:
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
Running the aforementioned command will show that 100.100.100.100 is indeed defined as a nameserver for the tailscale0
interface, as we saw previously.
But that means that, to add 100.100.100.100 as a nameserver entry to /run/systemd/resolve/resolv.conf, I’ll need to configure that nameserver globally somehow. How do we do that?
Well, here’s where I’m honestly still a little confused. I found that running the following command made the DNS entries appear in /run/systemd/resolve/resolv.conf:
resolvectl dns eth0 100.100.100.100 192.168.0.50
Breaking the command down:
- resolvectl is invoking the resolvectl utility
- dns is making a dns configuration
- eth0 is making the change for the eth0 interface… sounds like a per-interface DNS setting right?
- 100.100.100.100 192.168.0.50 is adding two nameserver addresses, the first being Tailscale’s MagicDNS server.
Now, as far as I can tell, this is a per-interface setting. But when I do this, it adds 100.100.100.100 to /run/systemd/resolve/resolv.conf
on the host and suddenly the containers are all happy again.
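To double-check it after the fact, a few quick commands (the container name is hypothetical, and containers on the default bridge pick up changes to the host file when they restart):
# Show the DNS servers now assigned to eth0.
resolvectl dns eth0
# Confirm 100.100.100.100 made it into the file Docker copies into containers.
grep 100.100.100.100 /run/systemd/resolve/resolv.conf
# Restart the sidecar so it picks up the updated file.
docker restart tailscale-nginx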
Annoyingly, this now adds 100.100.100.100 as a nameserver entry on my primary interface. This means that if Tailscale on that host ever goes down, the host will still attempt to send queries to 100.100.100.100, which is only accessible via Tailscale.
Additionally, I don’t currently use any internal DNS entries on my PiHole (the second nameserver – 192.168.0.50 in this example), but if I did, I suspected this configuration could cause issues. I tested this by adding the following internal DNS entry to my PiHole:
test.resolve.internal = 192.168.0.53
Then I ran nslookup on one of the hosts:
[josh@vm-fedora ~]$ nslookup test.resolve.internal
Server: 127.0.0.53
Address: 127.0.0.53#53
Non-authoritative answer:
Name: test.resolve.internal
Address: 192.168.0.53
So it appears it isn’t an issue… at least for now.
Conclusion
This feels like a band-aid workaround to me and I still suspect that there is a bug with Tailscale’s container build, as the container shouldn’t be dependent on the host’s awareness of the 100.100.100.100 nameserver. A Tailscale container should work even without Tailscale installed on the host.
I’ve submitted a bug report – https://github.com/tailscale/tailscale/issues/14467 – which will hopefully either prompt some discourse around it or lead someone to point out what I’m doing wrong. If you’re reading this and you know what I’m doing wrong, feel free to reach out to me on social media! I’d love to exchange some nerd talk.