Creating a Kubernetes Cluster With Talos Linux on Tailscale

Josh Noll | Feb 21, 2025
Note: My working Homelab-as-Code GitHub repository can be found at

https://github.com/joshrnoll/homelab-as-code

Introduction

In my previous post I outlined my plan to build a Homelab-as-Code architecture. This post will document step #1 of that plan: deploying my Kubernetes clusters.

I’ll be using Talos Linux for this – a minimal distribution with just enough OS to run Kubernetes. Talos is so lightweight that it doesn’t even have a shell. You can only interact with a Talos machine through the Talos API using talosctl.

And, of course, I’ll be putting my clusters on Tailscale!

Getting the Talos ISO

The first thing we need to do is get an ISO file for Talos, which can be downloaded from their GitHub releases page. The Talos documentation will have a link to it here.

At the time of this writing, the link is for Talos version 1.9. You should see a notification at the top of the page prompting you with a link to the latest version if this one is out of date.
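
If you'd rather grab the ISO from the command line, something like the following should work (this assumes the bare-metal amd64 image and version v1.9.4, the version used later in this post; adjust for your setup):

# Download the bare-metal amd64 ISO for Talos v1.9.4 from the GitHub releases page
curl -LO https://github.com/siderolabs/talos/releases/download/v1.9.4/metal-amd64.iso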

I’m using Proxmox, so I need to upload the ISO to my node’s storage so that I can create VMs with it. Click on the node –> local (or whatever storage you’re using) –> ISO Images –> Upload.

I have a three-node cluster and don’t currently have a shared storage medium that holds ISOs, so I need to do this three times… YMMV.
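
If you prefer to script the upload, copying the ISO into each node's local ISO directory over SSH is one option. This is only a sketch — the node hostnames are examples, and the path assumes Proxmox's default 'local' storage:

# Copy the ISO to each node's default 'local' ISO storage (hostnames are examples)
for node in pve1 pve2 pve3; do
  scp metal-amd64.iso root@"$node":/var/lib/vz/template/iso/
done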

Creating a VM Template and Deploying VMs

Once the ISO is uploaded we need to create a virtual machine template. Right click on the desired node and select create VM. Enter the necessary information as shown below. Modify everything for your environment, of course.

After the VM is created – right click on it and select convert to template.

Be aware of the hardware requirements for Talos. You’ll need a minimum of 2GB of memory and 2 cores for a control plane node and 1GB of memory and 1 core for a worker node.

I imagine just about any environment can handle that!

Note: I forgot to grab a screenshot of it, but you’ll want to ensure that the QEMU Guest Agent is enabled during the VM creation wizard. You can enable it later if you forget, but it will require rebooting the VM from the Proxmox Web UI to take effect.
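
For reference, a rough CLI equivalent of the creation wizard might look like the following. The VM ID, bridge, and storage names are examples — adjust everything for your environment:

# Create a VM that meets the control plane minimums, with the QEMU guest agent enabled
qm create 9000 --name talos-template --memory 2048 --cores 2 \
  --net0 virtio,bridge=vmbr0 --scsihw virtio-scsi-pci --scsi0 local-lvm:32 \
  --cdrom local:iso/metal-amd64.iso --agent enabled=1

# Convert the VM to a template
qm template 9000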

Configure Talos

Once you have a template created, you’ll want to clone it to create your first Talos node. I’ll be creating two clusters:

  • One single node cluster for staging/testing new services and configs
  • One ‘production’ cluster with 3 control plane nodes and 3 worker nodes.

I recommend this article for understanding how many nodes to put in your control plane. It’s not as simple as “more is better” for availability. The TL;DR is:

  • Control planes with an even number of nodes are less highly available than those with an odd number. etcd needs a quorum of floor(n/2) + 1 members, so a four-node control plane tolerates the loss of only one node, the same as a three-node control plane, while adding another machine that can fail.
  • The more nodes you add, the higher the latency when interacting with the Kubernetes API, since each write has to be replicated to more etcd members.

With the nodes booted up, we’re ready to generate their configuration files.

Get Schematic ID From Talos Factory

Since I plan on using Tailscale and, eventually, Longhorn for persistent storage, there is some software that I’ll need installed on my nodes. But since Talos doesn’t even have a shell, how do we install software? That’s where the Talos image factory comes in.

Go to https://factory.talos.dev to create a custom image schematic. Go through the wizard, selecting your hardware type, Talos version and machine architecture.

When you get to system extensions, select the following:

  • siderolabs/tailscale – for connecting to Tailscale
  • siderolabs/iscsi-tools – required for Longhorn storage
  • siderolabs/util-linux-tools – also required for Longhorn
  • siderolabs/qemu-guest-agent – so that the Proxmox dashboard can display information about the VM, such as hostname and IP.

Skip the extra kernel arguments, and you should be presented with a schematic ID and a config that looks something like this:

customization:
    systemExtensions:
        officialExtensions:
            - siderolabs/iscsi-tools
            - siderolabs/qemu-guest-agent
            - siderolabs/tailscale
            - siderolabs/util-linux-tools

We don’t actually need to save this config, but if you do, it will make it easier to retrieve your schematic ID later if you ever need it again. If you save the above config to a file, for example schematic.yaml, you can run the following curl command to retrieve the schematic ID:

curl -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics
{"id":"077514df2c1b6436460bc60faabc976687b16193b8a1290fda4366c69024fec2"}

In my situation, all I really need to do is construct an install image URL from this schematic ID. It will look like this:

factory.talos.dev/installer/{{ schematic ID }}:{{ talos version }}

for example:

factory.talos.dev/installer/077514df2c1b6436460bc60faabc976687b16193b8a1290fda4366c69024fec2:v1.9.4

Keep this URL handy; we’ll need it when we generate the configs with talosctl.

Generate Talos Configs

Once you have a Talos node booted up, in its console will be a screen that looks like this:

If you try to log in or interact with the VM via command line in any way, you won’t be able to. You can only interact with Talos via an API.

I recommend leaving this console up in another window while running the remaining steps so that you can see the changes made in real time.

Install talosctl

First, we’ll need to install the talosctl command-line utility which will allow us to interact with the Talos API. The easiest way to install it is with homebrew:

brew install siderolabs/tap/talosctl

You can view the Talos documentation for alternative installation methods.
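
One such alternative from the Talos docs is their install script (review the script first if piping it straight into a shell makes you uneasy):

# Install talosctl using the script published by the Talos project
curl -sL https://talos.dev/install | sh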

Generate Configs

Next, we need to generate the config files from the endpoint URL of our first Talos node. We’ll also pass an additional argument telling Talos to use the custom install image URL we generated from the Talos factory earlier.

From the console above you can see the node’s IP. Generally, you would use this IP when constructing the talosctl gen config command. However, since we are going to immediately apply a machine patch that changes this IP to a static one (as well as installing Tailscale, which will give us yet another IP/hostname to connect to), we’ll generate the config against the endpoint the node will have after patching.

The command will look like this:

talosctl gen config <cluster-name> https://<endpoint-ip-or-hostname>:6443 --install-image=factory.talos.dev/installer/58e4656b31857557c8bad0585e1b2ee53f7446f4218f3fae486aa26d4f6470d8:v1.9.4

Replace cluster-name with a name of your choice, such as “production” or “staging”.

The endpoint will depend on the cluster, which I will cover next.

Note: This command needs to be run per cluster. So, in my case, I will run it once for staging and once for production. This will give me two sets of config files.

The command should generate three files into your current working directory:

  • controlplane.yaml
  • worker.yaml
  • talosconfig

Staging Cluster Endpoint

For my staging cluster, I will use the node’s hostname (in my case, talos-staging). Tailscale’s MagicDNS will resolve this, and I will be able to reach this endpoint from anywhere over Tailscale. So, for my staging cluster, my gen config command will look like this:

talosctl gen config staging https://talos-staging:6443 --install-image=factory.talos.dev/installer/58e4656b31857557c8bad0585e1b2ee53f7446f4218f3fae486aa26d4f6470d8:v1.9.4
Production Cluster Endpoint

For my production cluster, this is a bit trickier, since I will have multiple control plane nodes. I don’t want to just pick one of the control plane nodes and use it as the endpoint, since that wouldn’t be a highly available control plane. If that node goes down, I lose access to the Kubernetes API (i.e., kubectl stops working).

Talos provides three examples for an HA control plane endpoint:

  • Using a dedicated load balancer: I could potentially use Traefik for this, but that becomes its own single point of failure
  • Creating multiple DNS records with the same hostname: Since I can’t manually create entries in Tailscale’s MagicDNS, this option is out. I don’t want to use a self-hosted DNS solution for this.
  • Layer 2 Virtual IP: This option is the most attractive, but it presents a challenge. I want my control plane available over Tailscale, but this option requires a local IP to be the endpoint rather than a Tailscale IP.

For now, it would seem that the virtual IP (VIP) is the best option. It won’t be available over Tailscale natively, but I can set up a Tailscale subnet router to make that local IP available whenever I’m connected to Tailscale, effectively achieving my goal anyway.
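
Setting up the subnet router itself is outside the scope of this post, but the core of it is advertising the control plane's subnet from a machine on that LAN. Using my 10.0.30.0/24 subnet as an example, it looks roughly like this:

# On a Linux machine in the same LAN (with IP forwarding enabled),
# advertise the subnet, then approve the route in the Tailscale admin console
tailscale up --advertise-routes=10.0.30.0/24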

So, in an ideal world I would generate my config using this VIP. However, since the VIP won’t exist until after Kubernetes has been bootstrapped, this will cause a problem when we go to bootstrap Kubernetes later; Kubernetes can’t bootstrap off an endpoint that doesn’t exist yet.

So, we’ll have to generate configs using the first control plane node’s IP, at least initially. Then, after Kubernetes has been bootstrapped, we can change the endpoint in the controlplane.yaml and worker.yaml files back to the VIP.

So, my gen config command for my production cluster will look like this (where 10.0.30.80 is the static IP of my first control plane node):

talosctl gen config production https://10.0.30.80:6443 --install-image=factory.talos.dev/installer/58e4656b31857557c8bad0585e1b2ee53f7446f4218f3fae486aa26d4f6470d8:v1.9.4

Create Patch Files

Next, we’ll create our machine-specific patch files containing the node’s IP and hostname. We’ll also create a patch file for our Tailscale configuration.

Machine-Specific Patch Files

In a file called hostname.patch.yaml (where hostname is the desired machine’s hostname), we’ll add the following:

machine:
  # Static IP Configuration
  network:
    hostname: talos-staging
    interfaces:
      - deviceSelector:
          physical: true 
        addresses:
          - 10.0.30.86/24
        routes:
          - network: 0.0.0.0/0 # The route's network.
            gateway: 10.0.30.1 # The route's gateway.
        dhcp: false
  # Mounts Required for Longhorn on Worker Nodes
  # Remove This Section for Control Plane Nodes!
  # (Leave for Single Node Cluster)
  kubelet:
    extraMounts:
      - destination: /var/lib/longhorn
        type: bind
        source: /var/lib/longhorn
        options:
          - bind
          - rshared
          - rw

Modify the hostname and IP settings for your use case. The extraMounts section should only be necessary on worker nodes. The exception is a single-node cluster, because that node runs the control plane but also runs workloads and requires Longhorn storage.

Note the deviceSelector section. This is applying the configuration to all physical interfaces. You may need to adjust this to your use case. In my case, my nodes only have one interface, so this saves me the headache of trying to figure out what the interface name is.
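
If your nodes have more than one NIC, the deviceSelector can also match a specific interface, for example by MAC address. A minimal sketch (the MAC prefix below is a placeholder):

machine:
  network:
    interfaces:
      - deviceSelector:
          hardwareAddr: "bc:24:11:*" # placeholder MAC prefix -- matches one specific NIC instead of all physical interfaces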

Tailscale Patch File

Before creating the Tailscale patch file, you’ll need to generate an Authkey. To do this, go to your Tailscale admin console, then Settings –> Keys (located under personal settings).

  • Set the key to be reusable
  • I recommend leaving the ephemeral setting selected. This will allow a node to be automatically removed from your Tailnet when it goes offline, preventing duplicate machine entries in Tailscale.
  • I also recommend adding a tag. This helps keep things organized and allows for configuring RBAC in the Tailscale ACL. If you don’t see any tags listed, you’ll need to create them first in the Tailnet ACL file.

Note that the Authkey is only good for a maximum of 90 days. This does not mean that your nodes will be removed from the Tailnet after 90 days; it only means that the key can no longer be used to add additional nodes after that point, at which time you would need to generate a new one.

Once you have your Authkey (I recommend saving it in a password manager) we can create the Tailscale patch file.

In a file called tailscale.patch.yaml add the following contents:

apiVersion: v1alpha1
kind: ExtensionServiceConfig
name: tailscale
environment:
  - TS_AUTHKEY=tskey-auth-your-authkey-here

This will configure the Tailscale extension with your authkey, allowing nodes to authenticate into your Tailnet.

Modify Configs for Staging and Production

At this point you should have at least five total files per cluster:

  • hostname.patch.yaml (you will have one of these per node)
  • tailscale.patch.yaml
  • controlplane.yaml
  • worker.yaml
  • talosconfig

Note: These files have secrets in them! Don’t commit these to a public repo without using some kind of secret management tool.

In my case, there are a couple of tweaks I will need to make to the controlplane.yaml files for both the staging and the production clusters.

Staging Modifications

Since my staging cluster is going to be a single-node cluster, I need to add a line to the configuration that allows workloads to be scheduled on control plane nodes.

In controlplane.yaml, find the cluster section and add the following:

cluster:
    allowSchedulingOnControlPlanes: true

This is documented in the Talos documentation here.
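
Alternatively, talosctl gen config accepts a --config-patch flag, so this setting can be baked in at generation time instead of editing controlplane.yaml by hand. A rough sketch (the schematic ID is a placeholder):

talosctl gen config staging https://talos-staging:6443 \
  --install-image=factory.talos.dev/installer/<your-schematic-id>:v1.9.4 \
  --config-patch '[{"op": "add", "path": "/cluster/allowSchedulingOnControlPlanes", "value": true}]'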

Production Modifications

For my production cluster, I will need to add the VIP configuration. Talos offers a built-in feature for control plane nodes to share a virtual IP. The only requirements are that the nodes are on the same layer 2 subnet, and that your selected VIP is also from that same subnet.

The configuration looks like this. Search for the network settings in controlplane.yaml.

network:
  interfaces:
    - deviceSelector:
        physical: true
      dhcp: false # Set to false since machine patch will provide static IP
      vip:
        ip: 10.0.30.25 # <-- ensure this is on the same subnet as your controlplane nodes!

Apply the Configs

We’re finally ready to apply our configs. The command will look like this:

talosctl apply-config --insecure -f controlplane.yaml -n 10.0.30.168 -e 10.0.30.168 --config-patch @talos-staging.patch.yaml --config-patch @tailscale.patch.yaml

Note that the IP we’re using is the one we saw in the machine’s console earlier. This is the IP it received from the DHCP server, and this should be the last time you need to specify it. Going forward, you will use the static IP that you set in the machine’s patch file.

The --insecure flag is only necessary the first time you run this command (for each given node). This is because the node is still in maintenance mode and doesn’t have a configured certificate authority yet.

There are two --config-patch flags specifying the two patch files we created earlier; these can be absolute or relative paths. What this does is apply the general configuration from controlplane.yaml or worker.yaml (which doesn’t contain the machine’s hostname, IP, or Tailscale extension settings) and immediately patch those machine-specific values in on top. This saves us from having to maintain separate, enormous controlplane.yaml and worker.yaml files for each node.

You’ll be running this command against each node, modifying the IP as necessary and swapping out controlplane.yaml for worker.yaml for worker nodes. For now though, run this against your first control plane node only until after you’ve bootstrapped the cluster.

Once you apply the config, you should see the IP and hostname change in the Talos console and the node will likely reboot. Within a few minutes, the node should appear in your Tailscale admin console:
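
If you want to confirm that the system extensions made it onto the node and that the Tailscale service is running, you can query the Talos API directly (using my staging node's static IP as an example):

# List the loaded system extensions and running services on the node
talosctl -n 10.0.30.86 -e 10.0.30.86 --talosconfig=./talosconfig get extensions
talosctl -n 10.0.30.86 -e 10.0.30.86 --talosconfig=./talosconfig services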

Bootstrap the Cluster

With our configs applied, we are ready to bootstrap Kubernetes. This only has to be done once, against the first control plane node!

Run the following command (note that we are now specifying the machine’s static IP, and we are not using the --insecure flag!):

talosctl bootstrap --nodes 10.0.30.86 --endpoints 10.0.30.86 --talosconfig=./talosconfig

If you run into any errors when running this command, ensure you are running it from the directory that contains the talosconfig file. Otherwise, you can modify the --talosconfig flag to specify the relative/absolute path to the file.

You can also export the path to the $TALOSCONFIG environment variable and omit the --talosconfig flag altogether. Example:

export TALOSCONFIG=/path/to/talosconfig
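
You can also set default endpoints and nodes in the talosconfig itself, so the -n and -e flags can be dropped from day-to-day commands:

# Store default endpoint and node in the talosconfig
talosctl --talosconfig=./talosconfig config endpoint 10.0.30.86
talosctl --talosconfig=./talosconfig config node 10.0.30.86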

For my Production Cluster:

At this point, with Kubernetes bootstrapped on my production cluster, I’ll want to change the following line in both controlplane.yaml and worker.yaml to my VIP:

cluster:
    id: <super-secret> # Globally unique identifier for this cluster (base64 encoded random 32 bytes).
    secret: <super-secret> # Shared secret of cluster (base64 encoded random 32 bytes).
    # Provides control plane specific configuration options.
    controlPlane:
        endpoint: https://10.0.30.25:6443 # Change this to the VIP!!
    clusterName: production # Configures the cluster's name.

Then, I’ll need to re-run the commands from earlier to apply this config. At this point we can omit the --insecure flag. Don’t forget the patches!
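
As a minimal sketch, the re-apply for my first production control plane node might look like this (the hostname patch filename is just an example of the node-specific patch created earlier):

# Re-apply the updated controlplane.yaml (now pointing at the VIP) to an already-configured node
talosctl apply-config -f controlplane.yaml -n 10.0.30.80 -e 10.0.30.80 \
  --talosconfig=./talosconfig \
  --config-patch @talos-prod-control-01.patch.yaml --config-patch @tailscale.patch.yaml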

Get Your Kubeconfig File

Finally, we just need to get our kubeconfig from Talos:

talosctl kubeconfig --nodes 10.0.30.86 --endpoints 10.0.30.86 --talosconfig=./talosconfig

Once again, we are using the machine’s static IP now and omitting the --insecure flag.

You should now have access to the cluster through kubectl:

kubectl get nodes
NAME            STATUS   ROLES           AGE   VERSION
talos-staging   Ready    control-plane   16h   v1.32.0

Apply Configs to Additional Nodes

With Kubernetes bootstrapped, the only thing left to do is apply the configuration to each remaining node in the cluster, replacing controlplane.yaml with worker.yaml and swapping in each node’s own patch file as appropriate.

For control plane nodes:

talosctl apply-config --insecure -f controlplane.yaml -n <ip-of-node> -e <ip-of-node> --config-patch @<hostname>.patch.yaml --config-patch @tailscale.patch.yaml

For workers:

talosctl apply-config --insecure -f worker.yaml -n <ip-of-node> -e <ip-of-node> --config-patch @<hostname>.patch.yaml --config-patch @tailscale.patch.yaml

Conclusion

I now have two fully functional Kubernetes clusters to begin my journey into Homelab-as-Code.

I can switch contexts in kubectl with:

kubectl config use-context admin@production # admin@staging for the staging cluster

Production:

kubectl get nodes
NAME                    STATUS   ROLES           AGE   VERSION
talos-prod-control-01   Ready    control-plane   10h   v1.32.0
talos-prod-control-02   Ready    control-plane   10h   v1.32.0
talos-prod-control-03   Ready    control-plane   10h   v1.32.0
talos-prod-worker-01    Ready    <none>          10h   v1.32.0
talos-prod-worker-02    Ready    <none>          10h   v1.32.0
talos-prod-worker-03    Ready    <none>          10h   v1.32.0

Staging:

kubectl get nodes
NAME            STATUS   ROLES           AGE   VERSION
talos-staging   Ready    control-plane   15h   v1.32.0

The next step will be to get Longhorn installed and configured on both clusters for persistent storage. Once that is complete, I’ll need to install and configure Traefik and Tailscale for ingress and service objects. Then I will be ready to start running workloads on Kubernetes!