cluster fu

I am always looking for ways to improve my home setup and one of the things I had been wanting to do for a while was to put together a dedicated homelab.

My desktop computer is a bit of a workhorse, so it has always been used for creating virtual machines (VMs), running OCI containers with Docker or Podman, running LXC containers, and testing various warez.

Recently I acquired four ThinkCentre M900 Tiny devices. After dedicating one to become another ‘production’ system running Proxmox, I decided to use two as my homelab. Now, I could have put Proxmox on these two as well and then created a cluster, but where is the fun in that? I felt like I didn’t know enough about how clustering is done on Linux, so I set out to build my own from scratch. The end goal was to have the ability to create LXC containers and VMs, which I can then use for testing and running OCI containers along with tools such as Kubernetes.

base camp

I get on well with Debian as a server operating system, so I installed the latest version (12 “Bookworm” at time of writing) on my two nodes, which I have named pigley and goatley (IYKYK).

For the cluster I opted to use Pacemaker as the resource manager and Corosync for the cluster communication, with Glusterfs for shared storage.

Now, before everyone writes in saying how two-node clusters are not advised due to quorum issues or split-brain scenarios: I have thought about it.

I didn’t want to use the fourth ThinkCentre in this homelab, as I have other plans for that. Instead I opted to use a spare Raspberry Pi (named krieger) as a Corosync quorum device. This (should) counteract the issues seen in a two-node cluster.

Ideally I would configure krieger as a Glusterfs arbiter device; however, to get the same version of glusterfs-server (10.3 at time of writing) on Raspbian I had to add the testing repo, and unfortunately I couldn’t get the glusterd service to start. The stable repo only offered glusterfs-server version 9.2-1 at the time, which is incompatible with 10.3-5 on pigley and goatley.

I decided to forgo the Glusterfs arbiter; while there is a risk of split-brain, this is only a lab environment.
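For the record, had the versions matched, adding krieger as an arbiter would have looked something like this (a sketch only, using the same volume layout I create below; the brick path on krieger is hypothetical)

gluster volume create lab0 replica 3 arbiter 1 \
    pigley.home.lab:/data/glusterfs/lab/brick0 \
    goatley.home.lab:/data/glusterfs/lab/brick0 \
    krieger.home.lab:/data/glusterfs/lab/arbiter0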

After provisioning pigley and goatley I installed the required packages

apt-get install pcs corosync-qdevice glusterfs-server

According to the documentation it is advisable to disable Pacemaker from automatic startup for now

systemctl disable pacemaker

On krieger I installed the corosync-qnetd package

apt-get install pcs corosync-qnetd

share and share alike

On pigley and goatley I created a partition on the main storage device and formatted it with XFS, created a mount point, and mounted the partition

mkfs.xfs -i size=512 /dev/sda3
mkdir -p /data/glusterfs/lab
mount /dev/sda3 /data/glusterfs/lab

Next I had to ensure pigley and goatley could talk to each other. To make things easy I put the IP addresses in /etc/hosts, then confirmed connectivity using the gluster tool

systemctl start glusterd
systemctl enable glusterd
gluster peer probe pigley
gluster peer status
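For reference, the /etc/hosts entries on both nodes are just each node's address and names, something like this (the same addresses appear in the Corosync config further down)

192.168.1.8    pigley.home.lab pigley
192.168.1.9    goatley.home.lab goatley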

I opted to configure a replica volume, keeping the data on 2 bricks (as that’s all I have)

gluster volume create lab0 replica 2 pigley.home.lab:/data/glusterfs/lab/brick0 \
    goatley.home.lab:/data/glusterfs/lab/brick0
gluster volume start lab0
gluster volume info

The data isn’t accessed directly in the brick directories, so I mounted the Glusterfs volume on a new mountpoint on both systems

mkdir /labfs && \
    mount -t glusterfs <hostname>:/lab0 /labfs

To test the replication was working I created a few empty files on one of the systems

touch /labfs/{a,b,c}

Then checked they existed on the other system

ls -l /labfs

And they did! Win win.

I did experience an issue when adding the /labfs mount to /etc/fstab, as it would try to mount before the glusterd service was running. To work around this I included the noauto and x-systemd.automount options in my /etc/fstab entry

localhost:/lab0 /labfs glusterfs defaults,_netdev,noauto,x-systemd.automount 0 0

start your engine

Now for the Corosync config. On both nodes I set up /etc/corosync/corosync.conf; the relevant parts, with crypto_cipher and crypto_hash changed from their none defaults and a nodelist added, are

cluster_name: lab
crypto_cipher: aes256
crypto_hash: sha1
nodelist {
    node {
        name: pigley
        nodeid: 1
        ring0_addr: 192.168.1.8
    }
    node {
        name: goatley
        nodeid: 2
        ring0_addr: 192.168.1.9
    }
}

On one node I had to generate an authkey using corosync-keygen, then copied it (/etc/corosync/authkey) to the other node. I could then add the authkey to my cluster and restart the cluster services on each node

pcs cluster authkey corosync /etc/corosync/authkey --force
systemctl restart corosync && systemctl restart pacemaker
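For reference, the generate-and-copy step amounts to something like this (the scp destination path is an assumption)

corosync-keygen
scp /etc/corosync/authkey goatley:/etc/corosync/authkey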

The cluster takes a short while to settle, so I monitored it using pcs status. The output below shows everything (except STONITH) is looking good

Cluster name: lab

WARNINGS:
No stonith devices and stonith-enabled is not false

Status of pacemakerd: 'Pacemaker is running' (last updated 2023-10-24 21:40:57 +01:00)
Cluster Summary:
  * Stack: corosync
  * Current DC: pigley (version 2.1.5-a3f44794f94) - partition with quorum
  * Last updated: Tue Mar 26 11:37:06 2024
  * Last change:  Tue Mar 26 11:36:23 2024 by hacluster via crmd on pigley
  * 2 nodes configured
  * 0 resource instances configured

Node List:
  * Online: [ goatley pigley ]

Full List of Resources:
  * No resources

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

STONITH, or “Shoot The Other Node In The Head”, is used for fencing failed cluster nodes. As this is a test lab I am disabling it but may spend some time configuring it in the future

pcs property set stonith-enabled=false

the votes are in

As mentioned, I want to use a third system as a quorum device. This means that it casts deciding votes to protect against split-brain yet isn’t part of the cluster, so it doesn’t have to be capable of running any resources.

While I used an authkey to authenticate pigley and goatley in the cluster, for krieger I had to use password authentication. On pigley I set the hacluster user’s password

passwd hacluster

On krieger I set the same password and started the quorum device

passwd hacluster
pcs qdevice setup model net --enable --start

Back on pigley I then authenticated krieger and specified it as the quorum device

pcs host auth krieger
pcs quorum device add model net host=krieger algorithm=ffsplit

The output of pcs quorum device status shows the QDevice information

Qdevice information
-------------------
Model:                  Net
Node ID:                1
Configured node list:
    0   Node ID = 1
    1   Node ID = 2
Membership node list:   1, 2

Qdevice-net information
----------------------
Cluster name:           lab
QNetd host:             krieger:5403
Algorithm:              Fifty-Fifty split
Tie-breaker:            Node with lowest node ID
State:                  Connected

On krieger the output of pcs qdevice status net shows similar information

QNetd address:                  *:5403
TLS:                            Supported (client certificate required)
Connected clients:              2
Connected clusters:             1
Cluster "lab":
    Algorithm:          Fifty-Fifty split (KAP Tie-breaker)
    Tie-breaker:        Node with lowest node ID
    Node ID 2:
        Client address:         ::ffff:192.168.1.9:52060
        Configured node list:   1, 2
        Membership node list:   1, 2
        Vote:                   ACK (ACK)
    Node ID 1:
        Client address:         ::ffff:192.168.1.8:43106
        Configured node list:   1, 2
        Membership node list:   1, 2
        Vote:                   No change (ACK)

build something

Now that my cluster is up and running I can start creating resources. The first thing I wanted to get running was some VMs.

I installed QEMU and libvirt on pigley and goatley

apt-get install qemu-system-x86 libvirt-daemon-system virtinst

Before creating a VM I made sure the default network was started, and set it to auto start

virsh net-list --all
virsh net-start default
virsh net-autostart default

I uploaded a Debian ISO to pigley then used virt-install to create a VM

virt-install --name testvm \
    --memory 2048 \
    --vcpus=2 \
    --cdrom=/labfs/debian-12.1.0-amd64-netinst.iso \
    --disk path=/labfs/testvm.qcow2,size=20,format=qcow2 \
    --os-variant debian11 \
    --network network=default \
    --graphics=spice \
    --console pty,target_type=serial -v

The command waits until the system installation is completed, so from my workstation I used virt-viewer to connect to the VM and run through the Debian installer

virt-viewer --connect qemu+ssh://pigley/system --wait testvm

Once the installation is complete and the VM has been rebooted I can add it as a resource to the cluster. First the VM (or VirtualDomain in libvirt speak) has to be shut down and the configuration XML saved to a file

virsh shutdown testvm
virsh dumpxml testvm > /labfs/testvm.xml
pcs resource create testvm VirtualDomain \
    config=/labfs/testvm.xml \
    migration_transport=ssh \
    meta \
    allow-migrate=true

To allow the resource to run on any of the cluster nodes the symmetric-cluster option has to be set to true (I am not bothering with specific resource rules at this time). Then I can enable the resource

pcs property set symmetric-cluster=true
pcs resource enable testvm

Watching pcs resource I can see that the VM has started on goatley

  * testvm    (ocf:heartbeat:VirtualDomain):   Started goatley

On goatley I can check that the VM is running with virsh list

 Id   Name       State
--------------------------
 1    testvm     running

To connect from my workstation I can use virt-viewer again

virt-viewer --connect qemu+ssh://goatley/system --wait testvm

Now I can really test the cluster by moving the VM from goatley to pigley with one command from either node

pcs resource move testvm pigley

The VM is automatically shut down and restarted on pigley; now the output of pcs resource shows

  * testvm    (ocf:heartbeat:VirtualDomain):   Started pigley

Successfully clustered!

Most of the VMs I create will probably be accessed remotely via SSH. The VM network on the cluster is not directly accessible from my workstation, so I have to ProxyJump through whichever node is running the VM (this is by design)

ssh -J pigley testvm

Unless I check the resource status I won’t always know which node the VM is on, so I came up with a workaround.

The libvirt network sets up dnsmasq for local name resolution, so by setting the first nameserver on pigley and goatley to 192.168.122.1 (my virtual network) each node can resolve the hostnames of the VirtualDomains running on it. I set this in dhclient.conf

prepend domain-name-servers 192.168.122.1;
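A quick way to check that a node can now resolve its own guests is to query libvirt's dnsmasq directly (assuming the host utility is installed and the VM is running on that node)

host testvm 192.168.122.1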

On my workstation I made use of the tagging capability and Match function in SSH to find which node a VM is on

Match tagged lab exec "ssh pigley 'ping -c 1 -W 1 %h 2>/dev/null'"
  ProxyJump pigley

Match tagged lab
  ProxyJump goatley

When I want to connect to a VM I specify a tag with the -P flag

ssh -P lab root@testvm

My SSH config will then use the first Match to ping that hostname on pigley and if the VM is running on pigley it will succeed, making pigley the proxy. If the ping fails SSH will go to the second Match and use goatley as the proxy.

this cloud is just my computer

Now that my cluster is up and running I want to be able to interact with it remotely.

For personal ethical reasons I opted to use OpenTofu instead of Terraform. OpenTofu’s command tofu is a drop-in replacement; simply swap terraform for tofu. There is no OpenTofu provider for managing Pacemaker cluster resources, but there is one for libvirt, so I started there.

I created an OpenTofu configuration using the provider dmacvicar/libvirt. Unfortunately I had issues trying to get it to connect over SSH.

My OpenTofu configuration for testing looked like this

terraform {
    required_providers {
        libvirt = {
            source = "dmacvicar/libvirt"
            version = "0.7.1"
        }
    }
}

provider "libvirt" {
    uri = "qemu+ssh://pigley/system"
}

resource "libvirt_domain" "testvm" {
    name = "testvm"
}

After initialising (tofu init) I ran tofu plan and got an error

Error: failed to dial libvirt: ssh: handshake failed: ssh: no authorities for hostname: pigley:22

This was remedied by setting the known_hosts_verify option to ignore

...

provider "libvirt" {
    uri = "qemu+ssh://pigley/system?known_hosts_verify=ignore"
}

...

Another tofu plan produced another error

Error: failed to dial libvirt: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

At first I thought it was due to using an SSH agent for my key, so I created a dedicated passphrase-less SSH keypair and specified the file in the uri

...

provider "libvirt" {
    uri = "qemu+ssh://pigley/system?keyfile=/home/pyratebeard/.ssh/homelab_tofu.key&known_hosts_verify=ignore"
}

...

This produced the same error again. After some GitHub issue digging I found mention of setting the sshauth option to privkey. The default is supposedly agent,privkey, but as I found, it wasn’t picking up my agent, even with $SSH_AUTH_SOCK set. I set the option in the uri

...

provider "libvirt" {
    uri = "qemu+ssh://pigley/system?keyfile=/home/pyratebeard/.ssh/homelab_tofu.key&known_hosts_verify=ignore&sshauth=privkey"
}

...

Finally it worked! Until I tried to apply the plan with tofu apply

Error: error while starting the creation of CloudInit's ISO image: exec: "mkisofs": executable file not found in $PATH

This was easily fixed by installing the cdrtools package on my workstation

pacman -S cdrtools

After creating a new VM I want to automatically add it as a cluster resource. For this I chose to use Ansible, and so that I don’t have to run two sets of commands I wanted to use Ansible to deploy my OpenTofu configuration as well. Ansible does have a terraform module but there is not yet one for OpenTofu; a workaround is to create a symlink for the terraform command

sudo ln -s /usr/bin/tofu /usr/bin/terraform

Ansible will never know the difference! The task in the playbook looks like this

- name: "tofu test"
  community.general.terraform:
    project_path: '~src/infra_code/libvirt/debian12/'
    state: present
    force_init: true
  delegate_to: localhost

That ran successfully, so I started expanding my OpenTofu config so it would actually build a VM.

To avoid going through the ISO install every time, I decided to use the Debian cloud images and make use of Cloud-init to apply any changes when a new VM is provisioned. Keeping it similar to a “real” cloud seemed like a good idea.

terraform {
    required_providers {
        libvirt = {
            source = "dmacvicar/libvirt"
            version = "0.7.1"
        }
    }
}

provider "libvirt" {
    uri = "qemu+ssh://pigley/system?keyfile=/home/pyratebeard/.ssh/homelab_tofu.key&known_hosts_verify=ignore&sshauth=privkey"
}

variable "vm_name" {
    type = string
    description = "hostname"
    default = "testvm"
}

variable "vm_vcpus" {
    type = string
    description = "number of vcpus"
    default = 2
}

variable "vm_mem" {
    type = string
    description = "amount of memory"
    default = "2048"
}

variable "vm_size" {
    type = string
    description = "capacity of disk"
    default = "8589934592" # 8G
}

resource "libvirt_volume" "debian12-qcow2" {
    name = "${var.vm_name}.qcow2"
    pool = "labfs"
    source = "http://cloud.debian.org/images/cloud/bookworm/latest/debian-12-genericcloud-amd64.qcow2"
    format = "qcow2"
}

resource "libvirt_volume" "debian12-qcow2" {
    name = "${var.vm_name}.qcow2"
    pool = "labfs"
    format = "qcow2"
    size = var.vm_size
    base_volume_id = libvirt_volume.base-debian12-qcow2.id
}

data "template_file" "user_data" {
    template = "${file("${path.module}/cloud_init.cfg")}"
    vars = {
        hostname = var.vm_name
    }
}

resource "libvirt_cloudinit_disk" "commoninit" {
    name = "commoninit.iso"
    pool = "labfs"
    user_data = "${data.template_file.user_data.rendered}"
}

resource "libvirt_domain" "debian12" {
    name = var.vm_name
    memory = var.vm_mem
    vcpu = var.vm_vcpus

    network_interface {
        network_name = "default"
        wait_for_lease = true
    }

    disk {
        volume_id = "${libvirt_volume.debian12-qcow2.id}"
    }

    cloudinit = "${libvirt_cloudinit_disk.commoninit.id}"

    console {
        type = "pty"
        target_type = "serial"
        target_port = "0"
    }
}

The cloud_init.cfg configuration is very simple at the moment, only setting the hostname (so DNS works) and creating a new user

#cloud-config
ssh_pwauth: false

preserve_hostname: false
hostname: ${hostname}

users:
  - name: pyratebeard
    ssh_authorized_keys:
      - ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICSluiY54h5FlGxnnXqifWPnfvKNIh1/f0xf0yCThdqV
    sudo: ['ALL=(ALL) NOPASSWD:ALL']
    shell: /bin/bash
    groups: wheel

Out of (good?) habit I tested this with tofu before running it with Ansible, and I hit another issue (getting tiring isn’t it?!)

Error: error creating libvirt domain: internal error: process exited while connecting to monitor: ... Could not open '/labfs/debian-12-genericcloud-amd64.qcow2': Permission denied

OpenTofu wasn’t able to write out the qcow2 file due to AppArmor. I attempted to give permission in /etc/apparmor.d/usr.lib.libvirt.virt-aa-helper yet that didn’t seem to work. Instead I set the following line in /etc/libvirt/qemu.conf and restarted libvirtd on pigley and goatley

security_device = "none"

That “fixed” my issue and I successfully built a new VM. Now I could try it with Ansible.

My playbook applies the infrastructure configuration, then creates a cluster resource using the same virsh and pcs commands as I used earlier

- hosts: pigley
  gather_facts: true
  become: true
  pre_tasks:
    - name: "load vars"
      ansible.builtin.include_vars:
        file: vars.yml
      tags: always

  tasks:
    - name: "create vm"
      community.general.terraform:
        project_path: '{{ tofu_project }}'
        state: present
        complex_vars: true
        variables:
          vm_name: "{{ vm_name }}"
          vm_vcpus: "{{ vm_vcpus }}"
          vm_mem: "{{ vm_mem }}"
          vm_size: "{{ vm_size }}"
        force_init: true
      delegate_to: localhost

    - name: "shutdown vm & dumpxml"
      ansible.builtin.shell: |
        virsh shutdown {{ vm_name }} && \
          virsh dumpxml {{ vm_name }} > /labfs/{{ vm_name }}.xml

    - name: "create cluster resource"
      ansible.builtin.shell: |
        pcs resource create {{ vm_name }} VirtualDomain \
        config=/labfs/{{ vm_name }}.xml \
        migration_transport=ssh \
        meta \
        allow-migrate=true

This is not the most elegant way of adding the cluster resource, yet it seems to be the only way of doing it.

The vars.yml file, which is loaded at the beginning, lets me define some options for the VM

vm_os: "debian12" # shortname as used in opentofu dir hierarchy
vm_name: "testvm"
vm_vcpus: "2"
vm_mem: "2048"
vm_size: "8589934592" # 8G

## location of opentofu project on local system
tofu_project: "~src/infra_code/libvirt/{{ vm_os }}/"

When the playbook runs, the VM variables defined in vars.yml override anything configured in my OpenTofu project. This means that once my infrastructure configuration is crafted I only have to edit the vars.yml file and run the playbook. I can add more options to vars.yml as I expand the configuration.
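Running it is then just the usual invocation (the playbook and inventory filenames here are placeholders)

ansible-playbook -i inventory.yml create_vm.yml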

When the playbook completes I can SSH to my new VM without knowing where in the cluster it is running, or even knowing its IP, thanks to the SSH tag and the libvirt DNS

ssh -P lab pyratebeard@testvm

contain your excitement

For creating LXC (not LXD) containers the only (active and working) OpenTofu/Terraform providers I could find were for Proxmox or Incus. I had not been able to look into the new Incus project at the time of writing, so for now I went with Ansible’s LXC container module.

On pigley I installed LXC and the required Python package for use with Ansible

apt-get install lxc python3-lxc

I opted not to configure unprivileged containers at this time; this is a lab after all.

After a quick test I decided not to use the default LXC bridge; instead I configured LXC to use the existing “default” network managed by libvirt. This enables me to use the same SSH tag method for logging in, as the nameserver will resolve the LXC containers as well. The alternative was to configure my own DNS, as I can’t use two separate nameservers for resolution.

In /etc/default/lxc-net I switched the bridge option to false

USE_LXC_BRIDGE="false"

In /etc/lxc/default.conf I set the network link to the libvirt virtual device

lxc.net.0.link = virbr0

Then I restarted the LXC network service

systemctl restart lxc-net

To use my Glusterfs mount for the LXC containers I had to add the lxc.lxcpath configuration to /etc/lxc/lxc.conf

lxc.lxcpath = /labfs/

I tested this by manually creating an LXC container

lxc-create -n testlxc -t debian -- -r bookworm

This resulted in an ACL error

Copying rootfs to /labfs/testlxc/rootfs...rsync: [generator] set_acl: sys_acl_set_file(var/log/journal, ACL_TYPE_ACCESS): Operation not supported (95)

The fix for this was to mount /labfs with the acl option in /etc/fstab

localhost:/lab0 /labfs glusterfs defaults,_netdev,noauto,x-systemd.automount,acl 0 0

With Ansible, creating a new container is straightforward

- name: Create a started container
  community.general.lxc_container:
    name: testlxc
    container_log: true
    template: debian
    state: started
    template_options: --release bookworm

Once it was created I could connect using the same SSH tag I used with the VMs

ssh -P lab root@testlxc

This wouldn’t let me in with the default build configuration; I am expected to set up a new user, as I did with the VM cloud image. There is no way (that I know of) to use Cloud-init for this with Ansible. Thankfully the Ansible LXC module has a container_command option, which allows specified commands to be run inside the container on build.

I adjusted my playbook task to create a new user and included a task to load variables from a file

- hosts: pigley
  gather_facts: true
  become: true
  pre_tasks:
    - name: "load vars"
      ansible.builtin.include_vars:
        file: vars.yml
      tags: always

  tasks:
    - name: Create a started container
      community.general.lxc_container:
        name: "lxc-{{ lxc_name }}"
        container_log: true
        template: "{{ lxc_template }}"
        state: started
        template_options: "--release {{ lxc_release }}"
        container_command: |
          useradd -m -d /home/{{ username }} -s /bin/bash -G sudo {{ username }}
          [ -d /home/{{ username }}/.ssh ] || mkdir /home/{{ username }}/.ssh
          echo {{ ssh_pub_key }} > /home/{{ username }}/.ssh/authorized_keys

With the variables stored in vars.yml

lxc_template: "debian"
lxc_release: "bookworm"
lxc_name: "testlxc"
username: "pyratebeard"
ssh_pub_key: "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICSluiY54h5FlGxnnXqifWPnfvKNIh1/f0xf0yCThdqV"

Now I can log in with SSH

ssh -P lab pyratebeard@testlxc

Using a command similar to the VM resource creation, I tested adding the LXC container to the cluster

pcs resource create testlxc ocf:heartbeat:lxc \
    container=testlxc \
    config=/labfs/testlxc/config \
    op monitor timeout="20s" interval="60s" OCF_CHECK_LEVEL="0"

This created the resource, which I could migrate between hosts as before

pcs resource move testlxc goatley

I updated my playbook to include the resource creation task

- name: "create cluster resource"
  ansible.builtin.shell: |
    pcs resource create {{ lxc_name }} ocf:heartbeat:lxc \
    container={{ lxc_name }}
    config=/labfs/{{ lxc_name }}/config \
    op monitor timeout="20s" interval="60s" OCF_CHECK_LEVEL="0"

And done! My homelab is now ready to use. I am able to quickly create virtual machines and LXC containers, as well as access them via SSH without caring which cluster node they are on.

After testing all of the VM and container creations I made a small change to my SSH config to discard host key fingerprint checking. I set StrictHostKeyChecking to no, which stops the host key fingerprint accept prompt, then set the UserKnownHostsFile to /dev/null so that fingerprints don’t get added to my usual known hosts file.

Match tagged lab exec "ssh pigley 'ping -c 1 -W 1 %h 2>/dev/null'"
  ProxyJump pigley
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null

Match tagged lab
  ProxyJump goatley
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null

It is fun having my own little cloud. Building it has certainly taught me a lot and hopefully will continue to do so as I improve and expand my OpenTofu and Ansible code.

If you are interested, keep an eye on my playbooks and infra_code repositories.