
Introduction

So this is my collection of thoughts, ideas and craziness documented for everyone to see. I have a lot of different interests that I cycle through irregularly. It’s hard to keep track of the bits and pieces so I figure I should write some of this down.

I don’t like blogging; it feels too much like trying to keep a journal or a diary (not that there is anything wrong with that). Instead I have specific, task-focused documents that I want to keep around so that when I revisit a topic, I can pick up where I left off.

If others find my documents and notes of interest, then that’s great. That is not the reason why I’m doing this though.

If folks have suggestions or corrections, then great! Create a fork, branch, and submit a PR. I don’t mind being corrected and giving credit where credit is due. I’m not right all the time and there are much more knowledgeable folks out there than me.

Kubernetes

Kubernetes is a big deal, and is something that I enjoy using.

I have decided to document my Kubernetes setup so that other folks can follow along, see what I have going on, and maybe learn a thing or two along the way. It is also documented because I tend to rebuild the mess every couple of months and I usually forget something along the way!

Order of Operations

Extras

Pi-Burnetes

Installing Kubernetes

This is for a single host installation. I’ll include instructions for adding an additional host down the line if you are so inclined.

I have consolidated a number of different documents, mostly for troubleshooting purposes.

Setup

This is for installation on an Ubuntu 24.04 LTS machine. I would recommend a machine with at least 8GiB of RAM, 16GiB of hard drive space, and at least two cores. The more resources you have, the better things will run and the more stuff you can cram into Kubernetes.

Make sure your server is up to date before we get started.

sudo apt update && sudo apt full-upgrade -y 

Step By Step

Step One: Disable Swap

Kubeadm will complain if swap is enabled, so let’s disable that.

sudo swapoff -a
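
Note that swapoff -a only lasts until the next reboot. To keep swap off permanently, comment out any swap entries in /etc/fstab as well; a minimal sketch, assuming a standard fstab layout:

sudo sed -i.bak '/[[:space:]]swap[[:space:]]/ s/^/#/' /etc/fstab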

Step Two: Kernel Parameters

There are some parameters that need to be tuned in the Linux kernel for Kubernetes to work properly.

sudo tee /etc/modules-load.d/containerd.conf << EOF
overlay
br_netfilter
EOF
sudo tee /etc/sysctl.d/kubernetes.conf << EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1
EOF
sudo modprobe overlay && sudo modprobe br_netfilter && sudo sysctl --system

Step Three: Container Runtime Installation

Let’s install containerd.

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/docker.gpg && \
sudo add-apt-repository -y "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" && \
sudo apt update && \
sudo apt install -y containerd.io

Containerd needs a configuration file. Luckily, we can generate one and then switch it to the systemd cgroup driver.

containerd config default | sudo tee /etc/containerd/config.toml >/dev/null 2>&1 
sudo sed -i 's/SystemdCgroup \= false/SystemdCgroup \= true/g' /etc/containerd/config.toml

Then we’ll need to enable and restart containerd.

sudo systemctl enable containerd && \
sudo systemctl restart containerd

Step Four: Kubernetes Runtime Installation

This will install Kubernetes 1.34 on the machine. At the time of writing, it is the most current version.

curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.34/deb/Release.key | sudo gpg  --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg && \
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.34/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list && \
sudo apt update && \
sudo apt install -y kubelet kubeadm kubectl
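
Optionally, pin the packages so a routine apt upgrade doesn’t move the cluster to a new version behind your back:

sudo apt-mark hold kubelet kubeadm kubectl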

Step Five: Kubeadm Execution

Now for the part that we’ve all been waiting for! The prerequisites are in place, so now it’s time to get Kubernetes up and running. Adjust --apiserver-advertise-address to match your server’s IP, and the CIDRs to fit your network.

sudo kubeadm init --apiserver-advertise-address=10.0.0.9 --pod-network-cidr=10.250.0.0/16 --service-cidr=172.16.0.0/24

This will take a little while to run.

Step Six: Tool Configuration

Once the installation is complete, you will need to configure kubectl. Execute the following:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

This copies the admin configuration into your home directory, which allows you to use kubectl from anywhere on the system.

Step Seven: Remove the Taints

This is a single-node system, which means that it’s also running as the control plane. Normally, Kubernetes doesn’t want additional pods on the control plane, which makes for an interesting catch-22. Thankfully, we can fix that.

kubectl taint nodes --all node-role.kubernetes.io/control-plane-

Step Eight: Container Networking

In order for the node to become ready, it will need networking installed. For simple, one-node installs, I personally think Calico is perfectly reasonable.

kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.30.3/manifests/calico.yaml

It will take a little bit for the networking jitters to settle.

Step Nine: Does it work?

Let’s pull up the cluster information

kubectl cluster-info

Do all nodes show READY?

kubectl get nodes -o wide

What do the pods look like?

kubectl get pods -o wide -A

Step Ten: Install Something!

You can also create a very basic nginx deployment just to see if things work.

kubectl apply -f https://gist.githubusercontent.com/bbrietzke/c59b6132c37ea36f9b84f1fee701a642/raw/952524cec7892e9db350fc62773c32ddfd9ab867/kubernetes-test.yaml

Then:

open http://kubernetes-host.local:30081/

And you should see a website in the browser of your choice.

Local Setup

There are some tools that I like to have on my local machine that make working with Kubernetes much easier. This document will go through the installation and configuration of them.

What to install?

Since I’m on a Mac, everything is installed through Homebrew (https://brew.sh/).

brew install k9s helm kubernetes-cli

If you want to install on the Linux/Kubernetes host, here are a few options:

Helm

sudo apt-get install curl gpg apt-transport-https --yes
curl -fsSL https://packages.buildkite.com/helm-linux/helm-debian/gpgkey | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/helm.gpg] https://packages.buildkite.com/helm-linux/helm-debian/any/ any main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install helm

K9S

Go to the K9s GitHub Releases page and download the build that is appropriate for your platform.

Configuration

Kubernetes Tools

The only thing that really needs configuration is kubectl itself. For that, you need to get a copy of the kube config file from the control plane.

mkdir -p ~/.kube
scp ubuntu@kubernetes-host.local:.kube/config ~/.kube/config

This configures both k9s and kubectl, so that bit is done. Both of the tools should work as you would expect.

Helm

You don’t have to configure helm, but it’s not a bad idea either. My personal chart repository is configured as follows.

helm repo add bbrietzke http://bbrietzke.github.io/charts

I have a few charts that I tend to use, in particular for setting up namespaces. I have three: prod, dev, and infra. There isn’t much else to customize, so this just works.

helm install namespaces bbrietzke/namespaces

It’s also the first helm chart I created.

Adding a Worker Node

On the worker node, all you need to do is steps one through four of the installation page. After installation, execute the following on the control plane node:

kubeadm token create --print-join-command

Just copy and paste that over to the new worker node and it will do the rest.
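
The output is a ready-to-run kubeadm join command. It will look roughly like this; the token and hash below are placeholders, and the address comes from the kubeadm init above:

sudo kubeadm join 10.0.0.9:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>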

Install MetalLB

Installation

Installation instructions can be found on their website.

First, enable strict ARP mode in kube-proxy. Preview the change, then apply it:

kubectl get configmap kube-proxy -n kube-system -o yaml | \
sed -e "s/strictARP: false/strictARP: true/" | \
kubectl diff -f - -n kube-system

kubectl get configmap kube-proxy -n kube-system -o yaml | \
sed -e "s/strictARP: false/strictARP: true/" | \
kubectl apply -f - -n kube-system

Then:

kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.15.2/config/manifests/metallb-native.yaml

Configuration

Address Pool

---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: primary-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.2.75-192.168.2.80

Advertisers

---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2advert-primary
  namespace: metallb-system
spec:
  ipAddressPools:
  - primary-pool
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: bgpadvert
  namespace: metallb-system
spec:
  ipAddressPools:
  - primary-pool
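
With the pool and advertisements applied, any Service of type LoadBalancer will be handed an address from primary-pool. A minimal sketch (the name, namespace, and selector are only examples):

---
apiVersion: v1
kind: Service
metadata:
  name: nginx-lb
  namespace: default
spec:
  type: LoadBalancer
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80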

Traefik Setup

Install

helm repo add traefik https://traefik.github.io/charts && \
helm repo update

Get Values

helm show values traefik/traefik > values.yaml

Install

kubectl create namespace traefik
helm upgrade -i --namespace traefik -f values.yaml traefik traefik/traefik 

My Values File

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Equal"
    effect: "NoSchedule"
providers:
  kubernetesGateway:
    enabled: true
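
With MetalLB in place, the chart’s LoadBalancer Service should pick up an external address from the pool; a quick check (assuming the default service name, traefik):

kubectl get svc -n traefik traefik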

NFS Persistent Volumes

Do your pods need persistent volumes in your home Kubernetes cluster? Turns out, NFS is an option.

https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner

Helm Charts

You can add the helm chart with:

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/

Then always pull the values, since you will need to customize the NFS server IP and path:

helm show values nfs-subdir-external-provisioner/nfs-subdir-external-provisioner > values.yaml

An example customized one looks like the following:

---
nfs:
  server: 192.168.2.70
  path: /srv/nfs/k8s
storageClass:
  create: true
  name: nfs-client
  defaultClass: true
tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Equal"
    effect: "NoSchedule"
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
resources:
  limits:
   cpu: 200m
   memory: 256Mi
  requests:
   cpu: 100m
   memory: 128Mi

And of course, we have to provision:

kubectl create namespace nfs-subdir
helm upgrade -i -f values.yaml nfs nfs-subdir-external-provisioner/nfs-subdir-external-provisioner --namespace nfs-subdir
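
Once the provisioner is running, a PersistentVolumeClaim against the nfs-client storage class gets a directory carved out on the NFS export automatically. A minimal example claim (the name and size are arbitrary):

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-claim
  namespace: default
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi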

Metrics Server

Metrics Server is not quite as easy to install as advertised, at least on systems installed with kubeadm. But the fix to make it work is pretty simple, assuming you do it the insecure way.

The correct way is more complicated, but also secure.

Installing

The code and instructions to install Metrics Server can be found here. I’m not repeating them here so they don’t go stale.

The instructions work perfectly: things get installed and then don’t work. At all.

The metrics-server pod just never becomes ready, and the logs complain about TLS certificates being invalid.

The Right Way to Install

You will need to go to each node, modify the kubelet config file, and add one configuration line: serverTLSBootstrap: true

sudo vi /var/lib/kubelet/config.yaml

You will then need to either restart the host machine or the kubelet service.

When the service comes back up, it will request a TLS certificate that the node can use to communicate with the rest of the infrastructure. The catch is that you will need to approve the certificate to make it official. You can approve all of the outstanding certificate requests with:

for kubeletcsr in `kubectl -n kube-system get csr | grep kubernetes.io/kubelet-serving | awk '{ print $1 }'`; do kubectl certificate approve $kubeletcsr; done

Then metrics server will be happy.

Cheating and Being Insecure

The way I get the metrics server installed is to cheat a little. I tell it to ignore TLS verification and magically things just work!

How to do it

wget https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Then add --kubelet-insecure-tls to the container args in the metrics-server Deployment (spec.template.spec.containers[0].args).
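
For reference, the relevant part of the Deployment in components.yaml ends up looking roughly like this (existing args abbreviated; only the last line is added):

    spec:
      containers:
      - name: metrics-server
        args:
        # ... keep the args that ship in components.yaml ...
        - --kubelet-insecure-tls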

Install

kubectl apply -f components.yaml


Managing Certificates in Kubernetes

Dealing with TLS certificates is a pain in the butt!

This document is just a rehash/shortened view with my specific configuration. You can find the full documentation over at cert-manager.io.

Installation

Manifests

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.19.1/cert-manager.yaml

Go over the verification section of the official docs to make sure it’s working.

Configuration

You have to create issuers per namespace; they are what actually create and distribute the certificates. The Issuer resource type is one of the CRDs that was created when you installed cert-manager.

Self-Signed

I created self-signed certificates for my namespaces just because.

Here is an example Issuer resource:

apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: dev-selfsigned-issuer
  namespace: dev
spec:
  selfSigned: {}

I’m sure I’ll write up a Helm chart at some point with the issuers that I need.

Using the Certificate Manager

The certificates are mostly used by your ingress controllers to prove that the domain is valid and to encrypt the communications between the origin and the client. I’m sure they can be used elsewhere, but this is the scenario that I use them for.

You will need to modify the Ingress resource definition to be similar to:

...
kind: Ingress
metadata:
    namespace: dev
    annotations:
        cert-manager.io/issuer: dev-selfsigned-issuer
...

The issuer you reference must live in the same namespace as the Ingress, since Issuer resources are namespaced.
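
Putting it together, a complete example for the dev namespace might look like the following; the host, service name, and secret name are placeholders. cert-manager watches the annotation plus the tls section and writes the signed certificate into the named secret:

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app
  namespace: dev
  annotations:
    cert-manager.io/issuer: dev-selfsigned-issuer
spec:
  tls:
  - hosts:
    - example-app.dev.local
    secretName: example-app-tls
  rules:
  - host: example-app.dev.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: example-app
            port:
              number: 80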

Cloud Native PostgreSQL

It’s hard to build modern applications without a database. I prefer PostgreSQL, and this is the operator that will help set up a clustered PostgreSQL installation in Kubernetes.

Installation

kubectl apply -f  https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.27/releases/cnpg-1.27.1.yaml

The Actual Cluster

You will want to use good, fast network-attached storage to keep database performance high. I’m using iSCSI here. Create the namespace first, then save the Cluster manifest below and apply it with kubectl apply -f.

kubectl create namespace db
---
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgresql
  namespace: db
spec:
  instances: 3
  postgresql:
    parameters:
      shared_buffers: "256MB"
  storage:
    size: 50G
    storageClass: iscsi
  resources:
    requests:
      memory: "1024Mi"
      cpu: 500m
    limits:
      memory: "4096Mi"
      cpu: 2000m
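
Assuming the manifest above was applied with kubectl apply -f, you can watch the cluster come up with:

kubectl get cluster,pods -n db
kubectl get secrets -n db   # CNPG generates credential secrets for the new cluster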

Redis Operator

The Redis operator is a pretty handy operator to have around, considering that Redis is a pretty popular piece of software.

The complete documentation can be found over at https://github.com/OT-CONTAINER-KIT/redis-operator.

Installation

Setup the helm chart.

helm repo add redis-ot https://ot-container-kit.github.io/helm-charts/

Install the operator.

kubectl create namespace redis-system
helm upgrade -i redis-operator redis-ot/redis-operator --namespace redis-system

Create a Redis instance

I’ve only created standalone instances for individual namespaces. I haven’t tried to work through a cluster installation or how that would work.

Here is an example of a standalone instance.

apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: Redis
metadata:
  name: redis-example
  namespace: default
spec:
  kubernetesConfig:
    image: quay.io/opstree/redis:latest
    imagePullPolicy: IfNotPresent
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
  podSecurityContext:
    runAsUser: 1000
    fsGroup: 1000

This will setup a redis instance and services to point to the instance.
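
To sanity-check the instance, you can port-forward to the Service the operator creates (I’m assuming here that it shares the resource name, redis-example) and ping it:

kubectl port-forward svc/redis-example 6379:6379 &
redis-cli -h 127.0.0.1 -p 6379 ping   # expect PONG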

Introduction

Build SLURM on Alma Linux 9

Slurm (Simple Linux Utility for Resource Management) is a powerful, open-source workload manager for Linux clusters, scheduling and allocating resources (CPUs, GPUs, memory) for parallel jobs, essential for High-Performance Computing (HPC) and AI workloads.

Overview

This runbook covers building SLURM RPM packages from source on Alma Linux 9. These RPMs can then be distributed to controller and compute nodes in your cluster.

Prerequisites

  • A machine with Alma Linux 9 installed
  • Root or sudo access
  • Approximately 2GB free disk space for build process
  • Internet connectivity for downloading source and dependencies

Steps

  1. Prepare RPM build environment
    # Create RPM build directory structure
    mkdir -p ~/rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
    echo '%_topdir %(echo $HOME)/rpmbuild' > ~/.rpmmacros
  2. Download SLURM source code

    The latest version can be found here.

    sudo dnf install -y wget
    
    # Set version variable for easy updates
    SLURM_VERSION="25.11.0"
    wget https://download.schedmd.com/slurm/slurm-${SLURM_VERSION}.tar.bz2
  3. Install build dependencies
    # Enable EPEL and CodeReady Builder repositories
    sudo dnf install -y epel-release
    sudo dnf config-manager --set-enabled crb
    
    # Install core build tools and SLURM dependencies
    sudo dnf install -y rpm-build autoconf automake gcc make \
        mariadb-devel munge-devel munge-libs pam-devel \
        perl-Switch perl-ExtUtils-MakeMaker perl-devel \
        pmix pmix-devel readline-devel \
        hdf5 hdf5-devel lua lua-devel \
        rrdtool-devel openssl openssl-devel libssh2-devel \
        hwloc hwloc-devel ncurses-devel man2html \
        libibmad libibumad http-parser-devel json-c-devel
  4. Build RPM packages

    This process typically takes 10-15 minutes depending on your system.

    rpmbuild --with slurmrestd --with pmix --with hdf5 -ta slurm-${SLURM_VERSION}.tar.bz2
    
  5. Verify and locate built RPMs

    ls -lh ~/rpmbuild/RPMS/x86_64/
    

    You should see RPMs including:

    • slurm-*.rpm - Core SLURM binaries
    • slurm-slurmctld-*.rpm - Controller daemon
    • slurm-slurmd-*.rpm - Compute node daemon
    • slurm-slurmdbd-*.rpm - Database daemon
    • slurm-slurmrestd-*.rpm - REST API daemon
    • slurm-pam_slurm-*.rpm - PAM module
    • slurm-devel-*.rpm - Development headers

Troubleshooting

Missing Dependencies

If rpmbuild fails with missing dependency errors:

  1. Note the missing package name from the error message
  2. Install using: sudo dnf install -y <package-name>-devel
  3. Re-run the rpmbuild command

Build Failures

  • Check build logs at: ~/rpmbuild/BUILD/slurm-*/config.log
  • Ensure CRB repository is enabled for http-parser-devel and json-c-devel
  • Verify sufficient disk space: df -h ~

Common Issues

  • “rpmbuild: command not found”: Install rpm-build package
  • Permission denied errors: Ensure you’re building as regular user, not root
  • Network timeouts: Check firewall settings if wget fails

Completion and Verification

Success criteria:

  • ✓ No errors during rpmbuild process
  • ✓ Multiple RPM files present in ~/rpmbuild/RPMS/x86_64/
  • ✓ RPM integrity check passes: rpm -K ~/rpmbuild/RPMS/x86_64/slurm-*.rpm

File sizes (approximate):

  • Core slurm package: ~15-20 MB
  • Total RPMs: ~40-50 MB

Contacts

Appendix

Package Distribution Guide

  • Controller node: slurm, slurm-slurmctld, slurm-slurmdbd (if using accounting)
  • Compute nodes: slurm, slurm-slurmd
  • Submit hosts: slurm (client tools only)
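
As a rough sketch of putting that guide into practice, copy the RPMs to each machine and install the relevant subset with dnf (paths assume the default rpmbuild output directory):

# Controller node
sudo dnf install -y ~/rpmbuild/RPMS/x86_64/slurm-[0-9]*.rpm \
    ~/rpmbuild/RPMS/x86_64/slurm-slurmctld-*.rpm

# Compute nodes
sudo dnf install -y ~/rpmbuild/RPMS/x86_64/slurm-[0-9]*.rpm \
    ~/rpmbuild/RPMS/x86_64/slurm-slurmd-*.rpm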

References

Changelog

  • 2025/12/14 - Created initial version

Introduction

RAID Arrays

Let’s build a few different kinds of RAID arrays and then use them.

I’m not going to go into detail about what RAID is or the different levels, since there is plenty of documentation out there already. These are the commands to set up software RAID for Linux and the general workflow to follow.

Do we have any now?

Let’s double‑check the current RAID state.

cat /proc/mdstat

Which Devices?

Insert the new drives into the machine. You should know what they come up as, but if you don’t, try:

lsblk

That should give you a list of all the block devices on the system and where they are being used. Some of them may not make sense at first glance, but you should see the ones you just added. If needed, run the command and copy down the results. Then insert the drives and execute it again.

I will be using the following devices:

/dev/sda
/dev/sdb
/dev/sdc
/dev/sdd

Create the Arrays!

RAID 0

We’ll start with RAID 0, which allows us to use all the drives as one big (though not redundant) block device.

sudo mdadm --create --verbose /dev/md0 \
  --level=raid0 --raid-devices=4 \
  /dev/sda /dev/sdb /dev/sdc /dev/sdd

RAID 1

Simple mirroring. The easiest to use, a decent redundancy package, and not all that wasteful.

sudo mdadm --create --verbose /dev/md1 \
  --level=raid1 --raid-devices=2 \
  /dev/sda /dev/sdb

RAID 5

Probably the best all‑around choice. It gives the best use of capacity and good performance.

sudo mdadm --create --verbose /dev/md2 \
  --level=raid5 --raid-devices=4 \
  /dev/sda /dev/sdb /dev/sdc /dev/sdd

Create the FileSystem

sudo mkfs -t ext4 /dev/md0

Mounting

You should have these drives come up every time you want to use them, so add the entries to /etc/fstab.

First, obtain the UUID of the array.

sudo blkid /dev/md0

Take the UUID and add the following line to /etc/fstab (adjust the mount point and options as needed):

UUID=655d2d3e-ab31-49c7-9cc3-583ec81fd316 /srv ext4 defaults 0 0

Then run:

sudo mount -a

and the array should appear at the desired location.

Update Configuration

sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf

Then update the initramfs so that the RAID arrays are recognised at boot:

sudo update-initramfs -u

Destroying RAID Arrays

You create RAID drives, so you should know how to tear them down.

So we built a few RAID arrays and mounted them—now let’s tear them down.

Unmount everything

Make sure you have the arrays unmounted from their normal (or abnormal) paths. If you don’t, much of this will not work as intended.

Check mdstat

Let’s see which arrays we currently have configured.

cat /proc/mdstat

The output should list any arrays you have built and indicate their current state. In this case, we need to know the names of the arrays so that we can remove them.

Remove Arrays

Now let’s stop the arrays.

sudo mdadm --stop /dev/md127  # or whatever was returned above

Once the arrays are stopped, we can zero out the superblocks so the disks are clean.

sudo mdadm --zero-superblock /dev/sda /dev/sdb /dev/sdc /dev/sdd

Trust, but Verify

Rerun cat /proc/mdstat and make sure those arrays are gone.

cat /proc/mdstat

Custom Host Publishing

If you’re like me, you probably have Avahi running on all your servers just to make name resolution simpler. What you probably didn’t know is that you can do some neat tricks with Avahi.

Neat Trick Number One

There is a command called avahi-publish that will publish a hostname on your network. This is pretty cool, because it means you don’t have to remember to add the hostname to your hosts file. It also means you can use it to publish a hostname for a device that doesn’t have Avahi installed.

For example:

avahi-publish -a -R pandora.local 192.168.1.10 # or whatever IP address you want...

Now you can ping pandora.local and it will respond!

What good is this? Imagine giving your router an avahi registered name and you can log into it without having to remember the IP. If you’re on Xfinity, you can do:

avahi-publish -a -R router.local 10.0.0.1

You will be able to ping router.local and it will respond!

Neat Trick Number Two

So pinging a router by name is nice and all, but not really that exciting.

What you can do is combine the above with Kubernetes Ingress resource definitions to host multiple Ingresses on the same host without having to do anything magical to DNS.

Neat Trick Number Three

Again, neat-o and all, but now you have a terminal tied up hosting names, and that’s just a waste of energy. What if the terminal window closes or the machine resets? Then you have to manually execute the commands to get the names back online.

Systemd to the rescue!

[Unit]
Description=Avahi Pandora

[Service]
ExecStart=/usr/bin/avahi-publish -a -R pandora.local 10.0.0.238
Restart=always

[Install]
WantedBy=default.target

Save the file in /etc/systemd/system ( i.e. /etc/systemd/system/avahi-pandora.service ). Then treat it as any normal service.

sudo systemctl enable avahi-pandora
sudo systemctl start avahi-pandora

Systems and Application Monitoring with Prometheus

Figuring out what’s going on when something breaks can be difficult, so having the right tooling can make a big difference. Prometheus is one such tool. With proper tuning and configuration, it can even alert you before a failure occurs.

It also makes pretty graphs—everyone loves good visualisations!

What is Prometheus?

Prometheus is a free and open‑source monitoring solution that collects and records metrics from servers, containers, and applications. It provides a flexible query language (PromQL) and powerful visualisation tools, and includes an alerting mechanism that sends notifications when needed.

Prerequisites

  • A machine capable of running Ubuntu 22.04 (or any other LTS release).
  • Basic administrative knowledge and an account with sudo access on that machine.

Installation

Update the system

sudo apt update && sudo apt -y upgrade

Create the Prometheus user account

sudo groupadd --system prometheus
sudo useradd -s /sbin/nologin --system -g prometheus prometheus

Create directories

These directories will hold Prometheus’s configuration files and data store.

sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus

Install Prometheus

Note: The following steps install the latest LTS tarball of Prometheus (as of this writing). You can download a different release from the official site: https://prometheus.io/download/#prometheus.

wget https://github.com/prometheus/prometheus/releases/download/v3.5.0/prometheus-3.5.0.linux-amd64.tar.gz
tar zvxf prometheus*.tar.gz
cd prometheus*/
sudo mv prometheus /usr/local/bin
sudo mv promtool /usr/local/bin
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
sudo mv prometheus.yml /etc/prometheus
sudo chown prometheus:prometheus /etc/prometheus
sudo chown -R prometheus:prometheus /var/lib/prometheus

cd ..
rm -rf prom*

Configuring

The full configuration guide is available here: https://prometheus.io/docs/prometheus/latest/configuration/configuration/. It walks through almost every option, though the wording can be confusing. Below is a simple, real‑world example of a prometheus.yml file that demonstrates typical usage.

# my global config
global:
  scrape_interval: 15s  # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s  # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
        labels:
          groups: 'monitors'
  - job_name: 'servers'
    static_configs:
      - targets:
          - 'atlas.faultycloud.lan:9182'
          - 'coeus.faultycloud.lan:9182'
          - 'gaia.faultycloud.lan:9182'
          - 'hyperion.faultycloud.lan:9182'
        labels:
          groups: 'win2022'
  - job_name: 'gitlab'
    static_configs:
      - targets:
          - '192.168.1.253:9090'
        labels:
          groups: 'development'

Run at Startup

Create a systemd service file at /etc/systemd/system/prometheus.service with the following content:

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/

[Install]
WantedBy=multi-user.target

Reload systemd, enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
sudo systemctl status prometheus
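
If everything came up cleanly, Prometheus is listening on port 9090. Two quick checks that don’t rely on the web UI: validate the config with promtool and hit the readiness endpoint.

promtool check config /etc/prometheus/prometheus.yml
curl -s http://localhost:9090/-/ready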

Systems and Application Notifications with Alert Manager

So you have Prometheus; the next step is Alertmanager, which will notify you when something goes awry.

AlertManager?

The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.

Prerequisites

A machine that can run Ubuntu 22.04 (or another LTS release).

You should also have basic administrative knowledge and an account that has sudo access on the above box.

Installation

Update the system

sudo apt update && sudo apt -y upgrade

Create the AlertManager User Account

sudo groupadd --system alertmanager
sudo useradd -s /sbin/nologin --system -g alertmanager alertmanager

Create Directories

These are for configuration files and libraries.

sudo mkdir /etc/alertmanager
sudo mkdir /var/lib/alertmanager

Install AlertManager

Now for the fun part!

# https://prometheus.io/download/#alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.28.1/alertmanager-0.28.1.linux-amd64.tar.gz
tar zvxf alertmanager*.tar.gz
cd alertmanager*/
sudo mv alertmanager /usr/local/bin
sudo mv amtool /usr/local/bin
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager
sudo chown alertmanager:alertmanager /usr/local/bin/amtool
sudo chown alertmanager:alertmanager /var/lib/alertmanager

sudo mv alertmanager.yml /etc/alertmanager

cd ..
rm -rf alert*

Run at Startup

sudo nano /etc/systemd/system/alertmanager.service

with

[Unit]
Description=AlertManager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
    --config.file /etc/alertmanager/alertmanager.yml \
    --storage.path=/var/lib/alertmanager

[Install]
WantedBy=multi-user.target

Reload systemd, enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
sudo systemctl status alertmanager
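
As a quick check, amtool (installed alongside alertmanager above) can validate the configuration, and the readiness endpoint confirms the service is answering:

amtool check-config /etc/alertmanager/alertmanager.yml
curl -s http://localhost:9093/-/ready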

Systems Monitoring with Node Exporter

Trying to figure out what is going on when something is broken can be hard, so it’s nice to have tooling to help with that. Prometheus is one such tool. With proper tuning and effort, it can warn you before something goes wrong.

And it makes pretty graphs. Everybody loves pretty graphs!

Prometheus

Prometheus is a free and open‑source monitoring solution for collecting metrics, events, and alerts. It records data from servers, containers, and applications. In addition to a flexible query language (PromQL) and powerful visualization tools, it also provides an alerting mechanism that sends notifications when needed.

Prerequisites

  • A machine running Ubuntu 24.04 or another LTS release.
  • Basic administrative knowledge and an account with sudo access on that machine.

Installation

Update the system

sudo apt update && sudo apt -y upgrade

Create the Node Exporter user account

sudo groupadd --system nodeexporter
sudo useradd -s /sbin/nologin --system -g nodeexporter nodeexporter

Create directories

These directories store configuration files and libraries.

sudo mkdir /var/lib/node_exporter

Install Node Exporter

Now for the fun part! You can view the latest Node Exporter downloads and pick the one you need on the official page: Node Exporter download page.

wget https://github.com/prometheus/node_exporter/releases/download/v1.9.1/node_exporter-1.9.1.linux-amd64.tar.gz
tar zvxf node_exporter*.tar.gz
cd node_exporter*/
sudo mv node_exporter /usr/local/bin
sudo chown nodeexporter:nodeexporter /usr/local/bin/node_exporter
sudo chown -R nodeexporter:nodeexporter /var/lib/node_exporter

cd ..
rm -rf node*

Run at startup

Create a systemd unit file for Node Exporter:

sudo nano /etc/systemd/system/node_exporter.service

Paste the following configuration:

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nodeexporter
Group=nodeexporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --web.listen-address=0.0.0.0:9182 \
    --collector.textfile.directory=/var/lib/node_exporter

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
sudo systemctl status node_exporter
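
Given the listen address in the unit file above, a quick manual scrape confirms the exporter is serving metrics:

curl -s http://localhost:9182/metrics | head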

Cassandra Installation

Install Apache Cassandra

# Install Java (Cassandra 4.x requires Java 11)
sudo apt install -y openjdk-11-jdk-headless

# Add the Cassandra repository and its key
echo "deb [signed-by=/etc/apt/keyrings/apache-cassandra.asc] https://debian.cassandra.apache.org 41x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
sudo curl -o /etc/apt/keyrings/apache-cassandra.asc https://downloads.apache.org/cassandra/KEYS

# Update package lists and install Cassandra
sudo apt update && sudo apt install -y cassandra

# Reload systemd, enable, and start the Cassandra service
sudo systemctl daemon-reload
sudo systemctl enable cassandra
sudo systemctl start cassandra

# Verify the service status
sudo systemctl status cassandra.service
nodetool status

Tip: After the first startup, Cassandra may take a minute or two to open all its ports. Use nodetool status to confirm that the node reports UN (Up/Normal).

Firewall Setup

# Port used by Cassandra for client communication (CQL)
sudo ufw allow 9042/tcp

# Inter-node communication port (storage port)
sudo ufw allow 7000/tcp

# SSL inter-node communication port
sudo ufw allow 7001/tcp

# JMX port (remote management)
sudo ufw allow 7199/tcp

Note: Port 7001 is only used when inter-node SSL encryption is enabled. If you configure a separate SSL port for clients (native_transport_port_ssl), open that port as well.


References

Network Attached Storage

There are plenty of network‑attached storage solutions out there, both free and open source. But why would we want to build one from scratch?

Understanding how these tools work gives you insight into their inner workings and opens the door to customisations that off‑the‑shelf solutions can’t provide.

Starting with Samba

I prefer to run on Ubuntu, so I’m using the current LTS (24.04.3) for this guide. The server is named vault.

First, update the system:

sudo apt update && sudo apt -y full-upgrade

Next, create the parent directories:

sudo mkdir -p /srv/{samba,nfs}

Install Samba

sudo apt install -y samba

Create shared directories

sudo mkdir -p /srv/samba/{public,ISOs}

Edit the configuration

Open the Samba configuration file:

sudo vi /etc/samba/smb.conf

Add the following sections (replace or append as appropriate):

[homes]
  comment = Home Directory
  browsable = no
  read only = no
  create mask = 0700
  directory mask = 0700
  valid users = %S

[ISOs]
  comment = ISO files
  path = /srv/samba/ISOs
  browseable = yes
  read only = no
  create mask = 0700
  directory mask = 0700
  guest ok = yes

[public]
  comment = Public Share
  path = /srv/samba/public
  browseable = yes
  read only = no
  create mask = 0766
  directory mask = 0766
  guest ok = yes

Tip: Granting guest ok = yes allows unauthenticated access. Use it only for truly public data.

Add user accounts

sudo smbpasswd -a $USER

Restart services

Restart the Samba daemons whenever you add a share or create a new user:

sudo systemctl restart smbd nmbd

Open the firewall

sudo ufw allow samba

NFS Installation

sudo apt install -y nfs-kernel-server

Create shared directories

sudo mkdir -p /srv/nfs/{common,k8s}

Setup permissions

sudo chown -R nobody:nogroup /srv/nfs/
sudo chmod -R 755 /srv/nfs/

Edit the NFS export list

Open /etc/exports:

sudo vi /etc/exports

Add the following lines (modify subnets as appropriate):

/srv/nfs/common 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
/srv/nfs/k8s 10.0.0.0/26(rw,sync,no_subtree_check,no_root_squash)

Export the shares

sudo exportfs -a

Restart the NFS service

sudo systemctl restart nfs-kernel-server

Configure firewall rules

Adjust for your network.

sudo ufw allow from 192.168.1.0/24 to any port nfs
sudo ufw allow from 10.0.0.0/26 to any port nfs
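
From a client on one of the allowed subnets, you can sanity-check the exports and mount one. The hostname vault.local and the mount point are assumptions based on this guide; substitute your own.

sudo apt install -y nfs-common
showmount -e vault.local
sudo mount -t nfs vault.local:/srv/nfs/common /mnt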

For more detailed information, refer to the Samba documentation and the NFS manual pages.

Network Attached Storage – iSCSI

iSCSI is typically used when you need fast, block‑level storage for workloads such as databases. It isn’t usually something you’d attach to a consumer desktop.

Starting with iSCSI

I prefer to run on Ubuntu, so I’m using the current LTS (24.04.3) for this guide. The server is named vault.

First, update the system:

sudo apt update && sudo apt -y full-upgrade

Next, create the parent directories:

sudo mkdir -p /srv/iscsi

Install the iSCSI target

sudo apt install -y targetcli-fb

Create shared image files

Create some backing files that can be used as targets. Optionally, you can use disk partitions or LVM volumes.

sudo dd if=/dev/zero of=/srv/iscsi/disk00.img bs=1 count=0 seek=50G # Creates a sparse 50GB image file
sudo dd if=/dev/zero of=/srv/iscsi/disk01.img bs=1 count=0 seek=50G # Creates a 50GB image file
sudo dd if=/dev/zero of=/srv/iscsi/disk02.img bs=1 count=0 seek=50G # Creates a 50GB image file
sudo dd if=/dev/zero of=/srv/iscsi/disk03.img bs=1 count=0 seek=50G # Creates a 50GB image file
sudo dd if=/dev/zero of=/srv/iscsi/disk04.img bs=1 count=0 seek=50G # Creates a 50GB image file
sudo dd if=/dev/zero of=/srv/iscsi/disk05.img bs=1 count=0 seek=50G # Creates a 50GB image file

The image files start out small and grow as data is written to them. Be careful not to over‑provision the system, as it becomes a single point of failure for anything that uses iSCSI.

Mount on loopback and format new drives (optional)

sudo losetup -f /srv/iscsi/disk00.img
sudo losetup -a
sudo mkfs.ext4 /dev/loopX   # replace X with the loop device returned by losetup -a
sudo losetup -d /dev/loopX

Or format all of them in one shot:

for image in /srv/iscsi/*.img; do
  dev=$(sudo losetup -f --show "$image")   # attach and capture the loop device
  sudo mkfs.ext4 "$dev"
  sudo losetup -d "$dev"
done

Configure the iSCSI target using TargetCLI

sudo targetcli

Inside the shell, follow the steps below to create a backstore from the disk images you created.

cd /backstores/fileio
create disk00 /srv/iscsi/disk00.img
create disk01 /srv/iscsi/disk01.img
create disk02 /srv/iscsi/disk02.img
create disk03 /srv/iscsi/disk03.img
create disk04 /srv/iscsi/disk04.img
create disk05 /srv/iscsi/disk05.img

Create the actual LUNs:

cd /iscsi
create iqn.2025-11.com.example:storage
cd iqn.2025-11.com.example:storage/tpg1/luns
create /backstores/fileio/disk00
create /backstores/fileio/disk01
create /backstores/fileio/disk02
create /backstores/fileio/disk03
create /backstores/fileio/disk04
create /backstores/fileio/disk05

To enable authentication:

cd ../
set attribute generate_node_acls=1
set attribute demo_mode_write_protect=0
set attribute authentication=1
set auth userid=your_username
set auth password=your_secret_password

To restrict iSCSI to specific addresses:

cd iqn.2025-11.com.example:storage/tpg1/portals
delete 0.0.0.0:3260
create 10.0.0.5:3260

To save the configuration:

cd /
saveconfig
exit

Verification / validation

sudo targetcli ls

Setup firewall

sudo ufw allow 3260/tcp
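
On the client side, open-iscsi can discover and log into the target. A minimal sketch, assuming the portal address above and that any CHAP credentials have been set in /etc/iscsi/iscsid.conf:

sudo apt install -y open-iscsi
sudo iscsiadm -m discovery -t sendtargets -p 10.0.0.5
sudo iscsiadm -m node -T iqn.2025-11.com.example:storage -p 10.0.0.5 --login
lsblk   # the LUNs appear as new sd* block devices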

N8N the hard way

Runbooks

Runbooks are a way of documenting procedural Information Technology information that is repetitive in nature.

Most of the time, you see runbooks used for troubleshooting a problem, diagnosing an issue, or walking through a procedure.

This is a collection of the runbooks that I have decided to document for my area.

Table of Contents

Alerts

Create User for PostgreSQL

Overview

Prerequisites

  • Access to log into the database server via SSH.
  • sudo access on the database server

Steps

  1. Log into the database server via SSH.
  2. Sudo into the postgres user and execute the psql command-line tool.
    • sudo -u postgres psql
  3. Create the user with an encrypted password:
    • We recommend a long, complex password, since it only needs to be recorded here and can be entered into the target system at the same time.
    • The user name should reflect the database it will primarily access.
    • CREATE USER db01 WITH ENCRYPTED PASSWORD 'R3@llyL0ngP@ssw0rd123456790';
  4. Grant the new user CREATEDB privileges (a matching database can be created as sketched below):
    • ALTER USER db01 CREATEDB;
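
A minimal sketch of creating the matching database that the verification step below connects to (db01_prod is the name used there; adjust to suit):

CREATE DATABASE db01_prod OWNER db01;
GRANT ALL PRIVILEGES ON DATABASE db01_prod TO db01;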

Troubleshooting

  • If the target system cannot log into the database after the database/user creation process, you can simply re-run the above steps to make sure they are correct.
  • If the user is present, but the password has been forgotten, you may reset the password as follows:
    • ALTER USER db01_user WITH ENCRYPTED PASSWORD '3^On9D4p59^4';

Completion and Verification

You can log into the target database from callisto.lan to verify that everything is set up correctly.

psql -h 'callisto.lan' -U 'db01' -d 'db01_prod'

It will prompt you for the password created above. If the login succeeds, then the database, user, and permissions are okay.

Contacts

Not Applicable

Appendix

Changelog

* 2023/04/06 - Created