
Monitoring a K3s Cluster: Prometheus, Grafana, and Beyond

Deploying a full observability stack on K3s — kube-prometheus-stack for metrics and dashboards, Uptime Kuma for status pages, Glances for system-level monitoring, and Traefik dashboard for ingress visibility.


Four Layers of Monitoring

Running services without visibility into what’s happening is asking for trouble. I set up four complementary monitoring tools, each covering a different layer:

  1. Prometheus + Grafana (kube-prometheus-stack) — Cluster metrics, pod resource usage, node health, alerting rules
  2. Uptime Kuma — External endpoint monitoring with a public status page
  3. Glances — Real-time system metrics (CPU, RAM, disk, network) per node via REST API
  4. Traefik Dashboard — Ingress routing visibility, entrypoints, middleware chain

kube-prometheus-stack: One Helm Chart, Full Observability

The kube-prometheus-stack bundles Prometheus, Grafana, Alertmanager, and ~30 pre-built dashboards into a single Helm release. I deploy it with an Ansible playbook that uses kubernetes.core.helm:

- name: Install kube-prometheus-stack
  kubernetes.core.helm:
    kubeconfig: "{{ kubeconfig_path }}"
    name: kube-prometheus-stack
    chart_ref: prometheus-community/kube-prometheus-stack
    chart_version: "82.10.3"
    release_namespace: monitoring
    create_namespace: true
    values:
      grafana:
        adminPassword: "{{ vault_grafana_admin_password }}"
        ingress:
          enabled: true
          hosts:
            - grafana.williamcruz.ch
          annotations:
            cert-manager.io/cluster-issuer: letsencrypt-prod
          tls:
            - secretName: grafana-tls
              hosts:
                - grafana.williamcruz.ch

The Grafana admin password comes from Ansible Vault — never hardcoded in values files. After deployment, Grafana is accessible at grafana.williamcruz.ch with a valid Let’s Encrypt certificate.
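The vault variable itself lives in an encrypted vars file. A minimal sketch — the file path here is illustrative; the variable name matches the playbook above:

```yaml
# group_vars/all/vault.yaml (illustrative path; any vars file the playbook loads works)
# Encrypted at rest with: ansible-vault encrypt group_vars/all/vault.yaml
vault_grafana_admin_password: "change-me"
```

At runtime the playbook decrypts it via `--ask-vault-pass` or `--vault-password-file`, so the plaintext never lands in the repository.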

What Prometheus Collects

Out of the box, the stack scrapes metrics from:

  • kubelet — Pod lifecycle, container resource usage
  • kube-state-metrics — Kubernetes object state (deployments, pods, nodes)
  • node-exporter — Hardware and OS metrics from every node
  • CoreDNS — DNS query rates and errors
  • etcd — K3s embedded datastore health

This gives me dashboards for everything: CPU/memory per pod, disk I/O per node, network traffic, Kubernetes API latency, and more.
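Custom alerts slot into the same stack as PrometheusRule resources picked up by the operator. A sketch — the rule name and threshold are illustrative, and the `release` label must match the chart's default ruleSelector (the Helm release name):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-memory-alerts          # illustrative name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # must match the chart's ruleSelector
spec:
  groups:
    - name: node-memory
      rules:
        - alert: NodeMemoryLow
          # node-exporter metrics scraped by the stack out of the box
          expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} has less than 10% memory available"
```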

Uptime Kuma: The Public Status Page

Uptime Kuma monitors all public endpoints with HTTP keyword checks and TLS certificate validation. It runs on the cluster as a Helm deployment in the monitoring namespace.
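The deployment task mirrors the kube-prometheus-stack one. A sketch, assuming a community uptime-kuma chart has been added under the repo alias shown — the chart reference and values schema are assumptions, not the author's exact config:

```yaml
- name: Install Uptime Kuma
  kubernetes.core.helm:
    kubeconfig: "{{ kubeconfig_path }}"
    name: uptime-kuma
    chart_ref: uptime-kuma/uptime-kuma   # assumes this repo alias was added beforehand
    release_namespace: monitoring
    create_namespace: true
```

Ingress and TLS follow the same cert-manager annotation pattern as Grafana above.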

The public status page at status.williamcruz.ch shows real-time uptime for every service:

  • williamcruz.ch (Portfolio)
  • dashboard.williamcruz.ch
  • grafana.williamcruz.ch
  • argocd.williamcruz.ch
  • n8n.williamcruz.ch
  • auth.williamcruz.ch (Keycloak)
  • And every other subdomain

Uptime Kuma also receives push heartbeats from backup CronJobs — if a backup doesn’t check in within its expected window, the monitor turns red. This bridges monitoring with the backup system.
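The heartbeat is just a final request against the monitor's push URL once the backup succeeds. A minimal CronJob sketch — the image, script path, and push token are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
  namespace: backup
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: curlimages/curl   # placeholder; the real job runs the backup tooling
              command:
                - /bin/sh
                - -c
                # Push heartbeat only fires if the backup exits 0; a missed
                # check-in flips the Uptime Kuma push monitor to red.
                - |
                  /scripts/run-backup.sh && \
                  curl -fsS "https://status.williamcruz.ch/api/push/PUSH_TOKEN?status=up&msg=backup-ok"
```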

Glances: System-Level Metrics

Prometheus handles Kubernetes-level metrics well, but sometimes I need raw system data — actual disk usage, network interface throughput, running processes. Glances runs as a DaemonSet so every node has an instance, exposed via REST API.
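A DaemonSet sketch using the upstream nicolargo/glances image in web-server mode (`-w` serves the REST API on port 61208); hostPID and a read-only root filesystem mount give it the node-level view rather than the container's:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: glances
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: glances
  template:
    metadata:
      labels:
        app: glances
    spec:
      hostPID: true                  # see host processes, not just the container's
      containers:
        - name: glances
          image: nicolargo/glances:latest
          env:
            - name: GLANCES_OPT
              value: "-w"            # web-server mode: REST API on 61208
          ports:
            - containerPort: 61208
              hostPort: 61208        # reachable per node at <node-ip>:61208
          volumeMounts:
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
      volumes:
        - name: rootfs
          hostPath:
            path: /
```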

The infrastructure dashboard (a separate Next.js app) polls the Glances API to show real-time CPU, RAM, and disk usage per node.

Traefik Dashboard

Traefik’s built-in dashboard shows all active routers, services, middlewares, and entrypoints. I expose it at traefik.williamcruz.ch with BasicAuth protection via an Ansible playbook:

- name: Deploy Traefik Dashboard with BasicAuth
  kubernetes.core.k8s:
    kubeconfig: "{{ kubeconfig_path }}"
    state: present
    definition:
      apiVersion: traefik.io/v1alpha1
      kind: IngressRoute
      metadata:
        name: traefik-dashboard
        namespace: kube-system
      spec:
        entryPoints: [websecure]
        routes:
          - match: Host(`traefik.williamcruz.ch`)
            kind: Rule
            middlewares:
              - name: traefik-auth
            services:
              - name: api@internal
                kind: TraefikService

How They Work Together

Each tool covers a blind spot the others miss:

  • Prometheus tells me a pod is using 90% of its memory limit
  • Uptime Kuma tells me the endpoint is returning 502 errors to real users
  • Glances tells me the node’s disk is 95% full
  • Traefik Dashboard tells me the ingress route is misconfigured

When the n8n automation layer sits on top (covered in a separate article), these tools feed into automated workflows that detect issues and send Signal alerts before I notice anything is wrong.

Deployment Order Matters

The monitoring stack deploys after K3s and ArgoCD are running but before any application workloads. This way I can observe every subsequent deployment. The Ansible playbook order:

  1. setup_kube_prometheus_stack.yaml — Prometheus + Grafana
  2. setup_uptimekuma.yaml — Uptime Kuma
  3. setup_glances.yaml — Glances DaemonSet
  4. setup_traefik_dashboard.yaml — Traefik with BasicAuth

Total deployment time for the full monitoring stack: about 10 minutes.

Resource Cost

On a 3-node cluster with 12 vCPUs total, the monitoring stack consumes roughly:

  • Prometheus: ~500 MB RAM, modest CPU outside of scrape and query bursts
  • Grafana: ~200 MB RAM
  • Uptime Kuma: ~150 MB RAM
  • Glances: ~50 MB RAM per node

It’s noticeable but acceptable. Prometheus is the heaviest component — if resources get tight, adjusting scrape intervals and retention periods is the first lever.
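Both levers are plain chart values. A sketch of the relevant kube-prometheus-stack overrides — the specific numbers are illustrative:

```yaml
prometheus:
  prometheusSpec:
    scrapeInterval: 60s   # default is 30s; doubling it roughly halves ingest load
    retention: 7d         # shorter retention caps TSDB disk and memory growth
```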