| author | Paul Buetow <paul@buetow.org> | 2026-05-17 08:46:17 +0300 |
|---|---|---|
| committer | Paul Buetow <paul@buetow.org> | 2026-05-17 08:46:17 +0300 |
| commit | ca37dfb4e3f5a8e4af00343fa262874da4e1080f | |
| tree | 8fbb93f742bf9baaff6500b6921d4a2894f9378e | |
| parent | 8d94f7982b63a1f7c971d9788c177f374abec102 | |
f3s skill: update observability, freebsd-setup, hardware refs
- observability.md: reflect current state — Grafana/Loki/Tempo disabled
(SQLite-on-NFS unreliable), Alloy running with minimal config only,
Prometheus TSDB wiped and restarted clean (2026-05-16 incident).
Add TSDB recovery runbook. Document .disabled manifest pattern.
- freebsd-setup.md: add coretemp kernel module section — all f-hosts
now have coretemp_load="YES" in /boot/loader.conf for persistent
per-core die temps (hw.acpi.thermal.tz0 is unreliable).
- hardware.md: add f3 IP (192.168.1.133) to LAN table.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| -rw-r--r-- | prompts/skills/f3s/references/freebsd-setup.md | 21 |
| -rw-r--r-- | prompts/skills/f3s/references/hardware.md | 1 |
| -rw-r--r-- | prompts/skills/f3s/references/observability.md | 116 |
3 files changed, 90 insertions, 48 deletions
diff --git a/prompts/skills/f3s/references/freebsd-setup.md b/prompts/skills/f3s/references/freebsd-setup.md
index 343af0a..796520e 100644
--- a/prompts/skills/f3s/references/freebsd-setup.md
+++ b/prompts/skills/f3s/references/freebsd-setup.md
@@ -162,6 +162,27 @@ local snap jobs and remain owned by their push jobs.
 
 Config at `/usr/local/mimecast/etc/uptimed.conf` — `LOG_MAXIMUM_ENTRIES=0` (keep all records forever). Check with `uprecords`.
 
+## Kernel Modules
+
+### coretemp (CPU temperature)
+
+`coretemp` provides real per-core die temps via Intel DTS. It is **not** loaded by default. Without it, `hw.acpi.thermal.tz0` is the only temperature source — it is often a constant lie (e.g. always 27.9°C) and cannot be used for thermal alerting.
+
+All f-hosts have `coretemp_load="YES"` in `/boot/loader.conf` (added 2026-05-17). This persists the module across reboots so Prometheus node_exporter can scrape `dev.cpu.N.temperature` automatically.
+
+To check:
+```sh
+sysctl dev.cpu | grep temperature # per-core die temps (requires coretemp)
+kldstat | grep coretemp # verify module is loaded
+grep coretemp /boot/loader.conf # verify persistence
+```
+
+To add manually if missing:
+```sh
+doas kldload coretemp
+echo 'coretemp_load="YES"' | doas tee -a /boot/loader.conf
+```
+
 ## Shell
 
 Default shell is `tcsh` (FreeBSD default). Run `rehash` after installing new packages for tcsh to find them.
diff --git a/prompts/skills/f3s/references/hardware.md b/prompts/skills/f3s/references/hardware.md
index fa7252e..67fec18 100644
--- a/prompts/skills/f3s/references/hardware.md
+++ b/prompts/skills/f3s/references/hardware.md
@@ -45,6 +45,7 @@ BIOS requirements for WoL: enable "Wake on LAN", disable "ERP Support", enable "
 | f0 | 192.168.1.130 | f0.lan.buetow.org |
 | f1 | 192.168.1.131 | f1.lan.buetow.org |
 | f2 | 192.168.1.132 | f2.lan.buetow.org |
+| f3 | 192.168.1.133 | f3.lan.buetow.org |
 
 Static IPs configured at FreeBSD install time. Also in `/etc/hosts` on all nodes.
diff --git a/prompts/skills/f3s/references/observability.md b/prompts/skills/f3s/references/observability.md
index 0cfa7a9..4407b1b 100644
--- a/prompts/skills/f3s/references/observability.md
+++ b/prompts/skills/f3s/references/observability.md
@@ -2,20 +2,25 @@
 
 ## Overview
 
-Complete observability stack deployed into the `monitoring` namespace of the k3s cluster.
+Observability stack deployed into the `monitoring` namespace of the k3s cluster.
 
-Stack: **PLG + Tempo** (Prometheus, Loki, Grafana + Tempo for distributed tracing)
+**Current state (as of 2026-05-16)**: Prometheus + Alloy only. Grafana, Loki, and Tempo are **disabled** — their ArgoCD manifests are renamed to `.disabled` and their pods do not run.
+
+- Grafana disabled: SQLite-on-NFS is fundamentally unreliable across pod restarts. Grafana's database gets locked when the pod reschedules to a different node. Long-term fix: migrate to local-path PVC (same pattern as navidrome).
+- Loki/Tempo disabled: no log aggregation or distributed tracing until Grafana is re-enabled.
+- Alloy is running but **only emits its own logs** (`logging { level = "info" }`). Log shipping to Loki and trace forwarding to Tempo are removed from its config.
+- Prometheus TSDB was wiped and restarted clean (2026-05-16) after WAL corruption (zero-byte segments from a cluster blip).
 
 ## Components
 
-| Component | Purpose |
-|-----------|---------|
-| **Prometheus** | Time-series metrics, alerting rules, Alertmanager |
-| **Grafana** | Visualisation and dashboarding |
-| **Loki** | Log aggregation (single-binary mode) |
-| **Alloy** | Telemetry collector (DaemonSet) — ships logs to Loki, traces to Tempo |
-| **Tempo** | Distributed tracing backend |
-| **Node Exporter** | Host-level metrics (on k3s nodes AND FreeBSD hosts) |
+| Component | Purpose | State |
+|-----------|---------|-------|
+| **Prometheus** | Time-series metrics, alerting rules, Alertmanager | **Running** |
+| **Alloy** | Telemetry collector (DaemonSet) | **Running** (minimal config only) |
+| **Node Exporter** | Host-level metrics (on k3s nodes AND FreeBSD hosts) | **Running** |
+| **Grafana** | Visualisation and dashboarding | **Disabled** (SQLite-on-NFS) |
+| **Loki** | Log aggregation (single-binary mode) | **Disabled** |
+| **Tempo** | Distributed tracing backend | **Disabled** |
 
 ## Deployment
 
@@ -33,17 +38,27 @@ Deployment tool: `just` (Justfile in each component directory).
 kubectl create namespace monitoring
 ```
 
-## Installing Prometheus + Grafana
+### Disabled component manifests
+
+These files exist in the repo but are renamed `.disabled` so ArgoCD ignores them:
+```
+f3s/argocd-apps/monitoring/loki.yaml.disabled
+f3s/argocd-apps/monitoring/tempo.yaml.disabled
+f3s/argocd-apps/monitoring/grafana-ingress.yaml.disabled
+```
 
-Uses `kube-prometheus-stack` Helm chart:
+To re-enable, rename back to `.yaml` and ensure Grafana is using a non-NFS PVC (local-path).
+
+## Installing Prometheus
+
+Uses `kube-prometheus-stack` Helm chart with **Grafana subchart disabled** (`grafana.enabled: false`):
 
 ```sh
 helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
 helm repo update
 
-# Create NFS storage directories first
+# Create NFS storage directory first
 mkdir -p /data/nfs/k3svolumes/prometheus/data
-mkdir -p /data/nfs/k3svolumes/grafana/data
 
 cd conf/f3s/prometheus && just install
 ```
@@ -80,54 +95,45 @@ Default: `admin` / `prom-operator` — change immediately after first login.
 
 Grafana accessible at `grafana.f3s.foo.zone` via Traefik ingress.
 
-## Installing Loki + Alloy
+## Installing Alloy (minimal)
+
+Alloy is installed as part of the Loki Helm chart but runs with a minimal config (no log shipping):
 
 ```sh
-mkdir -p /data/nfs/k3svolumes/loki/data
 cd conf/f3s/loki && just install
-# installs both loki and alloy
+# installs alloy only (loki itself is disabled via loki.yaml.disabled)
 ```
 
-Loki URL (internal): `http://loki.monitoring.svc.cluster.local:3100`
-
-Add Loki as Grafana data source: Configuration → Data Sources → Loki → URL above.
+### Current Alloy config (`alloy-values.yaml`)
 
-### Alloy configuration (`alloy-values.yaml`)
+Minimal — only emits Alloy's own operational logs:
 
 ```
-discovery.kubernetes "pods" {
-  role = "pod"
+logging {
+  level = "info"
 }
+```
 
-discovery.relabel "pods" {
-  targets = discovery.kubernetes.pods.targets
-  rule { source_labels = ["__meta_kubernetes_namespace"]; target_label = "namespace" }
-  rule { source_labels = ["__meta_kubernetes_pod_name"]; target_label = "pod" }
-  rule { source_labels = ["__meta_kubernetes_pod_container_name"]; target_label = "container" }
-  rule { source_labels = ["__meta_kubernetes_pod_label_app"]; target_label = "app" }
-}
+To re-enable log shipping (once Loki is running again), restore the full `discovery.kubernetes` + `loki.source.kubernetes` + `loki.write` pipeline.
 
-loki.source.kubernetes "pods" {
-  targets = discovery.relabel.pods.output
-  forward_to = [loki.write.default.receiver]
-}
+## Installing Loki (disabled)
 
-loki.write "default" {
-  endpoint {
-    url = "http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push"
-  }
-}
+```sh
+mkdir -p /data/nfs/k3svolumes/loki/data
+# Rename loki.yaml.disabled → loki.yaml first, then:
+cd conf/f3s/loki && just install
 ```
 
-## Installing Tempo
+Loki URL (internal): `http://loki.monitoring.svc.cluster.local:3100`
+
+## Installing Tempo (disabled)
 
 ```sh
 mkdir -p /data/nfs/k3svolumes/tempo/data
+# Rename tempo.yaml.disabled → tempo.yaml first, then:
 cd conf/f3s/tempo && just install
 ```
 
-Add Tempo as Grafana data source: Grafana → Configuration → Data Sources → Tempo.
-
 ## Monitoring FreeBSD Hosts (f0, f1, f2)
 
 ### Install node_exporter on FreeBSD
@@ -247,8 +253,22 @@ Gogios scrapes Alertmanager at regular intervals and sends email notifications.
 - Node-level metrics (CPU, memory, disk) — both k3s and FreeBSD nodes
 - ZFS ARC statistics on FreeBSD hosts
 - Application performance metrics
-- Log aggregation from all pods (via Alloy → Loki)
-- Distributed traces (via Alloy → Tempo)
+- ~~Log aggregation from all pods (via Alloy → Loki)~~ — disabled
+- ~~Distributed traces (via Alloy → Tempo)~~ — disabled
+
+## Prometheus TSDB Recovery
+
+If Prometheus fails to start with `opening storage failed: get segment range: segments are not sequential`, WAL segments are corrupt (can happen after a cluster blip leaving zero-byte WAL files).
+
+Full TSDB wipe (loses all historical data — confirm first):
+
+```sh
+# On the NFS server (f0 or CARP MASTER)
+rm -rf /data/nfs/k3svolumes/prometheus/data/prometheus-db/
+mkdir -p /data/nfs/k3svolumes/prometheus/data/prometheus-db
+chown 1000:1000 /data/nfs/k3svolumes/prometheus/data/prometheus-db
+# Prometheus will recreate the TSDB on next start
+```
 
 ## Useful LogQL Queries
 
@@ -266,8 +286,8 @@ Gogios scrapes Alertmanager at regular intervals and sends email notifications.
 ## NFS Storage Paths for Observability
 
 ```
-/data/nfs/k3svolumes/prometheus/data
-/data/nfs/k3svolumes/grafana/data
-/data/nfs/k3svolumes/loki/data
-/data/nfs/k3svolumes/tempo/data
+/data/nfs/k3svolumes/prometheus/data # active
+/data/nfs/k3svolumes/grafana/data # exists but unused (grafana disabled)
+/data/nfs/k3svolumes/loki/data # exists but unused (loki disabled)
+/data/nfs/k3svolumes/tempo/data # exists but unused (tempo disabled)
 ```
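
The TSDB recovery runbook added in this commit says the wipe loses all historical data and should be confirmed first. One possible pre-check, sketched below, lists zero-byte files in the WAL directory so the damage can be inspected before anything is deleted. The function name `list_empty_wal_segments` is hypothetical and not part of the commit, and the sketch assumes the WAL sits in a `wal` subdirectory of the TSDB directory.

```sh
#!/bin/sh
# Hypothetical helper (not part of the commit): print zero-byte files found
# in the WAL directory of a Prometheus TSDB, one path per line.
# Assumption: the WAL lives in "<tsdb_dir>/wal".
list_empty_wal_segments() {
    tsdb_dir=$1
    # -type f skips directories; -size 0 matches only empty files.
    find "$tsdb_dir/wal" -type f -size 0 2>/dev/null | sort
}

# Example call, using the path from the runbook in this diff:
# list_empty_wal_segments /data/nfs/k3svolumes/prometheus/data/prometheus-db
```

Empty output does not prove the WAL is healthy; the check only targets the zero-byte symptom described in the runbook.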
