author Paul Buetow <paul@buetow.org> 2026-05-17 08:46:17 +0300
committer Paul Buetow <paul@buetow.org> 2026-05-17 08:46:17 +0300
commit ca37dfb4e3f5a8e4af00343fa262874da4e1080f
tree 8fbb93f742bf9baaff6500b6921d4a2894f9378e /prompts
parent 8d94f7982b63a1f7c971d9788c177f374abec102
f3s skill: update observability, freebsd-setup, hardware refs

- observability.md: reflect current state: Grafana/Loki/Tempo disabled (SQLite-on-NFS unreliable), Alloy running with minimal config only, Prometheus TSDB wiped and restarted clean (2026-05-16 incident). Add TSDB recovery runbook. Document .disabled manifest pattern.
- freebsd-setup.md: add coretemp kernel module section. All f-hosts now have coretemp_load="YES" in /boot/loader.conf for persistent per-core die temps (hw.acpi.thermal.tz0 is unreliable).
- hardware.md: add f3 IP (192.168.1.133) to LAN table.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Diffstat (limited to 'prompts')
-rw-r--r-- prompts/skills/f3s/references/freebsd-setup.md | 21
-rw-r--r-- prompts/skills/f3s/references/hardware.md | 1
-rw-r--r-- prompts/skills/f3s/references/observability.md | 116
3 files changed, 90 insertions, 48 deletions
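
The freebsd-setup.md change in this commit persists the module by appending `coretemp_load="YES"` to `/boot/loader.conf` with `tee -a`, which adds a duplicate line if it is run twice. A minimal idempotent sketch, assuming a POSIX shell; the helper name `ensure_line` is hypothetical and not part of the commit:

```sh
# Hypothetical helper (not from the commit): append a line to a file
# only if that exact line is not already present.
ensure_line() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

On an f-host this would be called as `ensure_line /boot/loader.conf 'coretemp_load="YES"'` from a root shell, since the redirection needs write access to the file.
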
diff --git a/prompts/skills/f3s/references/freebsd-setup.md b/prompts/skills/f3s/references/freebsd-setup.md
index 343af0a..796520e 100644
--- a/prompts/skills/f3s/references/freebsd-setup.md
+++ b/prompts/skills/f3s/references/freebsd-setup.md
@@ -162,6 +162,27 @@ local snap jobs and remain owned by their push jobs.
Config at `/usr/local/mimecast/etc/uptimed.conf` — `LOG_MAXIMUM_ENTRIES=0` (keep all records forever).
Check with `uprecords`.
+## Kernel Modules
+
+### coretemp (CPU temperature)
+
+`coretemp` provides real per-core die temps via Intel DTS. It is **not** loaded by default. Without it, `hw.acpi.thermal.tz0` is the only temperature source — it is often a constant lie (e.g. always 27.9°C) and cannot be used for thermal alerting.
+
+All f-hosts have `coretemp_load="YES"` in `/boot/loader.conf` (added 2026-05-17). This persists the module across reboots so Prometheus node_exporter can scrape `dev.cpu.N.temperature` automatically.
+
+To check:
+```sh
+sysctl dev.cpu | grep temperature # per-core die temps (requires coretemp)
+kldstat | grep coretemp # verify module is loaded
+grep coretemp /boot/loader.conf # verify persistence
+```
+
+To add manually if missing:
+```sh
+doas kldload coretemp
+echo 'coretemp_load="YES"' | doas tee -a /boot/loader.conf
+```
+
## Shell
Default shell is `tcsh` (FreeBSD default). Run `rehash` after installing new packages for tcsh to find them.
diff --git a/prompts/skills/f3s/references/hardware.md b/prompts/skills/f3s/references/hardware.md
index fa7252e..67fec18 100644
--- a/prompts/skills/f3s/references/hardware.md
+++ b/prompts/skills/f3s/references/hardware.md
@@ -45,6 +45,7 @@ BIOS requirements for WoL: enable "Wake on LAN", disable "ERP Support", enable "
| f0 | 192.168.1.130 | f0.lan.buetow.org |
| f1 | 192.168.1.131 | f1.lan.buetow.org |
| f2 | 192.168.1.132 | f2.lan.buetow.org |
+| f3 | 192.168.1.133 | f3.lan.buetow.org |
Static IPs configured at FreeBSD install time. Also in `/etc/hosts` on all nodes.
diff --git a/prompts/skills/f3s/references/observability.md b/prompts/skills/f3s/references/observability.md
index 0cfa7a9..4407b1b 100644
--- a/prompts/skills/f3s/references/observability.md
+++ b/prompts/skills/f3s/references/observability.md
@@ -2,20 +2,25 @@
## Overview
-Complete observability stack deployed into the `monitoring` namespace of the k3s cluster.
+Observability stack deployed into the `monitoring` namespace of the k3s cluster.
-Stack: **PLG + Tempo** (Prometheus, Loki, Grafana + Tempo for distributed tracing)
+**Current state (as of 2026-05-16)**: Prometheus + Alloy only. Grafana, Loki, and Tempo are **disabled** — their ArgoCD manifests are renamed to `.disabled` and their pods do not run.
+
+- Grafana disabled: SQLite-on-NFS is fundamentally unreliable across pod restarts. Grafana's database gets locked when the pod reschedules to a different node. Long-term fix: migrate to local-path PVC (same pattern as navidrome).
+- Loki/Tempo disabled: no log aggregation or distributed tracing until Grafana is re-enabled.
+- Alloy is running but **only emits its own logs** (`logging { level = "info" }`). Log shipping to Loki and trace forwarding to Tempo are removed from its config.
+- Prometheus TSDB was wiped and restarted clean (2026-05-16) after WAL corruption (zero-byte segments from a cluster blip).
## Components
-| Component | Purpose |
-|-----------|---------|
-| **Prometheus** | Time-series metrics, alerting rules, Alertmanager |
-| **Grafana** | Visualisation and dashboarding |
-| **Loki** | Log aggregation (single-binary mode) |
-| **Alloy** | Telemetry collector (DaemonSet) — ships logs to Loki, traces to Tempo |
-| **Tempo** | Distributed tracing backend |
-| **Node Exporter** | Host-level metrics (on k3s nodes AND FreeBSD hosts) |
+| Component | Purpose | State |
+|-----------|---------|-------|
+| **Prometheus** | Time-series metrics, alerting rules, Alertmanager | **Running** |
+| **Alloy** | Telemetry collector (DaemonSet) | **Running** (minimal config only) |
+| **Node Exporter** | Host-level metrics (on k3s nodes AND FreeBSD hosts) | **Running** |
+| **Grafana** | Visualisation and dashboarding | **Disabled** (SQLite-on-NFS) |
+| **Loki** | Log aggregation (single-binary mode) | **Disabled** |
+| **Tempo** | Distributed tracing backend | **Disabled** |
## Deployment
@@ -33,17 +38,27 @@ Deployment tool: `just` (Justfile in each component directory).
kubectl create namespace monitoring
```
-## Installing Prometheus + Grafana
+### Disabled component manifests
+
+These files exist in the repo but are renamed `.disabled` so ArgoCD ignores them:
+```
+f3s/argocd-apps/monitoring/loki.yaml.disabled
+f3s/argocd-apps/monitoring/tempo.yaml.disabled
+f3s/argocd-apps/monitoring/grafana-ingress.yaml.disabled
+```
-Uses `kube-prometheus-stack` Helm chart:
+To re-enable, rename back to `.yaml` and ensure Grafana is using a non-NFS PVC (local-path).
+
+## Installing Prometheus
+
+Uses `kube-prometheus-stack` Helm chart with **Grafana subchart disabled** (`grafana.enabled: false`):
```sh
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
-# Create NFS storage directories first
+# Create NFS storage directory first
mkdir -p /data/nfs/k3svolumes/prometheus/data
-mkdir -p /data/nfs/k3svolumes/grafana/data
cd conf/f3s/prometheus && just install
```
@@ -80,54 +95,45 @@ Default: `admin` / `prom-operator` — change immediately after first login.
Grafana accessible at `grafana.f3s.foo.zone` via Traefik ingress.
-## Installing Loki + Alloy
+## Installing Alloy (minimal)
+
+Alloy is installed as part of the Loki Helm chart but runs with a minimal config (no log shipping):
```sh
-mkdir -p /data/nfs/k3svolumes/loki/data
cd conf/f3s/loki && just install
-# installs both loki and alloy
+# installs alloy only (loki itself is disabled via loki.yaml.disabled)
```
-Loki URL (internal): `http://loki.monitoring.svc.cluster.local:3100`
-
-Add Loki as Grafana data source: Configuration → Data Sources → Loki → URL above.
+### Current Alloy config (`alloy-values.yaml`)
-### Alloy configuration (`alloy-values.yaml`)
+Minimal — only emits Alloy's own operational logs:
```
-discovery.kubernetes "pods" {
- role = "pod"
+logging {
+ level = "info"
}
+```
-discovery.relabel "pods" {
- targets = discovery.kubernetes.pods.targets
- rule { source_labels = ["__meta_kubernetes_namespace"]; target_label = "namespace" }
- rule { source_labels = ["__meta_kubernetes_pod_name"]; target_label = "pod" }
- rule { source_labels = ["__meta_kubernetes_pod_container_name"]; target_label = "container" }
- rule { source_labels = ["__meta_kubernetes_pod_label_app"]; target_label = "app" }
-}
+To re-enable log shipping (once Loki is running again), restore the full `discovery.kubernetes` + `loki.source.kubernetes` + `loki.write` pipeline.
-loki.source.kubernetes "pods" {
- targets = discovery.relabel.pods.output
- forward_to = [loki.write.default.receiver]
-}
+## Installing Loki (disabled)
-loki.write "default" {
- endpoint {
- url = "http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push"
- }
-}
+```sh
+mkdir -p /data/nfs/k3svolumes/loki/data
+# Rename loki.yaml.disabled → loki.yaml first, then:
+cd conf/f3s/loki && just install
```
-## Installing Tempo
+Loki URL (internal): `http://loki.monitoring.svc.cluster.local:3100`
+
+## Installing Tempo (disabled)
```sh
mkdir -p /data/nfs/k3svolumes/tempo/data
+# Rename tempo.yaml.disabled → tempo.yaml first, then:
cd conf/f3s/tempo && just install
```
-Add Tempo as Grafana data source: Grafana → Configuration → Data Sources → Tempo.
-
## Monitoring FreeBSD Hosts (f0, f1, f2)
### Install node_exporter on FreeBSD
@@ -247,8 +253,22 @@ Gogios scrapes Alertmanager at regular intervals and sends email notifications.
- Node-level metrics (CPU, memory, disk) — both k3s and FreeBSD nodes
- ZFS ARC statistics on FreeBSD hosts
- Application performance metrics
-- Log aggregation from all pods (via Alloy → Loki)
-- Distributed traces (via Alloy → Tempo)
+- ~~Log aggregation from all pods (via Alloy → Loki)~~ — disabled
+- ~~Distributed traces (via Alloy → Tempo)~~ — disabled
+
+## Prometheus TSDB Recovery
+
+If Prometheus fails to start with `opening storage failed: get segment range: segments are not sequential`, WAL segments are corrupt (can happen after a cluster blip leaving zero-byte WAL files).
+
+Full TSDB wipe (loses all historical data — confirm first):
+
+```sh
+# On the NFS server (f0 or CARP MASTER)
+rm -rf /data/nfs/k3svolumes/prometheus/data/prometheus-db/
+mkdir -p /data/nfs/k3svolumes/prometheus/data/prometheus-db
+chown 1000:1000 /data/nfs/k3svolumes/prometheus/data/prometheus-db
+# Prometheus will recreate the TSDB on next start
+```
## Useful LogQL Queries
@@ -266,8 +286,8 @@ Gogios scrapes Alertmanager at regular intervals and sends email notifications.
## NFS Storage Paths for Observability
```
-/data/nfs/k3svolumes/prometheus/data
-/data/nfs/k3svolumes/grafana/data
-/data/nfs/k3svolumes/loki/data
-/data/nfs/k3svolumes/tempo/data
+/data/nfs/k3svolumes/prometheus/data # active
+/data/nfs/k3svolumes/grafana/data # exists but unused (grafana disabled)
+/data/nfs/k3svolumes/loki/data # exists but unused (loki disabled)
+/data/nfs/k3svolumes/tempo/data # exists but unused (tempo disabled)
```
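
Because the TSDB wipe in the recovery runbook discards all historical data, it may be worth confirming first that zero-byte WAL segments are actually present. A small sketch, assuming the WAL sits in a `wal/` subdirectory of the TSDB directory (Prometheus's usual layout); the function name `list_empty_wal_segments` is hypothetical:

```sh
# Hypothetical helper: print every zero-byte regular file directly inside
# the given WAL directory, one path per line, in sorted order.
list_empty_wal_segments() {
    wal_dir=$1
    # -maxdepth 1 skips checkpoint subdirectories; -size 0 matches empty files
    find "$wal_dir" -maxdepth 1 -type f -size 0 | sort
}
```

On the NFS server this might be run as `list_empty_wal_segments /data/nfs/k3svolumes/prometheus/data/prometheus-db/wal` (the `wal` suffix is an assumption about this deployment). Empty output suggests the startup failure has a different cause than zero-byte segments.
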