docs(f3s): add Thermal Troubleshooting section to storage.md

Documents the 2026-05-16 f0 cascade failure (thermal throttling + autotrim=off + aggressive zrepl interval → stuck TRIM, multi-second txg syncs, D-state rsync). Covers symptoms, per-core temp checks via coretemp vs unreliable hw.acpi.thermal.tz0, Beelink S12 Pro thermal specifics, and step-by-step remediation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
author: Paul Buetow <paul@buetow.org> 2026-05-17 08:43:38 +0300
committer: Paul Buetow <paul@buetow.org> 2026-05-17 08:43:38 +0300
commit: 8d94f7982b63a1f7c971d9788c177f374abec102 (patch)
tree: b146c947af11945b0aedf7b5a15e4f326459e8c5
parent: 0027a8fb123721e15fcc7eb7252b8b0f3b54456c (diff)
1 files changed, 226 insertions, 3 deletions
diff --git a/prompts/skills/f3s/references/storage.md b/prompts/skills/f3s/references/storage.md
index 2c9cca6..0658dd0 100644
--- a/prompts/skills/f3s/references/storage.md
+++ b/prompts/skills/f3s/references/storage.md
@@ -22,6 +22,55 @@ On f0 and f1, create the zdata pool on the second SSD:
 doas zpool create zdata ada1   # ada1 = second SSD
 ```
 
+## SSD TRIM Configuration
+
+All f-hosts run on consumer SATA SSDs without power-loss protection
+(SanDisk Ultra 3D, Samsung 870 EVO, Crucial BX500). Without TRIM, the
+SSD controller can't reclaim freed pages and write amplification
+explodes — observed on f0 (2026-05-16) as txg sync times of 5-14
+seconds (should be <100 ms) and per-op latency of 374 ms (should be
+<5 ms on an SSD). The encrypted dataset makes this worse because
+AES-256-GCM ciphertext is full-entropy and the controller can't
+opportunistically reclaim space.
+
+Enable `autotrim` on every pool on every f-host (`zdata` and `zroot`
+on f0/f1/f2; `zroot` only on f3):
+
+```sh
+# Persisted in pool metadata — survives reboot
+for pool in $(zpool list -H -o name); do
+  doas zpool set autotrim=on "$pool"
+done
+```
+
+After turning autotrim on for the first time (or on a pool that has
+never been trimmed), run a one-shot pool-wide TRIM to catch up on all
+the historical free space the controller has been managing blind:
+
+```sh
+for pool in $(zpool list -H -o name); do
+  doas zpool trim "$pool"        # async; monitor with `zpool status -t`
+done
+```
+
+Caveat: `zpool trim` runs at low ZFS priority. On a heavily-loaded
+disk (active rsync, frequent zrepl snapshots, bhyve VM under load) it
+can stall at 0% indefinitely because regular I/O never drains.
+Quietening the workload first (kill rsync, raise zrepl `interval` from
+`1m` to `15m`+, pause/cancel scrub) lets TRIM make progress; once
+caught up, autotrim keeps it steady-state in the background.
+
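To tell a stalled TRIM from a slow one, the per-device progress can be pulled out of `zpool status -t` and compared between two samples. The sketch below assumes the `(NN% trimmed, ...)` annotation that OpenZFS prints next to each vdev; the function name `trim_progress` is made up for illustration.

```sh
# Hypothetical helper: print "<device> <percent>" for every vdev that
# reports TRIM progress. Reads `zpool status -t` output on stdin.
# Assumes the "(NN% trimmed, ...)" annotation format.
trim_progress() {
  awk 'match($0, /\([0-9]+% trimmed/) {
    # Matched text is "(" + digits + "% trimmed"; keep only the digits.
    print $1, substr($0, RSTART + 1, RLENGTH - 10)
  }'
}

# Usage on an f-host:
#   doas zpool status -t zdata | trim_progress
```

If the percentage is unchanged across samples taken a few minutes apart while the disk stays busy, the TRIM is probably being starved by regular I/O as described in the caveat.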
+Verify across the fleet:
+
+```sh
+for h in f0 f1 f2 f3; do
+  printf '%-3s ' "$h"
+  ssh "$h" "sh -c 'for p in \$(zpool list -H -o name); do \
+    printf \"%s=%s \" \"\$p\" \"\$(zpool get -H -o value autotrim \$p)\"; \
+    done; echo'"
+done
+```
+
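The fleet loop prints one line per host in the form `<host> <pool>=<value> ...`. A minimal sketch of a filter that reduces this to the pools still needing attention; `autotrim_gaps` is a hypothetical helper name, and it assumes exactly the line format produced by the loop.

```sh
# Hypothetical helper: read "<host> <pool>=<value> ..." lines on stdin and
# print "<host> <pool>" for every pool whose autotrim value is not "on".
autotrim_gaps() {
  awk '{
    for (i = 2; i <= NF; i++) {
      split($i, kv, "=")
      if (kv[2] != "on") print $1, kv[1]
    }
  }'
}

# Usage: save the fleet loop output to a file, then
#   autotrim_gaps < fleet-autotrim.txt
```

Empty output means every pool on every reachable host reports `autotrim=on`. The file name `fleet-autotrim.txt` is only a placeholder.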
 ## ZFS Encryption Keys (USB Key Storage)
 
 Encryption keys are stored on USB flash drives (UFS-formatted, mounted at `/keys`).
@@ -292,6 +341,24 @@ nc -zv 192.168.2.131 8888
 doas zrepl status --mode raw | grep BytesReplicated
 ```
 
+**zrepl DL-state on f1 after mid-replication f0 reboot**: if f0 reboots while zrepl is
+actively replicating, f1's `[zfskern]` thread can enter **DL state** (disk + locked).
+Symptoms: `zpool list`, `zfs list`, `ls /data/nfs/` all hang indefinitely; `zfs set
+readonly=off` may return immediately (the kernel path differs). To recover on f1:
+
+```sh
+# Stop zrepl to release the replication lock
+doas service zrepl stop
+
+# Wait ~30–60 s for the kernel state to drain; then verify
+doas zpool list
+doas zfs list
+doas service zrepl start
+```
+
+If ZFS commands still hang after stopping zrepl, a reboot of f1 is required.
+The NFS data is still available on f0 so k3s is unaffected during f1 recovery.
+
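Because a plain `zpool list` can block forever in this state, a health check is safer when it runs the command in the background and gives up after a deadline instead of waiting on it. A minimal sketch; `probe_with_deadline` and the 124 return code are illustrative choices, not part of zrepl or the existing scripts.

```sh
# Hypothetical helper: run a command in the background and wait at most
# $1 seconds for it. Returns the command's exit status, or 124 if it did
# not finish in time. A stuck process is left behind on purpose, because
# a process blocked in the kernel cannot be killed anyway.
probe_with_deadline() {
  deadline=$1
  shift
  marker=$(mktemp) || return 2
  # The subshell records the exit status in the marker file when done.
  ( rc=0; "$@" >/dev/null 2>&1 || rc=$?; echo "$rc" >"$marker" ) &
  waited=0
  while :; do
    if [ -s "$marker" ]; then
      rc=$(cat "$marker")
      rm -f "$marker"
      return "$rc"
    fi
    if [ "$waited" -ge "$deadline" ]; then
      return 124
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Usage on f1:
#   probe_with_deadline 10 zpool list || echo "zpool list hung or failed"
```

Polling a marker file rather than calling `wait` keeps the calling shell responsive even when the probed command never returns.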
 ## CARP: High-Availability VIP
 
 CARP (Common Address Redundancy Protocol) provides **VIP 192.168.1.138** that floats between f0 (primary) and f1 (standby).
@@ -389,6 +456,21 @@ doas carp auto-failback disable   # prevent auto-failback (for maintenance)
 doas carp auto-failback enable    # re-enable auto-failback
 ```
 
+### CARP failover limitation when ZFS is suspended
+
+If f0's ZFS pool is SUSPENDED but f0's OS is still running, f0 remains CARP MASTER
+(it keeps sending CARP advertisements). Attempts to manually demote f0 via:
+
+```sh
+doas carp backup                            # may return exit=0 but has no effect
+doas ifconfig re0 vhid 1 state backup       # may return exit=1 silently
+doas ifconfig re0 vhid 1 advskew 254        # may return exit=1 silently
+```
+
+…can all silently fail because the kernel has too many stuck IO threads blocking
+the ifconfig ioctl path. The CARP VIP will **not** float to f1 in this case.
+**Only a hard power cycle of f0 reliably triggers CARP failover.**
+
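Since the exit codes above cannot be trusted, the dependable confirmation is to read the state back from the interface. A minimal sketch, assuming the `carp: <STATE> vhid <N> ...` line that FreeBSD's `ifconfig` prints for each virtual host ID; `carp_state` is a made-up helper name.

```sh
# Hypothetical helper: read `ifconfig <iface>` output on stdin and print
# the CARP state (MASTER, BACKUP, INIT) reported for the vhid given as $1.
carp_state() {
  awk -v vhid="$1" '$1 == "carp:" {
    for (i = 3; i < NF; i++)
      if ($i == "vhid" && $(i + 1) == vhid) print $2
  }'
}

# Usage on f0 after a demotion attempt:
#   ifconfig re0 | carp_state 1
```

If this still prints `MASTER` after a demotion attempt, the demotion did not take effect. If `ifconfig` itself blocks, that is consistent with the stuck ioctl path described above.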
 ### Auto-failback from f1 to f0
 
 Script `/usr/local/bin/carp-auto-failback.sh` runs every minute via cron on f0. Checks: currently BACKUP? `/data/nfs` mounted? Marker file exists? Failback not blocked? If all conditions met, promotes f0 to MASTER.
@@ -570,6 +652,114 @@ After NFS is restored on the server side, the `nfs-mount-monitor` systemd timer
 
 **Note:** The monitor catches three failure modes: missing mountpoint, stat hang (reads unresponsive), and **silent write hang** (reads OK but writes block — the hardest case, e.g. stunnel-wrapped NFSv4 after a CARP failover). Watch the consecutive-failure counter via Prometheus (`nfs_mount_monitor_consecutive_failures`) — warning fires at ≥3, critical at ≥5. At 5 consecutive failures the node cordons itself and reboots.
 
+### ZFS pool SUSPENDED recovery
+
+**Symptoms**: `doas zpool status zdata` shows `state: SUSPENDED`. All IO to the pool is
+halted — ZFS suspends itself to prevent corruption when IO errors exceed the threshold.
+Commands like `zpool clear`, `zpool scrub`, `zpool offline`, and even `ls /data/nfs/` hang
+indefinitely because they wait for kernel IO that will never complete.
+
+**Known cause (2026-05-15)**: Samsung 870 EVO 1TB on f0 (ada1) hit 107 read errors and
+105M+ write errors during normal operation — likely thermal throttling or a momentary
+SATA connection loss. A previous resilver on 2026-01-27 suggests the drive has been
+marginal for months.
+
+**Recovery — hard power cycle only**:
+- Do NOT attempt `doas shutdown -r now` — if ZFS is suspended, the graceful shutdown hangs
+  at ZFS pool export and may stay stuck for 30–60+ minutes.
+- Do NOT attempt `doas zpool clear zdata` — it hangs because ada1 is unresponsive.
+- Do NOT attempt `doas ifconfig re0 vhid 1 state backup` or `doas carp backup` to fail
+  over to f1 first — these ifconfig ioctls can also be blocked when the kernel has too
+  many stuck IO threads. They may return exit=1 silently.
+- **Hard power cycle** (pull power or hold the power button) resolves the issue in ~9 s
+  (Rocky Linux VMs come up automatically, ZFS pool imports cleanly on next boot).
+
+**Post-recovery**:
+```sh
+# 1. Verify pool health
+doas zpool status zdata          # should show ONLINE, 0 errors
+
+# 2. Check SMART for drive health
+doas smartctl -a /dev/ada1 | grep -iE '(temperature|reallocated|pending|uncorrectable|error)'
+
+# 3. Start a scrub to verify data integrity
+doas zpool scrub zdata
+doas zpool status zdata          # monitor; "scrub repaired 0 in ..." means data intact
+
+# 4. Verify NFS is serving (stunnel listening on CARP VIP)
+doas sockstat -l | grep 2323
+```
+
+**After cluster recovery**:
+- Check for cordoned nodes: `kubectl get nodes` — if r0/r1/r2 show `SchedulingDisabled`,
+  uncordon them (see nfs-mount-monitor escalation section above).
+- Reset fail counters on all r-nodes: `echo 0 > /var/lib/nfs-mount-monitor/fail-count`
+
+**Temperature monitoring** to detect thermal issues before they cause pool suspension:
+```sh
+# FreeBSD: load coretemp for CPU package temperature
+doas kldload coretemp
+sysctl -a | grep temperature                      # hw.acpi.thermal.*: and dev.cpu.*:
+# Persist across reboots (sysrc is idempotent; `tee -a` would append a duplicate on re-run)
+doas sysrc -f /boot/loader.conf coretemp_load=YES
+
+# SSD temperature (install smartmontools if absent)
+doas pkg install -y smartmontools
+doas smartctl -a /dev/ada1 | grep -i temperature  # "194 Temperature_Celsius"
+```
+
+## Thermal Troubleshooting
+
+### Symptoms of thermal throttling on f-hosts
+
+- SSD I/O slowness (writes dropping from MB/s to KB/s)
+- ZFS txg sync times jumping from <100ms to 5-37 seconds
+- `zpool trim` stuck at 0% or paused indefinitely
+- rsync / zrepl jobs going into D-state (waiting on ZFS I/O)
+- High system CPU (80%+) from encryption overhead (ZFS native AES-256-GCM)
+
+### How to check temperatures
+
+- **coretemp (real per-core die temps)**: `doas kldload -n coretemp; sysctl dev.cpu | grep temperature`
+  - Should now auto-load via `/boot/loader.conf` (`coretemp_load="YES"`)
+- **hw.acpi.thermal.tz0**: Often a constant lie (e.g. always 27.9°C) — do NOT rely on it
+- **SSD temperature**: `smartctl -a /dev/adaN` (requires smartmontools; may not be installed)
+- **Disk I/O performance**: `gstat -bp -I 1s -d` (FreeBSD gstat, not Linux iostat)
+- **ZFS txg sync times**: `zpool events | grep -i sync` or check via `zpool status -v`
+
+### Beelink S12 Pro specifics
+
+- Small enclosure with passive/minimal cooling — heat accumulates fast under sustained load
+- N100 CPU: normal idle ~40-55°C, warn >70°C idle, critical >85°C under load
+- NVMe sits close to CPU — both heat each other in the small chassis
+- Enclosure gets hot to the touch before temps fully register in software
+
+### Cascade failure pattern (2026-05-16 f0 incident)
+
+The following cascade was observed:
+
+1. Hot enclosure (NVMe physically very hot) → SSD thermal throttling
+2. Concurrent rsync + 1-min zrepl snapshots + paused scrub → high I/O demand
+3. autotrim=off (never trimmed) → SSD write amplification → further slowdown
+4. ZFS native AES-256-GCM encryption → high CPU per I/O → txg sync times 5-37s
+5. TRIM stuck at 0% for hours (couldn't make progress under continuous I/O load)
+6. rsync went into D-state waiting on ZFS → appeared "hung"
+
+**Root causes**: (a) autotrim=off (SSD never trimmed); (b) hot enclosure + thermal throttling;
+(c) zrepl snapshot interval too aggressive (1m).
+
+**Resolution**: Reseat/inspect drive + enclosure. After hardware fix, autotrim=on enabled,
+manual TRIM ran to completion at ~2.4 GB/s. See "SSD TRIM Configuration" section.
+
+### Remediation steps
+
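+1. SSH in and check temps: `doas kldload -n coretemp && sysctl dev.cpu | grep temperature` (`-n` skips the load when the module is already present; a plain `kldload` would fail there and the `&&` chain would never reach `sysctl`)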
+2. If >80°C: stop heavy I/O workloads immediately (`service zrepl stop`, cancel scrubs)
+3. Physical: shut down, reseat NVMe, clean dust from vents, improve airflow
+4. After hardware fix: enable autotrim (`zpool set autotrim=on <pool>`) and run `zpool trim <pool>`
+5. Monitor trim progress: `zpool status -t <pool>` (`-t` adds the per-vdev TRIM percentage)
+6. Persist coretemp: ensure `/boot/loader.conf` has `coretemp_load="YES"` (see task 95)
+
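The temperature check in step 1 can be turned into a threshold test. A minimal sketch, assuming the `dev.cpu.N.temperature: 45.0C` line format that coretemp exposes through `sysctl`; `hot_cores` is a hypothetical helper, and the limit of 80 mirrors step 2.

```sh
# Hypothetical helper: read `sysctl dev.cpu` output on stdin and print
# "<oid> <degrees>" for every core hotter than the limit given as $1.
hot_cores() {
  awk -v limit="$1" -F': ' '$1 ~ /\.temperature$/ {
    t = $2
    sub(/C$/, "", t)               # strip the trailing unit
    if (t + 0 > limit + 0) print $1, t
  }'
}

# Usage on an f-host:
#   sysctl dev.cpu | hot_cores 80
```

Any output means at least one core is above the limit, which is the cue to stop heavy I/O before touching the hardware.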
 ### Checklist for NFS outage on CARP MASTER (f0 or f1)
 
 ```sh
@@ -631,7 +821,11 @@ If any probe fails, `fix_mount` runs:
 3. `umount -f` (force unmount)
 4. `umount -l` (lazy detach VFS node if `-f` failed)
 5. `systemctl restart stunnel` + 2s sleep (refresh the TLS transport)
-6. `mount` (fresh mount via stunnel)
+6. `mount -t nfs4 -o port=2323,soft,timeo=50,retrans=3` (explicit soft NFS mount — NOT
+   `mount $MOUNT_POINT` which reads fstab's `hard` flag and enters uninterruptible D-state
+   if the server is unreachable; SIGKILL cannot wake a D-state process on Linux;
+   `soft,timeo=50,retrans=3` returns ETIMEDOUT after ~15 s so the fail counter can
+   increment and eventually trigger the reboot escalation)
 
 A hard **60-second deadline** prevents `fix_mount` from outlasting its own timer interval.
 
@@ -641,7 +835,13 @@ Unknown / Pending / ContainerCreating so the kubelet can reschedule them.
 **Consecutive-failure escalation**: each `fix_mount` failure increments a counter
 persisted to `/var/lib/nfs-mount-monitor/fail-count`. At `NFS_FAIL_THRESHOLD=5`
 consecutive failures (~50 s), the node cordons itself (`kubectl cordon`) and issues
-`systemctl reboot`.
+`systemctl reboot`. The cordon is stored in etcd and **persists across reboots** —
+after the underlying NFS issue is resolved, manually uncordon each affected node:
+```sh
+kubectl uncordon r0.lan.buetow.org
+kubectl uncordon r1.lan.buetow.org
+kubectl uncordon r2.lan.buetow.org
+```
 
 The counter is also exported to `/var/lib/node_exporter/textfile_collector/nfs_mount_monitor.prom`
 so Prometheus can alert on `nfs_mount_monitor_consecutive_failures` without parsing
@@ -649,7 +849,11 @@ journal logs (warning ≥3, critical ≥5 — see
 `f3s/prometheus/manifests/nfs-mount-monitor-alerts.yaml`).
 
 Uses a lock file (`/var/run/nfs-mount-check.lock`) to prevent overlapping runs
-since the timer fires faster than the script's worst-case runtime.
+since the timer fires faster than the script's worst-case runtime. If the lock is
+older than **90 seconds** it was left by a run that was SIGKILLed before its EXIT
+trap could clean up (systemd kills with SIGKILL after its own timeout, bypassing
+`trap "rm -f $LOCK_FILE" EXIT`); the stale lock is removed and the run continues,
+preventing all health checks from being silently skipped forever.
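
The stale-lock rule can be sketched as follows. This is an illustration, not the monitor's actual code: `lock_is_stale` is a hypothetical name, and it assumes GNU `stat -c %Y`, which the Linux r-nodes should provide.

```sh
# Hypothetical helper, not the monitor's actual code.
# $1 = lock file, $2 = maximum age in seconds.
# Returns 0 if the lock exists and is older than the limit, 1 otherwise.
lock_is_stale() {
  lock=$1
  max_age=$2
  [ -e "$lock" ] || return 1
  mtime=$(stat -c %Y "$lock") || return 1   # modification time, epoch seconds
  now=$(date +%s)
  [ $((now - mtime)) -gt "$max_age" ]
}

# Usage, mirroring the 90-second rule:
#   if lock_is_stale /var/run/nfs-mount-check.lock 90; then
#     rm -f /var/run/nfs-mount-check.lock
#   fi
```

Using the file's modification time means a SIGKILLed run needs no cleanup of its own; the next run decides from the age of the lock alone.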
 
 ### Timer configuration
 
@@ -659,6 +863,25 @@ since the timer fires faster than the script's worst-case runtime.
 | `OnUnitActiveSec` | 10s | Check interval; each run is bounded by a 60-second deadline |
 | `AccuracySec` | 1s | Prevent systemd batching from delaying the 10 s interval |
 
+### Managing the monitor during an extended NFS outage
+
+During a prolonged NFS outage (e.g. while the storage host is being power-cycled or
+repaired), stop the timer on affected r-nodes to prevent the escalation counter from
+reaching the auto-reboot threshold prematurely:
+
+```sh
+# On each affected r-node (as root)
+systemctl stop nfs-mount-monitor.timer
+echo 0 > /var/lib/nfs-mount-monitor/fail-count   # reset counter
+
+# After NFS is restored, restart and verify
+systemctl start nfs-mount-monitor.timer
+journalctl -u nfs-mount-monitor -f
+```
+
+Also reset the counter to 0 after uncordoning nodes (see escalation section above),
+because the old counter value would lower the effective threshold for the next outage.
+
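The escalation counter is simple bookkeeping: a failed run increments it, a successful run resets it (which is what "consecutive" implies), and reaching the threshold triggers the cordon and reboot. A minimal sketch, assuming a plain integer in the counter file; `bump_fail_count` is a hypothetical helper and not the monitor's actual implementation.

```sh
# Hypothetical helper: update a consecutive-failure counter kept in a file.
# $1 = counter file, $2 = "ok" or "fail", $3 = threshold.
# Prints the new count; returns 0 while below the threshold, 1 once reached.
bump_fail_count() {
  file=$1
  result=$2
  threshold=$3
  count=0
  if [ -r "$file" ]; then
    read -r count <"$file" || count=0
  fi
  # Treat an empty or non-numeric file as zero.
  case $count in
    ''|*[!0-9]*) count=0 ;;
  esac
  case $result in
    ok) count=0 ;;
    *)  count=$((count + 1)) ;;
  esac
  echo "$count" >"$file"
  echo "$count"
  [ "$count" -lt "$threshold" ]
}

# Usage (threshold 5, as in NFS_FAIL_THRESHOLD):
#   bump_fail_count /var/lib/nfs-mount-monitor/fail-count fail 5 || echo "escalate"
```

The sketch also shows why a leftover value matters: with the file still at 3 from a previous outage, only two further failures reach the threshold of 5.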
 ### Status and logs
 
 ```sh