| author | Paul Buetow <paul@buetow.org> | 2026-05-17 08:43:38 +0300 |
|---|---|---|
| committer | Paul Buetow <paul@buetow.org> | 2026-05-17 08:43:38 +0300 |
| commit | 8d94f7982b63a1f7c971d9788c177f374abec102 (patch) | |
| tree | b146c947af11945b0aedf7b5a15e4f326459e8c5 | |
| parent | 0027a8fb123721e15fcc7eb7252b8b0f3b54456c (diff) | |
docs(f3s): add Thermal Troubleshooting section to storage.md
Documents the 2026-05-16 f0 cascade failure (thermal throttling + autotrim=off
+ aggressive zrepl interval → stuck TRIM, multi-second txg syncs, D-state rsync).
Covers symptoms, per-core temp checks via coretemp vs unreliable hw.acpi.thermal.tz0,
Beelink S12 Pro thermal specifics, and step-by-step remediation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| -rw-r--r-- | prompts/skills/f3s/references/storage.md | 229 |
1 files changed, 226 insertions, 3 deletions
diff --git a/prompts/skills/f3s/references/storage.md b/prompts/skills/f3s/references/storage.md
index 2c9cca6..0658dd0 100644
--- a/prompts/skills/f3s/references/storage.md
+++ b/prompts/skills/f3s/references/storage.md
@@ -22,6 +22,55 @@ On f0 and f1, create the zdata pool on the second SSD:
 doas zpool create zdata ada1 # ada1 = second SSD
 ```
 
+## SSD TRIM Configuration
+
+All f-hosts run on consumer SATA SSDs without power-loss protection
+(SanDisk Ultra 3D, Samsung 870 EVO, Crucial BX500). Without TRIM, the
+SSD controller can't reclaim freed pages and write amplification
+explodes — observed on f0 (2026-05-16) as txg sync times of 5-14
+seconds (should be <100 ms) and per-op latency of 374 ms (should be
+<5 ms on an SSD). The encrypted dataset makes this worse because
+AES-256-GCM ciphertext is full-entropy and the controller can't
+opportunistically reclaim space.
+
+Enable `autotrim` on every pool on every f-host (`zdata` and `zroot`
+on f0/f1/f2; `zroot` only on f3):
+
+```sh
+# Persisted in pool metadata — survives reboot
+for pool in $(zpool list -H -o name); do
+  doas zpool set autotrim=on "$pool"
+done
+```
+
+After turning autotrim on for the first time (or on a pool that has
+never been trimmed), run a one-shot pool-wide TRIM to catch up on all
+the historical free space the controller has been managing blind:
+
+```sh
+for pool in $(zpool list -H -o name); do
+  doas zpool trim "$pool" # async; monitor with `zpool status -t`
+done
+```
+
+Caveat: `zpool trim` runs at low ZFS priority. On a heavily-loaded
+disk (active rsync, frequent zrepl snapshots, bhyve VM under load) it
+can stall at 0% indefinitely because regular I/O never drains.
+Quietening the workload first (kill rsync, raise zrepl `interval` from
+`1m` to `15m`+, pause/cancel scrub) lets TRIM make progress; once
+caught up, autotrim keeps it steady-state in the background.
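Review note (not part of the commit): the autotrim requirement in this hunk could also be checked with a small filter that lists only the pools still missing the setting. The sketch below assumes the tab-separated output of `zpool get -H -o name,value autotrim`; the function name `pools_without_autotrim` is hypothetical and does not appear in storage.md.

```sh
# Hypothetical helper (not in storage.md): reads "pool<TAB>value" lines on
# stdin, as printed by `zpool get -H -o name,value autotrim`, and prints
# every pool whose autotrim value is anything other than "on".
pools_without_autotrim() {
  awk -F '\t' '$2 != "on" { print $1 }'
}

# Example use on an f-host (prints nothing when every pool is configured):
#   zpool get -H -o name,value autotrim | pools_without_autotrim
```

An empty result means every pool on the host already has `autotrim=on`; any printed name is a pool that still needs `zpool set autotrim=on`.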
+
+Verify across the fleet:
+
+```sh
+for h in f0 f1 f2 f3; do
+  printf '%-3s ' "$h"
+  ssh "$h" "sh -c 'for p in \$(zpool list -H -o name); do \
+    printf \"%s=%s \" \"\$p\" \"\$(zpool get -H -o value autotrim \$p)\"; \
+  done; echo'"
+done
+```
+
 ## ZFS Encryption Keys (USB Key Storage)
 
 Encryption keys are stored on USB flash drives (UFS-formatted, mounted at `/keys`).
@@ -292,6 +341,24 @@ nc -zv 192.168.2.131 8888
 doas zrepl status --mode raw | grep BytesReplicated
 ```
 
+**zrepl DL-state on f1 after mid-replication f0 reboot**: if f0 reboots while zrepl is
+actively replicating, f1's `[zfskern]` thread can enter **DL state** (disk + locked).
+Symptoms: `zpool list`, `zfs list`, `ls /data/nfs/` all hang indefinitely; `zfs set
+readonly=off` may return immediately (the kernel path differs). To recover on f1:
+
+```sh
+# Stop zrepl to release the replication lock
+doas service zrepl stop
+
+# Wait ~30–60 s for the kernel state to drain; then verify
+doas zpool list
+doas zfs list
+doas service zrepl start
+```
+
+If ZFS commands still hang after stopping zrepl, a reboot of f1 is required.
+The NFS data is still available on f0 so k3s is unaffected during f1 recovery.
+
 ## CARP: High-Availability VIP
 
 CARP (Common Address Redundancy Protocol) provides **VIP 192.168.1.138** that floats between f0 (primary) and f1 (standby).
@@ -389,6 +456,21 @@ doas carp auto-failback disable # prevent auto-failback (for maintenance)
 doas carp auto-failback enable # re-enable auto-failback
 ```
 
+### CARP failover limitation when ZFS is suspended
+
+If f0's ZFS pool is SUSPENDED but f0's OS is still running, f0 remains CARP MASTER
+(it keeps sending CARP advertisements). Attempts to manually demote f0 via:
+
+```sh
+doas carp backup # may return exit=0 but has no effect
+doas ifconfig re0 vhid 1 state backup # may return exit=1 silently
+doas ifconfig re0 vhid 1 advskew 254 # may return exit=1 silently
+```
+
+…can all silently fail because the kernel has too many stuck IO threads blocking
+the ifconfig ioctl path. The CARP VIP will **not** float to f1 in this case.
+**Only a hard power cycle of f0 reliably triggers CARP failover.**
+
 ### Auto-failback from f1 to f0
 
 Script `/usr/local/bin/carp-auto-failback.sh` runs every minute via cron on f0. Checks: currently BACKUP? `/data/nfs` mounted? Marker file exists? Failback not blocked? If all conditions met, promotes f0 to MASTER.
@@ -570,6 +652,114 @@ After NFS is restored on the server side, the `nfs-mount-monitor` systemd timer
 
 **Note:** The monitor catches three failure modes: missing mountpoint, stat hang (reads unresponsive), and **silent write hang** (reads OK but writes block — the hardest case, e.g. stunnel-wrapped NFSv4 after a CARP failover). Watch the consecutive-failure counter via Prometheus (`nfs_mount_monitor_consecutive_failures`) — warning fires at ≥3, critical at ≥5. At 5 consecutive failures the node cordons itself and reboots.
 
+### ZFS pool SUSPENDED recovery
+
+**Symptoms**: `doas zpool status zdata` shows `state: SUSPENDED`. All IO to the pool is
+halted — ZFS suspends itself to prevent corruption when IO errors exceed the threshold.
+Commands like `zpool clear`, `zpool scrub`, `zpool offline`, and even `ls /data/nfs/` hang
+indefinitely because they wait for kernel IO that will never complete.
+
+**Known cause (2026-05-15)**: Samsung 870 EVO 1TB on f0 (ada1) hit 107 read errors and
+105M+ write errors during normal operation — likely thermal throttling or a momentary
+SATA connection loss. A previous resilver on 2026-01-27 suggests the drive has been
+marginal for months.
+
+**Recovery — hard power cycle only**:
+- Do NOT attempt `doas shutdown -r now` — if ZFS is suspended, the graceful shutdown hangs
+  at ZFS pool export and may stay stuck for 30–60+ minutes.
+- Do NOT attempt `doas zpool clear zdata` — it hangs because ada1 is unresponsive.
+- Do NOT attempt `doas ifconfig re0 vhid 1 state backup` or `doas carp backup` to fail
+  over to f1 first — these ifconfig ioctls can also be blocked when the kernel has too
+  many stuck IO threads. They may return exit=1 silently.
+- **Hard power cycle** (pull power or hold the power button) resolves the issue in ~9 s
+  (Rocky Linux VMs come up automatically, ZFS pool imports cleanly on next boot).
+
+**Post-recovery**:
+```sh
+# 1. Verify pool health
+doas zpool status zdata # should show ONLINE, 0 errors
+
+# 2. Check SMART for drive health
+doas smartctl -a /dev/ada1 | grep -iE '(temperature|reallocated|pending|uncorrectable|error)'
+
+# 3. Start a scrub to verify data integrity
+doas zpool scrub zdata
+doas zpool status zdata # monitor; "scrub repaired 0 in ..." means data intact
+
+# 4. Verify NFS is serving (stunnel listening on CARP VIP)
+doas sockstat -l | grep 2323
+```
+
+**After cluster recovery**:
+- Check for cordoned nodes: `kubectl get nodes` — if r0/r1/r2 show `SchedulingDisabled`,
+  uncordon them (see nfs-mount-monitor escalation section above).
+- Reset fail counters on all r-nodes: `echo 0 > /var/lib/nfs-mount-monitor/fail-count`
+
+**Temperature monitoring** to detect thermal issues before they cause pool suspension:
+```sh
+# FreeBSD: load coretemp for CPU package temperature
+doas kldload coretemp
+sysctl -a | grep temperature # hw.acpi.thermal.*: and dev.cpu.*:
+# Persist across reboots
+echo 'coretemp_load="YES"' | doas tee -a /boot/loader.conf
+
+# SSD temperature (install smartmontools if absent)
+doas pkg install -y smartmontools
+doas smartctl -a /dev/ada1 | grep -i temperature # "194 Temperature_Celsius"
+```
+
+## Thermal Troubleshooting
+
+### Symptoms of thermal throttling on f-hosts
+
+- SSD I/O slowness (writes dropping from MB/s to KB/s)
+- ZFS txg sync times jumping from <100ms to 5-37 seconds
+- `zpool trim` stuck at 0% or paused indefinitely
+- rsync / zrepl jobs going into D-state (waiting on ZFS I/O)
+- High system CPU (80%+) from encryption overhead (ZFS native AES-256-GCM)
+
+### How to check temperatures
+
+- **coretemp (real per-core die temps)**: `kldload coretemp; sysctl dev.cpu | grep temperature`
+  - Should now auto-load via `/boot/loader.conf` (`coretemp_load="YES"`)
+- **hw.acpi.thermal.tz0**: Often a constant lie (e.g. always 27.9°C) — do NOT rely on it
+- **SSD temperature**: `smartctl -a /dev/adaN` (requires smartmontools; may not be installed)
+- **Disk I/O performance**: `gstat -bp -I 1s -d` (FreeBSD gstat, not Linux iostat)
+- **ZFS txg sync times**: `zpool events | grep -i sync` or check via `zpool status -v`
+
+### Beelink S12 Pro specifics
+
+- Small enclosure with passive/minimal cooling — heat accumulates fast under sustained load
+- N100 CPU: normal idle ~40-55°C, warn >70°C idle, critical >85°C under load
+- NVMe sits close to CPU — both heat each other in the small chassis
+- Enclosure gets hot to the touch before temps fully register in software
+
+### Cascade failure pattern (2026-05-16 f0 incident)
+
+The following cascade was observed:
+
+1. Hot enclosure (NVMe physically very hot) → SSD thermal throttling
+2. Concurrent rsync + 1-min zrepl snapshots + paused scrub → high I/O demand
+3. autotrim=off (never trimmed) → SSD write amplification → further slowdown
+4. ZFS native AES-256-GCM encryption → high CPU per I/O → txg sync times 5-37s
+5. TRIM stuck at 0% for hours (couldn't make progress under continuous I/O load)
+6. rsync went into D-state waiting on ZFS → appeared "hung"
+
+**Root causes**: (a) autotrim=off (SSD never trimmed); (b) hot enclosure + thermal throttling;
+(c) zrepl snapshot interval too aggressive (1m).
+
+**Resolution**: Reseat/inspect drive + enclosure. After hardware fix, autotrim=on enabled,
+manual TRIM ran to completion at ~2.4 GB/s. See "SSD TRIM Configuration" section.
+
+### Remediation steps
+
+1. SSH in and check temps: `kldload coretemp && sysctl dev.cpu | grep temperature`
+2. If >80°C: stop heavy I/O workloads immediately (`service zrepl stop`, cancel scrubs)
+3. Physical: shut down, reseat NVMe, clean dust from vents, improve airflow
+4. After hardware fix: enable autotrim (`zpool set autotrim=on <pool>`) and run `zpool trim <pool>`
+5. Monitor trim progress: `zpool status | grep trim`
+6. Persist coretemp: ensure `/boot/loader.conf` has `coretemp_load="YES"` (see task 95)
+
 ### Checklist for NFS outage on CARP MASTER (f0 or f1)
 
 ```sh
@@ -631,7 +821,11 @@ If any probe fails, `fix_mount` runs:
 3. `umount -f` (force unmount)
 4. `umount -l` (lazy detach VFS node if `-f` failed)
 5. `systemctl restart stunnel` + 2s sleep (refresh the TLS transport)
-6. `mount` (fresh mount via stunnel)
+6. `mount -t nfs4 -o port=2323,soft,timeo=50,retrans=3` (explicit soft NFS mount — NOT
+   `mount $MOUNT_POINT` which reads fstab's `hard` flag and enters uninterruptible D-state
+   if the server is unreachable; SIGKILL cannot wake a D-state process on Linux;
+   `soft,timeo=50,retrans=3` returns ETIMEDOUT after ~15 s so the fail counter can
+   increment and eventually trigger the reboot escalation)
 
 A hard **60-second deadline** prevents `fix_mount` from outlasting its own timer interval.
 
@@ -641,7 +835,13 @@ Unknown / Pending / ContainerCreating so the kubelet can reschedule them.
 **Consecutive-failure escalation**: each `fix_mount` failure increments a counter
 persisted to `/var/lib/nfs-mount-monitor/fail-count`. At `NFS_FAIL_THRESHOLD=5`
 consecutive failures (~50 s), the node cordons itself (`kubectl cordon`) and issues
-`systemctl reboot`.
+`systemctl reboot`. The cordon is stored in etcd and **persists across reboots** —
+after the underlying NFS issue is resolved, manually uncordon each affected node:
+```sh
+kubectl uncordon r0.lan.buetow.org
+kubectl uncordon r1.lan.buetow.org
+kubectl uncordon r2.lan.buetow.org
+```
 The counter is also exported to
 `/var/lib/node_exporter/textfile_collector/nfs_mount_monitor.prom` so Prometheus
 can alert on `nfs_mount_monitor_consecutive_failures` without parsing
@@ -649,7 +849,11 @@ journal logs (warning ≥3, critical ≥5 — see
 `f3s/prometheus/manifests/nfs-mount-monitor-alerts.yaml`).
 
 Uses a lock file (`/var/run/nfs-mount-check.lock`) to prevent overlapping runs
-since the timer fires faster than the script's worst-case runtime.
+since the timer fires faster than the script's worst-case runtime. If the lock is
+older than **90 seconds** it was left by a run that was SIGKILLed before its EXIT
+trap could clean up (systemd kills with SIGKILL after its own timeout, bypassing
+`trap "rm -f $LOCK_FILE" EXIT`); the stale lock is removed and the run continues,
+preventing all health checks from being silently skipped forever.
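Review note (not part of the commit): the stale-lock rule in this hunk comes down to a single age comparison. The sketch below is one way to express it, on the assumption that staleness is judged from the lock file's modification time. The monitor script itself is not shown in this diff, so the function name `lock_is_stale` and its argument order are hypothetical.

```sh
# Hypothetical helper (the real monitor script is not part of this diff).
# Succeeds when a lock last modified at $1 (epoch seconds) is more than
# $3 seconds old at time $2 (epoch seconds).
lock_is_stale() {
  lock_mtime=$1
  now=$2
  max_age=$3
  [ $((now - lock_mtime)) -gt "$max_age" ]
}

# Example use on an r-node (GNU stat; path and 90 s limit taken from the hunk above):
#   LOCK_FILE=/var/run/nfs-mount-check.lock
#   if [ -e "$LOCK_FILE" ] && lock_is_stale "$(stat -c %Y "$LOCK_FILE")" "$(date +%s)" 90; then
#     rm -f "$LOCK_FILE"
#   fi
```

Taking the timestamps as arguments keeps the comparison easy to test without touching the real lock file.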
 
 ### Timer configuration
 
@@ -659,6 +863,25 @@ since the timer fires faster than the script's worst-case runtime.
 | `OnUnitActiveSec` | 10s | Check interval; each run is bounded by a 60-second deadline |
 | `AccuracySec` | 1s | Prevent systemd batching from delaying the 10 s interval |
 
+### Managing the monitor during an extended NFS outage
+
+During a prolonged NFS outage (e.g. while the storage host is being power-cycled or
+repaired), stop the timer on affected r-nodes to prevent the escalation counter from
+reaching the auto-reboot threshold prematurely:
+
+```sh
+# On each affected r-node (as root)
+systemctl stop nfs-mount-monitor.timer
+echo 0 > /var/lib/nfs-mount-monitor/fail-count # reset counter
+
+# After NFS is restored, restart and verify
+systemctl start nfs-mount-monitor.timer
+journalctl -u nfs-mount-monitor -f
+```
+
+Also reset the counter to 0 after uncordoning nodes (see escalation section above),
+because the old counter value would lower the effective threshold for the next outage.
+
 ### Status and logs
 
 ```sh
