path: root/f3s/r-nodes
Age  Commit message  Author
2026-05-16  nfs-mount-monitor: switch to soft NFS mount + handle stale lockfile  Paul Buetow
A hard NFS mount that fails enters uninterruptible kernel sleep (D-state) which SIGKILL cannot wake, so the recovery script hangs forever and the lockfile stays — silently disabling all subsequent health checks. Switch the remount to explicit soft,timeo=50,retrans=3 so the kernel gives up after ~15s, and detect/remove lockfiles older than 90s left behind by a SIGKILL'd predecessor.
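The stale-lockfile half of this change could look roughly like the sketch below. It is not the code from check-nfs-mount.sh: the function names `lock_is_stale` and `acquire_lock` and the way the lock path is passed in are assumptions for illustration, and it relies on GNU `stat`.

```shell
#!/usr/bin/env bash
# Sketch only: function names and arguments are hypothetical.
# The commit's remount options are soft,timeo=50,retrans=3; timeo is in
# deciseconds, so 50 means a 5-second wait before each retransmission.

# Return 0 if FILE exists and its mtime is more than MAX_AGE seconds old.
lock_is_stale() {
    local file=$1 max_age=$2 mtime now
    [ -e "$file" ] || return 1
    mtime=$(stat -c %Y "$file") || return 1   # GNU stat: mtime as epoch seconds
    now=$(date +%s)
    [ $(( now - mtime )) -gt "$max_age" ]
}

# Take the lock, first removing one left behind by a killed predecessor.
# Returns non-zero if a fresh lock is still held by someone else.
acquire_lock() {
    local file=$1 max_age=$2
    if lock_is_stale "$file" "$max_age"; then
        rm -f -- "$file"
    fi
    # With noclobber the redirect fails if the file already exists.
    ( set -o noclobber; echo "$$" > "$file" ) 2>/dev/null
}
```

A caller would run something like `acquire_lock /path/to/lockfile 90 || exit 0` at the top of each timer run, with 90 being the age limit named in the commit message.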
2026-05-10  nfs-monitor: make textfile .prom world-readable for node_exporter  Paul Buetow
node_exporter runs as uid 65534 (nobody); mktemp creates files with mode 600 (root-only). Add chmod 644 before the atomic mv so the node_exporter process can read nfs_mount_monitor.prom on its scrape.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
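A minimal sketch of the write path this commit describes (mktemp, chmod 644, then mv) is shown below. The helper name `write_textfile_metric` appears in the log, but its signature here, the HELP text, and the temp-file naming are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: the argument order and HELP string are hypothetical.

# write_textfile_metric DIR VALUE
# Writes the gauge to DIR/nfs_mount_monitor.prom via a temp file in DIR.
write_textfile_metric() {
    local dir=$1 value=$2 tmp
    # Temp file lives in the target directory so mv is a same-filesystem
    # rename; its name does not end in .prom, so a scrape skips it.
    tmp=$(mktemp "$dir/nfs_mount_monitor.prom.XXXXXX") || return 1
    {
        echo '# HELP nfs_mount_monitor_consecutive_failures Consecutive fix_mount failures.'
        echo '# TYPE nfs_mount_monitor_consecutive_failures gauge'
        echo "nfs_mount_monitor_consecutive_failures $value"
    } > "$tmp"
    # mktemp creates the file with mode 600; widen it before publishing.
    chmod 644 "$tmp"
    mv -f -- "$tmp" "$dir/nfs_mount_monitor.prom"
}
```

Doing the chmod before the mv means the file is never visible under its final name with the restrictive mode.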
2026-05-10  nfs-monitor: add Prometheus alerts for NFS auto-repair failures  Paul Buetow
- check-nfs-mount.sh: write nfs_mount_monitor_consecutive_failures gauge to /var/lib/node_exporter/textfile_collector/nfs_mount_monitor.prom on every run (via write_textfile_metric helper, called from write_fail_count and directly on healthy runs); atomic tmp+mv write prevents partial reads
- Rexfile: create /var/lib/node_exporter/textfile_collector dir on r-nodes
- prometheus.yaml (ArgoCD app): enable textfile_collector in node_exporter DaemonSet via extraArgs/extraVolumes/extraVolumeMounts; mount host path /var/lib/node_exporter/textfile_collector into container
- persistence-values.yaml: sync node_exporter textfile_collector config
- nfs-mount-monitor-alerts.yaml: PrometheusRule with two alerts:
    NfsMountAutoRepairWarning (>= 3 consecutive failures, severity: warning)
    NfsMountAutoRepairCritical (>= 5 consecutive failures, severity: critical)
  wired into new 'nfs-alerts' Alertmanager receiver with 30m repeat_interval

Tested: rex deploy succeeded, .prom files present on r0/r1/r2, timer clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10  nfs-mount-monitor: escalate to reboot after N consecutive fix_mount failures  Paul Buetow
Persist a consecutive-failure counter to /var/lib/nfs-mount-monitor/fail-count. Increment on every fix_mount failure; reset to 0 on any successful repair or when all three probes pass cleanly. After NFS_FAIL_THRESHOLD (default 5, ~50s) consecutive failures the node is cordoned via kubectl and rebooted with 'systemctl reboot' so the cluster stops routing pods to a silently broken node.

NFS_FAIL_THRESHOLD is configurable via /etc/default/nfs-mount-monitor (deployed as EnvironmentFile in the .service unit) without touching the script.

Also fix Rexfile path resolution: __FILE__ inside a Rex task resolves to the internal Rex loader path, not the Rexfile itself; use realpath($::rexfile) instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
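The counter-and-threshold logic described above might be sketched as follows. Only the file path, NFS_FAIL_THRESHOLD, and the name write_fail_count come from the log; the other function names, the DRY_RUN guard, and the assumption that the Kubernetes node name equals `uname -n` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: not the deployed script.
FAIL_COUNT_FILE="${FAIL_COUNT_FILE:-/var/lib/nfs-mount-monitor/fail-count}"
NFS_FAIL_THRESHOLD="${NFS_FAIL_THRESHOLD:-5}"

# Print the stored count; a missing, empty, or corrupt file counts as 0.
read_fail_count() {
    local n=0
    [ -r "$FAIL_COUNT_FILE" ] && read -r n < "$FAIL_COUNT_FILE"
    case $n in ''|*[!0-9]*) n=0 ;; esac
    echo "$n"
}

write_fail_count() {
    mkdir -p -- "$(dirname -- "$FAIL_COUNT_FILE")" && echo "$1" > "$FAIL_COUNT_FILE"
}

# Call after a failed fix_mount. Returns 0 once the threshold is reached.
record_failure() {
    local n
    n=$(( 10#$(read_fail_count) + 1 ))   # 10# forces base 10 for values like 08
    write_fail_count "$n"
    [ "$n" -ge "$NFS_FAIL_THRESHOLD" ]
}

# Call after a successful repair or a clean probe run.
record_success() { write_fail_count 0; }

# Cordon the node, then reboot. DRY_RUN (hypothetical) only prints the plan.
escalate() {
    local node
    node=$(uname -n)   # assumption: node name matches the host name
    if [ -n "${DRY_RUN:-}" ]; then
        echo "would cordon $node and reboot"
        return 0
    fi
    kubectl cordon "$node"
    systemctl reboot
}
```

In a timer-driven script the failure branch would then read `record_failure && escalate`, so the reboot only fires on the run that crosses the threshold.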
2026-05-10  nfs-mount-monitor: strengthen fix_mount recovery sequence  Paul Buetow
Add lazy umount fallback, D-state process killer, stunnel restart, and 60-second hard deadline to prevent fix_mount from looping forever when processes are stuck in D state on a stale NFSv4-over-stunnel mount.

Recovery sequence is now:
1. mount -o remount -f (cheap, no disruption)
2. kill_pinning_processes (SIGKILL D-state procs with nfs_ wchan)
3. umount -f (force unmount)
4. umount -l (lazy detach VFS node if -f failed)
5. systemctl restart stunnel + 2s sleep (refresh TLS transport)
6. mount (fresh mount)

The 60s deadline uses bash $SECONDS so fix_mount can never outlast its own 10-second timer interval by an unbounded amount.

Deployed to all three r-nodes (r0/r1/r2) via rex nfs_mount_monitor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
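One way to enforce a $SECONDS-based deadline across an ordered list of recovery steps is sketched below. The runner name `run_steps_until_deadline` is hypothetical, and the real fix_mount may be structured differently (for example, stopping as soon as the remount succeeds). This sketch checks the clock only between steps, so a single step that blocks is not interrupted.

```shell
#!/usr/bin/env bash
# Sketch only: the runner and its calling convention are hypothetical.

# run_steps_until_deadline BUDGET_SECONDS step [step...]
# Runs each named function in order. Returns 1 without running the
# remaining steps once BUDGET_SECONDS have elapsed since the call began.
run_steps_until_deadline() {
    local budget=$1 start=$SECONDS step
    shift
    for step in "$@"; do
        if [ $(( SECONDS - start )) -ge "$budget" ]; then
            echo "deadline of ${budget}s reached before step: $step" >&2
            return 1
        fi
        # A failing step does not abort the sequence; later steps are
        # the fallbacks for earlier ones.
        "$step" || echo "step failed, continuing: $step" >&2
    done
    return 0
}

# Hypothetical call, one function per step of the recovery sequence:
#   run_steps_until_deadline 60 try_remount kill_pinning_processes \
#       force_umount lazy_umount restart_stunnel fresh_mount
```

bash increments SECONDS in whole seconds, so the effective deadline can be off by up to one second in either direction.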
2026-05-10  nfs-mount-monitor: add write-probe to detect 'reads OK, writes hang' state  Paul Buetow
Stunnel-wrapped NFSv4 can enter a half-broken state where mountpoint(1) returns true and stat(1) completes from cache, but ALL writes hang indefinitely. This was observed on r2 on 2026-05-10 causing navidrome to be unschedulable. The existing two probes passed while writes were dead.

Add a third probe (write-probe) after the stat probe: write the shell's PID to a per-host .healthcheck.<hostname> file and immediately remove it, wrapped in a 5-second timeout. The per-host filename prevents r0/r1/r2 from racing on the same file. 5s gives one full NFS retransmit window (timeo=10 deciseconds = 1s, retrans=2) plus margin without making the 10-second timer run too long.

Deployed to r0/r1/r2 via rex nfs_mount_monitor; all three nodes confirmed running the new script (journalctl shows clean exits).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
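The write-probe described above can be approximated with coreutils `timeout`, as in this sketch. The function name `write_probe` and its arguments are assumptions; the per-host file name and the 5-second limit follow the commit message.

```shell
#!/usr/bin/env bash
# Sketch only: function name and arguments are hypothetical.

# write_probe DIR [LIMIT_SECONDS]
# Writes a PID to DIR/.healthcheck.<hostname> and removes it again.
# Returns non-zero if the write fails or exceeds the time limit
# (timeout(1) exits with 124 when the limit is hit).
write_probe() {
    local dir=$1 limit=${2:-5} probe
    probe="$dir/.healthcheck.$(uname -n)"   # per-host name avoids cross-node races
    # The child shell does the write so timeout can bound it; $$ inside
    # the single quotes is the child's PID.
    timeout "$limit" bash -c 'echo "$$" > "$1" && rm -f -- "$1"' _ "$probe"
}
```

As the 2026-05-16 entry notes, a process blocked on a hard mount may not respond to signals, so the timeout is only a reliable bound when the mount can actually give up.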
2026-05-10  f3s/r-nodes: track NFS auto-repair script and systemd units in conf repo  Paul Buetow
Pull check-nfs-mount.sh, nfs-mount-monitor.service, and nfs-mount-monitor.timer from r0/r1/r2 (confirmed identical on all three nodes) into f3s/r-nodes/nfs-mount-monitor/. Add f3s/r-nodes/Rexfile with an idempotent nfs_mount_monitor task that pushes the files to all three r-nodes as root and reloads systemd when content changes. Wire the new Rexfile into the repo root Rexfile.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>