path: root/f3s/r-nodes/Rexfile
Age  Commit message  Author
2026-05-10  nfs-monitor: add Prometheus alerts for NFS auto-repair failures  Paul Buetow
- check-nfs-mount.sh: write nfs_mount_monitor_consecutive_failures gauge to
  /var/lib/node_exporter/textfile_collector/nfs_mount_monitor.prom on every run
  (via write_textfile_metric helper, called from write_fail_count and directly
  on healthy runs); atomic tmp+mv write prevents partial reads
- Rexfile: create /var/lib/node_exporter/textfile_collector dir on r-nodes
- prometheus.yaml (ArgoCD app): enable textfile_collector in node_exporter
  DaemonSet via extraArgs/extraVolumes/extraVolumeMounts; mount host path
  /var/lib/node_exporter/textfile_collector into container
- persistence-values.yaml: sync node_exporter textfile_collector config
- nfs-mount-monitor-alerts.yaml: PrometheusRule with two alerts:
    NfsMountAutoRepairWarning (>= 3 consecutive failures, severity: warning)
    NfsMountAutoRepairCritical (>= 5 consecutive failures, severity: critical)
  wired into new 'nfs-alerts' Alertmanager receiver with 30m repeat_interval

Tested: rex deploy succeeded, .prom files present on r0/r1/r2, timer clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
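The atomic tmp+mv write mentioned in this commit can be sketched roughly as follows. This is not the actual check-nfs-mount.sh, which is not shown in this log. The metric name, file name, and helper name come from the commit message. The TEXTFILE_DIR override and the HELP text are assumptions added for illustration.

```shell
# Hypothetical sketch of a write_textfile_metric helper.
# TEXTFILE_DIR is an assumed override; the default is the path from the commit.
write_textfile_metric() {
    count="$1"
    dir="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"
    prom_file="$dir/nfs_mount_monitor.prom"

    # The temp file lives in the same directory so that mv is a rename on one
    # filesystem. Its name does not end in .prom, so the collector skips it.
    tmp="$(mktemp "$prom_file.XXXXXX")" || return 1
    {
        echo "# HELP nfs_mount_monitor_consecutive_failures Consecutive failed NFS repair attempts."
        echo "# TYPE nfs_mount_monitor_consecutive_failures gauge"
        echo "nfs_mount_monitor_consecutive_failures $count"
    } > "$tmp" || { rm -f "$tmp"; return 1; }

    # mktemp creates the file with mode 0600; make it readable for the exporter.
    chmod 0644 "$tmp"

    # Readers see either the old file or the new one, never a partial write.
    mv -f "$tmp" "$prom_file"
}
```

Under these assumptions, a healthy run would call `write_textfile_metric 0`, and a failed repair would pass the current failure count instead.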
2026-05-10  nfs-mount-monitor: escalate to reboot after N consecutive fix_mount failures  Paul Buetow
Persist a consecutive-failure counter to /var/lib/nfs-mount-monitor/fail-count.
Increment on every fix_mount failure; reset to 0 on any successful repair or
when all three probes pass cleanly.

After NFS_FAIL_THRESHOLD (default 5, ~50s) consecutive failures the node is
cordoned via kubectl and rebooted with 'systemctl reboot' so the cluster stops
routing pods to a silently broken node.

NFS_FAIL_THRESHOLD is configurable via /etc/default/nfs-mount-monitor (deployed
as EnvironmentFile in the .service unit) without touching the script.

Also fix Rexfile path resolution: __FILE__ inside a Rex task resolves to the
internal Rex loader path, not the Rexfile itself; use realpath($::rexfile)
instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
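The counter logic described in this commit might look roughly like the sketch below. It is an illustration, not the deployed script. The fail-count path, NFS_FAIL_THRESHOLD, and its default of 5 come from the commit message, and write_fail_count is named in the later alerts commit. The names read_fail_count, record_failure, record_success, escalate, and the NFS_FAIL_COUNT_FILE override are hypothetical, as is the assumption that the Kubernetes node name equals the hostname.

```shell
# Hypothetical sketch of the consecutive-failure counter.
# NFS_FAIL_COUNT_FILE is an assumed override; the default is from the commit.
fail_count_file() {
    echo "${NFS_FAIL_COUNT_FILE:-/var/lib/nfs-mount-monitor/fail-count}"
}

# Print the stored count, treating a missing or non-numeric file as 0.
read_fail_count() {
    count="$(cat "$(fail_count_file)" 2>/dev/null)"
    case "$count" in
        ''|*[!0-9]*) count=0 ;;
    esac
    echo "$count"
}

write_fail_count() {
    file="$(fail_count_file)"
    mkdir -p "$(dirname "$file")" && echo "$1" > "$file"
}

# Call after a failed fix_mount. Exit status 0 means the threshold was reached.
record_failure() {
    threshold="${NFS_FAIL_THRESHOLD:-5}"
    count=$(( $(read_fail_count) + 1 ))
    write_fail_count "$count"
    [ "$count" -ge "$threshold" ]
}

# Call after a successful repair or when all probes pass.
record_success() {
    write_fail_count 0
}

# Assumption: the node is registered under its hostname.
escalate() {
    kubectl cordon "$(hostname)"
    systemctl reboot
}
```

A caller would then run something like `record_failure && escalate` on the failure path. With the default threshold of 5, the "~50s" figure in the commit message suggests the timer fires about every 10 seconds, though the timer unit itself is not shown here.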
2026-05-10  f3s/r-nodes: track NFS auto-repair script and systemd units in conf repo  Paul Buetow
Pull check-nfs-mount.sh, nfs-mount-monitor.service, and nfs-mount-monitor.timer
from r0/r1/r2 (confirmed identical on all three nodes) into
f3s/r-nodes/nfs-mount-monitor/.

Add f3s/r-nodes/Rexfile with an idempotent nfs_mount_monitor task that pushes
the files to all three r-nodes as root and reloads systemd when content
changes. Wire the new Rexfile into the repo root Rexfile.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
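The "reload only when content changes" behaviour of the task can be illustrated with a plain shell sketch. The real task is written for Rex and is not shown in this log, so this is only an analogy for the idempotency idea. The function names, the directory arguments, and the RELOAD_CMD hook are hypothetical; only the two unit file names come from the commit message.

```shell
# Hypothetical shell analogy for an idempotent deploy step.

# Copy SRC to DEST only if the content differs.
# Returns 0 if DEST was updated, 1 if already identical, 2 if the copy failed.
install_if_changed() {
    src="$1"
    dest="$2"
    if [ -f "$dest" ] && cmp -s "$src" "$dest"; then
        return 1
    fi
    cp "$src" "$dest" || return 2
}

# Install both unit files from SRC_DIR into UNIT_DIR and reload systemd once,
# but only if at least one file actually changed.
deploy_units() {
    src_dir="$1"
    unit_dir="$2"
    changed=0
    for unit in nfs-mount-monitor.service nfs-mount-monitor.timer; do
        if install_if_changed "$src_dir/$unit" "$unit_dir/$unit"; then
            changed=1
        fi
    done
    if [ "$changed" -eq 1 ]; then
        # RELOAD_CMD is an assumed hook so the reload can be replaced in tests.
        ${RELOAD_CMD:-systemctl daemon-reload}
    fi
}
```

Running `deploy_units` twice in a row with unchanged sources copies nothing and triggers no reload on the second run, which is the property the commit describes as idempotent. The script check-nfs-mount.sh could be pushed with the same `install_if_changed` helper.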