| author | Paul Buetow <paul@buetow.org> | 2026-05-10 10:37:53 +0300 |
|---|---|---|
| committer | Paul Buetow <paul@buetow.org> | 2026-05-10 10:37:53 +0300 |
| commit | 965e61016751d132fe83a8f44c6a1bf87d92b1a8 | |
| tree | c95a7067681512ff5a7a05a03c99e40d6db6ad3c /frontends/scripts | |
| parent | 3964965c8ad5eeee16d3338ded718bbd34e1c69d | |
nfs-mount-monitor: escalate to reboot after N consecutive fix_mount failures
Persist a consecutive-failure counter to /var/lib/nfs-mount-monitor/fail-count.
Increment on every fix_mount failure; reset to 0 on any successful repair or
when all three probes pass cleanly. After NFS_FAIL_THRESHOLD consecutive
failures (default 5, about 50s) the node is cordoned via kubectl and rebooted
with 'systemctl reboot', so the cluster stops routing pods to a silently
broken node.
NFS_FAIL_THRESHOLD is configurable via /etc/default/nfs-mount-monitor (deployed
as EnvironmentFile in the .service unit) without touching the script.
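
The counter and escalation logic described above might look roughly like the
following sketch. It is not the script from this commit: only
NFS_FAIL_THRESHOLD, the fail-count path, fix_mount, 'kubectl cordon' and
'systemctl reboot' come from the message. The helper names, the
FAIL_COUNT_FILE and DRY_RUN variables, and the use of hostname as the node
name are assumptions made for illustration.

```shell
#!/bin/sh
# Illustrative sketch, not the committed script. Helper names, FAIL_COUNT_FILE
# and DRY_RUN are hypothetical.

# State file named in the commit message; overridable so the sketch runs unprivileged.
FAIL_COUNT_FILE="${FAIL_COUNT_FILE:-/var/lib/nfs-mount-monitor/fail-count}"
# Normally supplied through /etc/default/nfs-mount-monitor; falls back to 5.
NFS_FAIL_THRESHOLD="${NFS_FAIL_THRESHOLD:-5}"

# Print the persisted counter, treating a missing or non-numeric file as 0.
read_fail_count() {
    count=$(cat "$FAIL_COUNT_FILE" 2>/dev/null)
    case "$count" in
        ''|*[!0-9]*) count=0 ;;
    esac
    echo "$count"
}

# Persist a new counter value, creating the state directory if needed.
write_fail_count() {
    mkdir -p "$(dirname "$FAIL_COUNT_FILE")"
    echo "$1" > "$FAIL_COUNT_FILE"
}

# Called after a successful repair or when all probes pass.
reset_fail_count() {
    write_fail_count 0
}

# Called after a failed repair; returns 0 once the threshold is reached.
record_failure() {
    count=$(( $(read_fail_count) + 1 ))
    write_fail_count "$count"
    [ "$count" -ge "$NFS_FAIL_THRESHOLD" ]
}

# Cordon the node, then reboot. Assumes the node name equals the hostname.
# With DRY_RUN=1 the commands are printed instead of executed.
escalate() {
    node=$(hostname)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "kubectl cordon $node"
        echo "systemctl reboot"
    else
        kubectl cordon "$node"
        systemctl reboot
    fi
}

# Possible wiring in the monitor loop (fix_mount is defined elsewhere):
#   if fix_mount; then reset_fail_count
#   elif record_failure; then escalate
#   fi
```

Keeping the counter in a file rather than in a shell variable means it
survives restarts of the monitor service, which matters if the unit is
restarted between checks.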
Also fix Rexfile path resolution: __FILE__ inside a Rex task resolves to the
internal Rex loader path, not the Rexfile itself; use realpath($::rexfile)
instead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Diffstat (limited to 'frontends/scripts')
0 files changed, 0 insertions, 0 deletions
