diff options
| author | Paul Buetow <paul@buetow.org> | 2026-05-10 10:42:59 +0300 |
|---|---|---|
| committer | Paul Buetow <paul@buetow.org> | 2026-05-10 10:42:59 +0300 |
| commit | f8179f12afd53f3ce7f8a9f13155ecdef7c7382b (patch) | |
| tree | d2284606a2bb73112d33389226b15dd063a0dde6 /frontends/scripts | |
| parent | 965e61016751d132fe83a8f44c6a1bf87d92b1a8 (diff) | |
nfs-monitor: add Prometheus alerts for NFS auto-repair failures
- check-nfs-mount.sh: write nfs_mount_monitor_consecutive_failures gauge
to /var/lib/node_exporter/textfile_collector/nfs_mount_monitor.prom on
every run (via write_textfile_metric helper, called from write_fail_count
and directly on healthy runs); atomic tmp+mv write prevents partial reads
- Rexfile: create /var/lib/node_exporter/textfile_collector dir on r-nodes
- prometheus.yaml (ArgoCD app): enable textfile_collector in node_exporter
DaemonSet via extraArgs/extraVolumes/extraVolumeMounts; mount host path
/var/lib/node_exporter/textfile_collector into container
- persistence-values.yaml: sync node_exporter textfile_collector config
- nfs-mount-monitor-alerts.yaml: PrometheusRule with two alerts:
NfsMountAutoRepairWarning (>= 3 consecutive failures, severity: warning)
NfsMountAutoRepairCritical (>= 5 consecutive failures, severity: critical)
wired into new 'nfs-alerts' Alertmanager receiver with 30m repeat_interval
Tested: rex deploy succeeded, .prom files present on r0/r1/r2, timer clean.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Diffstat (limited to 'frontends/scripts')
0 files changed, 0 insertions, 0 deletions
