| field | value | date |
|---|---|---|
| author | Paul Buetow <paul@buetow.org> | 2026-05-10 10:25:28 +0300 |
| committer | Paul Buetow <paul@buetow.org> | 2026-05-10 10:25:28 +0300 |
| commit | 360d06ceb88d0965aeb7dc9831f48725ad6ebc89 | |
| tree | 7d939044e4d3a1fb597b161869af630786536443 | |
| parent | 7940bfbd7456e0c65f75b5d80bf88bd05a2484dd | |
f3s/storage: document NFS auto-repair subsystem and fix stale cron reference

Update 'Pods stuck' troubleshooting entry: the script is now driven by
a systemd timer (10 s) rather than a cron job (1 min). Add a new
'NFS Auto-Repair: nfs-mount-monitor' section documenting the repo
layout (f3s/r-nodes/nfs-mount-monitor/), the Rex deploy command, what
the script does, and timer configuration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| -rw-r--r-- | prompts/skills/f3s/references/storage.md | 49 |
1 files changed, 48 insertions, 1 deletions
diff --git a/prompts/skills/f3s/references/storage.md b/prompts/skills/f3s/references/storage.md
index ea77260..958a91d 100644
--- a/prompts/skills/f3s/references/storage.md
+++ b/prompts/skills/f3s/references/storage.md
@@ -566,7 +566,7 @@ NFS path structure on k3s nodes: `/data/nfs/k3svolumes/<app>/`
 
 ### Pods stuck in ContainerCreating/Unknown after NFS recovery
 
-After NFS is restored on the server side, the r-nodes' cron job (`check-nfs-mount.sh`) will auto-remount within 1 minute and force-delete stuck pods. If immediate recovery is needed: `mount /data/nfs/k3svolumes` on each r-node, then delete the stuck pods manually.
+After NFS is restored on the server side, the `nfs-mount-monitor` systemd timer on each r-node will auto-remount within ~10 seconds and force-delete stuck pods. If immediate recovery is needed: `mount /data/nfs/k3svolumes` on each r-node, then delete the stuck pods manually.
 
 ### Checklist for NFS outage on CARP MASTER (f0 or f1)
 
@@ -589,6 +589,53 @@ doas service nfsd restart
 doas service stunnel start
 ```
 
+## NFS Auto-Repair: nfs-mount-monitor
+
+A systemd timer+service pair on r0/r1/r2 checks the NFS mount every 10 seconds and automatically repairs it if stale or missing.
+
+### Repo location
+
+```
+f3s/r-nodes/nfs-mount-monitor/
+  check-nfs-mount.sh          # repair script → /usr/local/bin/
+  nfs-mount-monitor.service   # one-shot service → /etc/systemd/system/
+  nfs-mount-monitor.timer     # 10-second timer → /etc/systemd/system/
+f3s/r-nodes/Rexfile           # Rex deploy task: nfs_mount_monitor
+```
+
+### Deploy
+
+```sh
+# From repo root — pushes to all three r-nodes and reloads systemd if anything changed
+rex -f f3s/r-nodes/Rexfile nfs_mount_monitor
+```
+
+### What it does
+
+1. Checks whether `/data/nfs/k3svolumes` is mounted (`mountpoint`).
+2. Checks whether the mount is responsive (`timeout 2s stat`).
+3. If either check fails, attempts: `mount -o remount -f`, then `umount -f` + `mount`.
+4. On successful repair, force-deletes pods on this node that are stuck in
+   Unknown / Pending / ContainerCreating so the kubelet can reschedule them.
+
+Uses a lock file (`/var/run/nfs-mount-check.lock`) to prevent overlapping runs
+since the timer fires faster than the script's worst-case runtime.
+
+### Timer configuration
+
+| Parameter | Value | Reason |
+|-----------|-------|--------|
+| `OnBootSec` | 30s | Let network and NFS client start before first check |
+| `OnUnitActiveSec` | 10s | Check interval (was 1 min via cron; now tighter) |
+| `AccuracySec` | 1s | Prevent systemd batching from delaying the 10 s interval |
+
+### Status and logs
+
+```sh
+systemctl status nfs-mount-monitor.timer
+journalctl -u nfs-mount-monitor -f
+```
+
 ## AWS S3 Glacier Deep Archive Backups
 
 Encrypted incremental ZFS snapshots from `zdata` pool backed up daily to **AWS S3 Glacier Deep Archive** via cron. Scripts adapted from FreeBSD Home NAS setup. Also performs periodic zpool scrubbing.
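
The check-and-repair sequence listed under "What it does" can be sketched in shell as follows. This is a minimal illustration under stated assumptions, not the repository's `check-nfs-mount.sh`: the function names (`is_mounted`, `is_responsive`, `mount_healthy`, `try_repair`, `check_once`), the `MOUNT_POINT` override, and the `ok` / `repaired` / `failed` output words are hypothetical. Lock file handling and the pod cleanup step are left out. The commands used are only those the documentation names.

```sh
#!/bin/sh
# Sketch only: function names and output words are hypothetical,
# not taken from the real check-nfs-mount.sh.
MOUNT_POINT="${MOUNT_POINT:-/data/nfs/k3svolumes}"

# Step 1: is the path a mount point at all?
is_mounted() {
    mountpoint -q "$MOUNT_POINT"
}

# Step 2: does the mount answer within 2 seconds? A stale NFS mount
# typically hangs on stat, so timeout turns the hang into a failure.
is_responsive() {
    timeout 2s stat "$MOUNT_POINT" >/dev/null 2>&1
}

mount_healthy() {
    is_mounted && is_responsive
}

# Step 3: try a forced remount first, then a forced unmount plus mount.
try_repair() {
    if mount -o remount -f "$MOUNT_POINT" 2>/dev/null && mount_healthy; then
        return 0
    fi
    umount -f "$MOUNT_POINT" 2>/dev/null
    mount "$MOUNT_POINT" 2>/dev/null && mount_healthy
}

# One pass of the check. Step 4 (force-deleting stuck pods after a
# successful repair) would follow the "repaired" branch and is omitted.
check_once() {
    if mount_healthy; then
        echo "ok"
        return 0
    fi
    if try_repair; then
        echo "repaired"
        return 0
    fi
    echo "failed"
    return 1
}

# Run a single check only when invoked with "run" as the first argument.
if [ "${1:-}" = "run" ]; then
    check_once
fi
```

Invoked as `sh sketch.sh run` (a hypothetical file name) on an r-node, it prints one status word per pass. Keeping each step in its own function means the mount checks can be stubbed out when trying the control flow on a machine without the NFS mount.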
