author Paul Buetow <paul@buetow.org> 2026-05-10 10:25:28 +0300
committer Paul Buetow <paul@buetow.org> 2026-05-10 10:25:28 +0300
commit 360d06ceb88d0965aeb7dc9831f48725ad6ebc89
tree 7d939044e4d3a1fb597b161869af630786536443
parent 7940bfbd7456e0c65f75b5d80bf88bd05a2484dd
f3s/storage: document NFS auto-repair subsystem and fix stale cron reference
Update 'Pods stuck' troubleshooting entry: the script is now driven by a systemd timer (10 s) rather than a cron job (1 min).

Add a new 'NFS Auto-Repair: nfs-mount-monitor' section documenting the repo layout (f3s/r-nodes/nfs-mount-monitor/), the Rex deploy command, what the script does, and timer configuration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

-rw-r--r-- prompts/skills/f3s/references/storage.md 49
1 files changed, 48 insertions, 1 deletions
diff --git a/prompts/skills/f3s/references/storage.md b/prompts/skills/f3s/references/storage.md
index ea77260..958a91d 100644
--- a/prompts/skills/f3s/references/storage.md
+++ b/prompts/skills/f3s/references/storage.md
@@ -566,7 +566,7 @@ NFS path structure on k3s nodes: `/data/nfs/k3svolumes/<app>/`
### Pods stuck in ContainerCreating/Unknown after NFS recovery
-After NFS is restored on the server side, the r-nodes' cron job (`check-nfs-mount.sh`) will auto-remount within 1 minute and force-delete stuck pods. If immediate recovery is needed: `mount /data/nfs/k3svolumes` on each r-node, then delete the stuck pods manually.
+After NFS is restored on the server side, the `nfs-mount-monitor` systemd timer on each r-node will auto-remount within ~10 seconds and force-delete stuck pods. If immediate recovery is needed: `mount /data/nfs/k3svolumes` on each r-node, then delete the stuck pods manually.
### Checklist for NFS outage on CARP MASTER (f0 or f1)
@@ -589,6 +589,53 @@ doas service nfsd restart
doas service stunnel start
```
+## NFS Auto-Repair: nfs-mount-monitor
+
+A systemd timer+service pair on r0/r1/r2 checks the NFS mount every 10 seconds and automatically repairs it if stale or missing.
+
+### Repo location
+
+```
+f3s/r-nodes/nfs-mount-monitor/
+ check-nfs-mount.sh # repair script → /usr/local/bin/
+ nfs-mount-monitor.service # one-shot service → /etc/systemd/system/
+ nfs-mount-monitor.timer # 10-second timer → /etc/systemd/system/
+f3s/r-nodes/Rexfile # Rex deploy task: nfs_mount_monitor
+```
+
+### Deploy
+
+```sh
+# From repo root — pushes to all three r-nodes and reloads systemd if anything changed
+rex -f f3s/r-nodes/Rexfile nfs_mount_monitor
+```
+
+### What it does
+
+1. Checks whether `/data/nfs/k3svolumes` is mounted (`mountpoint`).
+2. Checks whether the mount is responsive (`timeout 2s stat`).
+3. If either check fails, attempts: `mount -o remount -f`, then `umount -f` + `mount`.
+4. On successful repair, force-deletes pods on this node that are stuck in
+ Unknown / Pending / ContainerCreating so the kubelet can reschedule them.
+
+Uses a lock file (`/var/run/nfs-mount-check.lock`) to prevent overlapping runs
+since the timer fires faster than the script's worst-case runtime.
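
The flow described above can be sketched roughly as follows. This is a minimal illustration in POSIX sh, not the contents of the real `check-nfs-mount.sh`: the function names, the noclobber-based way of taking the lock, and the empty `delete_stuck_pods` placeholder are assumptions made for the example. Only the mount path, the lock file path, and the commands named in the steps come from the text.

```sh
#!/bin/sh
# Hypothetical sketch of the check/repair flow; names are illustrative.
MOUNT_POINT="${MOUNT_POINT:-/data/nfs/k3svolumes}"
LOCK_FILE="${LOCK_FILE:-/var/run/nfs-mount-check.lock}"

# Steps 1 and 2: the path must be a mount point and must answer within 2 s.
mount_is_healthy() {
    mountpoint -q "$1" 2>/dev/null || return 1
    timeout 2s stat "$1" >/dev/null 2>&1
}

# Step 3: try a forced remount first, then fall back to umount + mount.
# The plain `mount <path>` form relies on an fstab entry for the path.
repair_mount() {
    if mount -o remount -f "$1" 2>/dev/null && mount_is_healthy "$1"; then
        return 0
    fi
    umount -f "$1" 2>/dev/null
    mount "$1" 2>/dev/null && mount_is_healthy "$1"
}

# Step 4 placeholder: the real script force-deletes stuck pods on this node.
delete_stuck_pods() {
    :
}

# Assumed locking scheme: with noclobber (set -C) the redirection fails if
# the file already exists, so creating the lock file is an atomic test-and-set.
acquire_lock() {
    ( set -C; echo "$$" > "$LOCK_FILE" ) 2>/dev/null
}

release_lock() {
    rm -f "$LOCK_FILE"
}

run_check() {
    # Skip this tick if an earlier run still holds the lock.
    acquire_lock || return 0
    status=0
    if ! mount_is_healthy "$MOUNT_POINT"; then
        if repair_mount "$MOUNT_POINT"; then
            delete_stuck_pods
        else
            status=1
        fi
    fi
    release_lock
    return "$status"
}
```

Calling `run_check` as the last line would turn this into a one-shot script suitable for a timer-driven service. Note that a lock taken this way stays behind if the process is killed mid-run, so a production script would need some form of stale-lock handling.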
+
+### Timer configuration
+
+| Parameter | Value | Reason |
+|-----------|-------|--------|
+| `OnBootSec` | 30s | Let network and NFS client start before first check |
+| `OnUnitActiveSec` | 10s | Check interval (was 1 min via cron; now tighter) |
+| `AccuracySec` | 1s | Prevent systemd batching from delaying the 10 s interval |
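
For orientation, the table above maps onto a timer unit along these lines. The sketch is a small shell helper that prints a plausible `nfs-mount-monitor.timer`; the three `[Timer]` values are taken from the table, while the helper name `print_timer_unit`, the `Description` text, and the `[Install]` section are assumptions and may differ from the file in the repo.

```sh
#!/bin/sh
# Hypothetical helper: prints a timer unit matching the documented parameters.
print_timer_unit() {
    cat <<'EOF'
[Unit]
Description=Check the NFS mount every 10 seconds (assumed description)

[Timer]
OnBootSec=30s
OnUnitActiveSec=10s
AccuracySec=1s

[Install]
WantedBy=timers.target
EOF
}
```

Running `print_timer_unit` and comparing its output with `systemctl cat nfs-mount-monitor.timer` on an r-node is one way to confirm that the deployed values match the table.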
+
+### Status and logs
+
+```sh
+systemctl status nfs-mount-monitor.timer
+journalctl -u nfs-mount-monitor -f
+```
+
## AWS S3 Glacier Deep Archive Backups
Encrypted incremental ZFS snapshots from `zdata` pool backed up daily to **AWS S3 Glacier Deep Archive** via cron. Scripts adapted from FreeBSD Home NAS setup. Also performs periodic zpool scrubbing.