diff options
| author | Paul Buetow <paul@buetow.org> | 2026-05-10 11:15:43 +0300 |
|---|---|---|
| committer | Paul Buetow <paul@buetow.org> | 2026-05-10 11:15:43 +0300 |
| commit | 1ab9a49490110d5e26eac2a68dfc1994f4966f73 (patch) | |
| tree | 0a21d1f9ada5fecd22155f9382578b290faeda26 | |
| parent | 360d06ceb88d0965aeb7dc9831f48725ad6ebc89 (diff) | |
f3s/storage: update NFS fstab to hard,timeo=600,retrans=3
Switch r0/r1/r2 NFS client mount options from soft,timeo=10,retrans=2,intr
to hard,timeo=600,retrans=3. Removes the soft mount that was causing EIO on
transient stunnel/CARP hiccups; hard mounts retry indefinitely, which is
safer for DB-like workloads (immich, miniflux, audiobookshelf). The intr
option was a no-op since kernel 2.6.25 and is dropped. Navidrome's /data
is now on local-path so it is unaffected by this change.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| -rw-r--r-- | prompts/skills/f3s/references/storage.md | 80 |
1 files changed, 73 insertions, 7 deletions
diff --git a/prompts/skills/f3s/references/storage.md b/prompts/skills/f3s/references/storage.md index 958a91d..2c9cca6 100644 --- a/prompts/skills/f3s/references/storage.md +++ b/prompts/skills/f3s/references/storage.md @@ -549,7 +549,7 @@ mount -t nfs4 -o port=2323 127.0.0.1:/k3svolumes /data/nfs/k3svolumes `/etc/fstab`: ``` -127.0.0.1:/k3svolumes /data/nfs/k3svolumes nfs4 port=2323,_netdev,soft,timeo=10,retrans=2,intr 0 0 +127.0.0.1:/k3svolumes /data/nfs/k3svolumes nfs4 port=2323,_netdev,hard,timeo=600,retrans=3 0 0 ``` NFS path structure on k3s nodes: `/data/nfs/k3svolumes/<app>/` @@ -568,6 +568,8 @@ NFS path structure on k3s nodes: `/data/nfs/k3svolumes/<app>/` After NFS is restored on the server side, the `nfs-mount-monitor` systemd timer on each r-node will auto-remount within ~10 seconds and force-delete stuck pods. If immediate recovery is needed: `mount /data/nfs/k3svolumes` on each r-node, then delete the stuck pods manually. +**Note:** The monitor catches three failure modes: missing mountpoint, stat hang (reads unresponsive), and **silent write hang** (reads OK but writes block — the hardest case, e.g. stunnel-wrapped NFSv4 after a CARP failover). Watch the consecutive-failure counter via Prometheus (`nfs_mount_monitor_consecutive_failures`) — warning fires at ≥3, critical at ≥5. At 5 consecutive failures the node cordons itself and reboots. + ### Checklist for NFS outage on CARP MASTER (f0 or f1) ```sh @@ -612,11 +614,39 @@ rex -f f3s/r-nodes/Rexfile nfs_mount_monitor ### What it does -1. Checks whether `/data/nfs/k3svolumes` is mounted (`mountpoint`). -2. Checks whether the mount is responsive (`timeout 2s stat`). -3. If either check fails, attempts: `mount -o remount -f`, then `umount -f` + `mount`. -4. On successful repair, force-deletes pods on this node that are stuck in - Unknown / Pending / ContainerCreating so the kubelet can reschedule them. +Three probes run in sequence on every 10-second tick: + +1. **mountpoint probe** — detects completely missing mounts. +2. **stat probe** (`timeout 2s stat`) — detects read hangs / stale cache misses. +3. **write probe** (`timeout 5s sh -c "echo $$ > .healthcheck.<host> && rm -f ..."`) — + detects the "reads OK, writes hang" failure mode. Stunnel-wrapped NFSv4 can enter + a state where `stat` returns from cache but all writes block indefinitely; only this + probe catches it. + +If any probe fails, `fix_mount` runs: + +1. `mount -o remount -f` (cheapest, no disruption if mount is merely stale) +2. Kill D-state processes pinning the mount (`kill_pinning_processes` — SIGKILLs + processes whose `wchan` starts with `nfs_` and whose cwd/fds point into the mountpoint) +3. `umount -f` (force unmount) +4. `umount -l` (lazy detach VFS node if `-f` failed) +5. `systemctl restart stunnel` + 2s sleep (refresh the TLS transport) +6. `mount` (fresh mount via stunnel) + +A hard **60-second deadline** prevents `fix_mount` from outlasting its own timer interval. + +On successful repair, force-deletes pods on this node stuck in +Unknown / Pending / ContainerCreating so the kubelet can reschedule them. + +**Consecutive-failure escalation**: each `fix_mount` failure increments a counter +persisted to `/var/lib/nfs-mount-monitor/fail-count`. At `NFS_FAIL_THRESHOLD=5` +consecutive failures (~50 s), the node cordons itself (`kubectl cordon`) and issues +`systemctl reboot`. + +The counter is also exported to `/var/lib/node_exporter/textfile_collector/nfs_mount_monitor.prom` +so Prometheus can alert on `nfs_mount_monitor_consecutive_failures` without parsing +journal logs (warning ≥3, critical ≥5 — see +`f3s/prometheus/manifests/nfs-mount-monitor-alerts.yaml`). Uses a lock file (`/var/run/nfs-mount-check.lock`) to prevent overlapping runs since the timer fires faster than the script's worst-case runtime. @@ -626,7 +656,7 @@ since the timer fires faster than the script's worst-case runtime. | Parameter | Value | Reason | |-----------|-------|--------| | `OnBootSec` | 30s | Let network and NFS client start before first check | -| `OnUnitActiveSec` | 10s | Check interval (was 1 min via cron; now tighter) | +| `OnUnitActiveSec` | 10s | Check interval; each run is bounded by a 60-second deadline | | `AccuracySec` | 1s | Prevent systemd batching from delaying the 10 s interval | ### Status and logs @@ -640,6 +670,41 @@ journalctl -u nfs-mount-monitor -f Encrypted incremental ZFS snapshots from `zdata` pool backed up daily to **AWS S3 Glacier Deep Archive** via cron. Scripts adapted from FreeBSD Home NAS setup. Also performs periodic zpool scrubbing. +## Local-Path Storage for SQLite Workloads + +Some k3s workloads use `local-path` (k3s default storageClass) instead of NFS for +their data volumes. This is appropriate when: + +- The application uses SQLite: NFS file-lock semantics cause `fcntl()` races on + pod restarts, and `Recreate` strategy only reduces (not eliminates) the risk. +- Cache-heavy workloads: NFS over stunnel adds TLS round-trip latency to every + cache read. Navidrome's image/background cache init took ~19s over NFS; it + takes ~25ms from local disk. + +**Trade-off**: a local-path PV lives on one specific node. If that node is down, +the pod reschedules elsewhere but finds no data volume — it starts with an empty DB, +losing play history, scrobble queue, etc. For a home server this is acceptable. +The deployment must pin the pod to the same node via `nodeSelector` so the local +PV is always reachable. + +### Workloads using local-path + +| App | Node | Path on node | +|-----|------|--------------| +| navidrome `/data` (DB + cache) | r1 | `/var/lib/rancher/k3s/storage/pvc-*_services_navidrome-data-pvc` | + +### Migrating NFS hostPath → local-path + +1. Disable ArgoCD auto-sync: `kubectl patch application <app> -n cicd --type=json -p='[{"op":"replace","path":"/spec/syncPolicy","value":{}}]'` +2. Scale deployment to 0: `kubectl scale deployment <app> -n services --replicas=0` +3. Delete old PVC and static PV. +4. Create new PVC with `storageClassName: local-path`. +5. Create a migration pod pinned to the target node that mounts both the NFS hostPath + (source) and the new PVC (target); copy data with `cp -av /src/. /dst/`. +6. Delete migration pod, apply updated deployment (with `nodeSelector`), scale back up. +7. Re-enable ArgoCD auto-sync and push manifests to git; push to in-cluster git-server + (`git push r0 master`) so ArgoCD picks up the new storageClass spec. + ## Storage Summary | Layer | Technology | Role | @@ -649,5 +714,6 @@ Encrypted incremental ZFS snapshots from `zdata` pool backed up daily to **AWS S | Replication | `zrepl` | Continuous ZFS replication f0→f1 (1min NFS, 10min VM) | | HA | CARP VIP 192.168.1.138 | Automatic failover for NFS/stunnel | | Network | NFS over stunnel | Encrypted shared storage, mutual TLS auth | +| Local-path | k3s local-path provisioner | Node-local storage for SQLite/cache workloads | | LAN access | FreeBSD relayd on CARP VIP | TCP forwarding to k3s :80/:443 | | Backup | S3 Glacier Deep Archive | Off-site encrypted backup | |
