| Age | Commit message | Author |
|
|
Adds the standard nfs-check initContainer to verify the sentinel file
exists before the main Apache container starts. Prevents silent fall-back
to local XFS when NFS is unmounted on the node.
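
For illustration only, the gate this commit describes can be sketched as a check that fails while the sentinel file is absent. The sentinel name `.nfs-sentinel` and both function names are hypothetical; the actual initContainer command is not shown in this log.

```python
import os
import tempfile


def nfs_sentinel_present(sentinel_path):
    """Return True only if the sentinel exists as a regular file."""
    return os.path.isfile(sentinel_path)


def check_exit_code(sentinel_path):
    """Exit code for an init step: 0 lets the main container start, 1 blocks it."""
    return 0 if nfs_sentinel_present(sentinel_path) else 1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as mount_dir:
        sentinel = os.path.join(mount_dir, ".nfs-sentinel")  # hypothetical name
        print(check_exit_code(sentinel))  # sentinel missing: NFS assumed unmounted
        open(sentinel, "w").close()
        print(check_exit_code(sentinel))  # sentinel present: safe to start
```

A non-zero exit keeps the pod in its init phase, so the main container never writes to the local disk underneath an unmounted NFS path.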
|
|
immich-postgres for u5
|
|
Grafana's SQLite-on-NFS persistence is unreliable across restarts (the
new pod can't reacquire a clean exclusive lock after any NFS bounce),
and with Loki + Tempo also gone there's nothing left for it to
visualize. Keeping Prometheus alone for metrics + alerting.
Changes:
- prometheus.yaml: add grafana.enabled=false in the kube-prometheus-stack
values so the subchart no longer renders the grafana deployment/pvc.
- loki.yaml, tempo.yaml, grafana-ingress.yaml: renamed to .disabled
(same pattern as commit 03a18c6) so 'kubectl apply -f argocd-apps/'
stops re-creating them; the cluster Applications were also deleted,
which cascade-removes the helm resources via the resources-finalizer.
- alloy.yaml: drop the loki.write and otelcol.* blocks (no destinations
to ship to). DaemonSet stays deployed with a minimal 'logging' block
so the chart can be re-enabled by restoring the blocks here.
Prometheus TSDB was also wiped (corrupted zero-byte WAL segments from
the same NFS blip that took grafana down) — done separately, not part
of this commit.
|
|
Adds gen-trivy-unresolved-alerts.py which queries Prometheus
(/api/v1/rules + /api/v1/alerts) via kubectl exec and produces
TRIVY-UNRESOLVED-ALERTS.md. The generated *-ALERTS.md snapshots are
gitignored — they're regenerable point-in-time inventories.
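
A minimal sketch of the reporting half of such a script, assuming the JSON body of `/api/v1/alerts` has already been fetched. The function names, the table layout, and the sample alert are invented for illustration. How gen-trivy-unresolved-alerts.py decides what counts as "unresolved" is not stated here, so this version simply lists alerts in state `firing`.

```python
def firing_alerts(payload):
    """Return the alerts in state 'firing' from a /api/v1/alerts response body."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus API call did not succeed")
    alerts = payload.get("data", {}).get("alerts", [])
    return [alert for alert in alerts if alert.get("state") == "firing"]


def render_markdown(alerts):
    """Render alerts as a Markdown table, sorted by alert name."""
    rows = ["| Alert | Severity | Active since |", "| --- | --- | --- |"]
    ordered = sorted(alerts, key=lambda a: a.get("labels", {}).get("alertname", ""))
    for alert in ordered:
        labels = alert.get("labels", {})
        rows.append(
            "| {} | {} | {} |".format(
                labels.get("alertname", "?"),
                labels.get("severity", "?"),
                alert.get("activeAt", "?"),
            )
        )
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    # Made-up sample shaped like a Prometheus /api/v1/alerts response.
    sample = {
        "status": "success",
        "data": {
            "alerts": [
                {
                    "labels": {"alertname": "ExampleAlert", "severity": "warning"},
                    "state": "firing",
                    "activeAt": "2025-01-01T00:00:00Z",
                }
            ]
        },
    }
    print(render_markdown(firing_alerts(sample)), end="")
```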
|
|
Trivy scan jobs do their own DNS lookups for image names and need
registry.lan.buetow.org to resolve from inside the cluster. Adds a
coredns-custom server block pointing the hostname at r0's WireGuard IP
(which matches the k3s registries.yaml mirror target).
|
|
A hard NFS mount that fails enters uninterruptible kernel sleep (D-state)
which SIGKILL cannot wake, so the recovery script hangs forever and the
lockfile stays — silently disabling all subsequent health checks. Switch
the remount to explicit soft,timeo=50,retrans=3 so the kernel gives up
after ~15s, and detect/remove lockfiles older than 90s left behind by a
SIGKILL'd predecessor.
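
The two ideas in this message can be sketched as follows. This is a simplified model, not the monitor script itself: the function names and the lockfile name are hypothetical. `timeo` is given in tenths of a second, and the give-up estimate is the plain product `timeo * retrans` that yields the ~15s figure above; it ignores any backoff the kernel may apply between retries.

```python
import os
import tempfile
import time

STALE_LOCK_SECONDS = 90  # threshold taken from the commit message


def lock_is_stale(lock_path, now=None, max_age=STALE_LOCK_SECONDS):
    """True if the lockfile exists and was last modified more than max_age seconds ago."""
    try:
        mtime = os.stat(lock_path).st_mtime
    except FileNotFoundError:
        return False  # no lock at all, so nothing to clean up
    now = time.time() if now is None else now
    return (now - mtime) > max_age


def approx_give_up_seconds(timeo_deciseconds, retrans):
    """Rough estimate only: per-attempt timeout in seconds times the retry count."""
    return timeo_deciseconds / 10 * retrans


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lock = os.path.join(tmp, "monitor.lock")  # hypothetical lockfile name
        open(lock, "w").close()
        old = time.time() - 120
        os.utime(lock, (old, old))  # pretend a killed predecessor left it behind
        if lock_is_stale(lock):
            os.unlink(lock)
        print(os.path.exists(lock))
        print(approx_give_up_seconds(50, 3))
```

Judging staleness by file age rather than by the owning PID keeps the check free of any call that could itself block on the hung mount.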
|
|
Prevents NFS-lock races during rolling updates. The hostPath PVs point at
an NFS-shared directory mounted on every r-node, so RWO is not actually
enforced across nodes — under the default RollingUpdate strategy the new
pod can start on a different node and grab the same data dir while the
old pod still holds file locks, producing errors like postgres'
"could not write to file postmaster.pid: Unknown error 512".
Applied to: immich-postgres, audiobookshelf, anki-sync-server, registry,
pkgrepo, player, wallabag, miniflux-postgres, opodsync, radicale,
kobo-sync-server, keybr, filebrowser, git-server, goprecords, jellyfin.
(syncthing and navidrome already had it.)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
|
Both apps were causing high CPU pressure on r0 after a cold-start (Trivy
respawning vulnerability scans, multiple replicas competing for image
pulls). Disabled by renaming the ArgoCD Application manifests to
.disabled so 'kubectl apply -f argocd-apps/' no longer picks them up,
and the Applications themselves were deleted from the cluster (with
prune=true the helm-managed resources were removed).
Amp-Thread-ID: https://ampcode.com/threads/T-019e2be9-50a8-7089-b628-b6d844602c13
Co-authored-by: Amp <amp@ampcode.com>
|
|
Use the CronJob timeZone field (GA since k8s 1.27, supported by the k3s
1.32 cluster) so cron interprets the schedule in local time directly,
avoiding manual UTC conversion.
Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea
Co-authored-by: Amp <amp@ampcode.com>
|
|
beets 2.x parses `sources: coverart itunes amazon albumart` as a single
key whose value is "*", rejects it with UnknownPairError, and the entire
fetchart plugin fails to load. Net effect: every job ran "successfully"
but fetched zero cover art (verified: 0/195 albums had artpath set; all
existing cover.jpg files predated the deployment).
Convert sources and cover_names to proper YAML lists so the plugin
loads. Confirmed network egress is fine (CAA + iTunes return HTTP 200).
Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea
Co-authored-by: Amp <amp@ampcode.com>
|
|
beet embedart (no -f) hard-codes a "Modify artwork for N albums (Y/n)?"
confirmation with no flag to suppress it. The CronJob has no stdin, so
the command exits with "stdin stream ended while input required" and
embedart never runs. Pipe `yes` into the command; safety still relies
on embedart.ifempty:no and embedart.compare_threshold:50 from config.
Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea
Co-authored-by: Amp <amp@ampcode.com>
|
|
The ConfigMap mount at /etc/beets is kernel-enforced read-only, so beets
could not write its incremental import state file (state.pickle), which
broke incremental: yes — every nightly run would re-walk the entire
library.
Fix: point BEETSDIR at the writable state PVC (/state) and pass
-c /etc/beets/config.yaml on every beet invocation so the ConfigMap is
still the single source of truth for config.
Also fix the Justfile run-now recipe to use a bash shebang so $() works.
Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea
Co-authored-by: Amp <amp@ampcode.com>
|
|
Adds a beets-based CronJob that runs every night on r1 (where the
Navidrome music PVC lives), fetching external cover.jpg into each album
folder and embedding art into audio files. Idempotent on re-runs:
- import.incremental skips already-known album folders
- fetchart skips albums that already have cover art
- embedart with ifempty:no + compare_threshold:50 only fills missing
embeds and refuses risky overwrites
Navidrome picks new art up via its existing 1h scan; no Navidrome change
required. Reuses navidrome-music-pvc directly (RWO is fine because both
pods pin to r1 via nodeSelector). State (library.db, logs) lives on a
small local-path PVC, regenerable by deleting the PVC.
Files: f3s/beets-art/helm-chart/{Chart.yaml,README.md,templates/*.yaml}
f3s/beets-art/Justfile (status, logs, run-now, suspend, resume, shell)
f3s/argocd-apps/services/beets-art.yaml
Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea
Co-authored-by: Amp <amp@ampcode.com>
|
|
Pinning to a specific version avoids silent breaking upgrades and makes
kubectl rollout undo meaningful. IfNotPresent skips unnecessary re-pulls
on pod restarts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
SQLite over NFS causes two problems: file-lock races on rolling
restarts (fixed with Recreate strategy but underlying fragility
remains), and 19s image-cache init at startup due to stunnel TLS
latency on every cache read.
Replace navidrome-data-pv/pvc (static hostPath over NFS at
/data/nfs/k3svolumes/navidrome/data) with a dynamic local-path PVC
provisioned on r1 (/var/lib/rancher/k3s/storage). Pin the deployment
to r1 via nodeSelector so the local PV is always reachable.
Existing DB and cache migrated: navidrome.db (23 MB), image/background/
plugin caches (~118 MB) copied via a migration pod before first start.
Result: startupTime=41ms (was ~20s), Image cache init=29ms (was ~19s).
Music PVC stays on NFS (200 GB library, unchanged).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
node_exporter runs as uid 65534 (nobody); mktemp creates files with
mode 0600, readable only by their owner (root here). Add chmod 644
before the atomic mv so the node_exporter process can read
nfs_mount_monitor.prom on its scrape.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
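
The same write pattern, sketched in Python rather than the shell the monitor uses: create the temp file in the target directory, widen its mode, then rename it over the final name. The function here is an illustrative stand-in, not a port of the shell helper, and the metric value is made up.

```python
import os
import stat
import tempfile


def write_textfile_metric(directory, filename, name, value):
    """Write one gauge in Prometheus text format, atomically and world-readable."""
    body = "# TYPE {0} gauge\n{0} {1}\n".format(name, value)
    final_path = os.path.join(directory, filename)
    # mkstemp creates the file with mode 0600; keeping it in the same directory
    # makes the later rename a same-filesystem, atomic operation.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=filename + ".")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(body)
        os.chmod(tmp_path, 0o644)  # make it readable before it becomes visible
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as collector_dir:
        path = write_textfile_metric(
            collector_dir,
            "nfs_mount_monitor.prom",
            "nfs_mount_monitor_consecutive_failures",
            0,
        )
        print(oct(stat.S_IMODE(os.stat(path).st_mode)))
```

Running chmod before the rename means a reader never sees the final filename with the restrictive mode, and the temp name does not end in `.prom`.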
|
|
Use extraHostVolumeMounts (prometheus-node-exporter sub-chart key for
host path mounts) instead of extraVolumes/extraVolumeMounts, which are
for general volumes. This correctly wires /var/lib/node_exporter/
textfile_collector into the container so the textfile arg takes effect.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
- check-nfs-mount.sh: write nfs_mount_monitor_consecutive_failures gauge
to /var/lib/node_exporter/textfile_collector/nfs_mount_monitor.prom on
every run (via write_textfile_metric helper, called from write_fail_count
and directly on healthy runs); atomic tmp+mv write prevents partial reads
- Rexfile: create /var/lib/node_exporter/textfile_collector dir on r-nodes
- prometheus.yaml (ArgoCD app): enable textfile_collector in node_exporter
DaemonSet via extraArgs/extraVolumes/extraVolumeMounts; mount host path
/var/lib/node_exporter/textfile_collector into container
- persistence-values.yaml: sync node_exporter textfile_collector config
- nfs-mount-monitor-alerts.yaml: PrometheusRule with two alerts:
NfsMountAutoRepairWarning (>= 3 consecutive failures, severity: warning)
NfsMountAutoRepairCritical (>= 5 consecutive failures, severity: critical)
wired into new 'nfs-alerts' Alertmanager receiver with 30m repeat_interval
Tested: rex deploy succeeded, .prom files present on r0/r1/r2, timer clean.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
Persist a consecutive-failure counter to /var/lib/nfs-mount-monitor/fail-count.
Increment on every fix_mount failure; reset to 0 on any successful repair or
when all three probes pass cleanly. After NFS_FAIL_THRESHOLD (default 5, ~50s)
consecutive failures the node is cordoned via kubectl and rebooted with
'systemctl reboot' so the cluster stops routing pods to a silently broken node.
NFS_FAIL_THRESHOLD is configurable via /etc/default/nfs-mount-monitor (deployed
as EnvironmentFile in the .service unit) without touching the script.
Also fix Rexfile path resolution: __FILE__ inside a Rex task resolves to the
internal Rex loader path, not the Rexfile itself; use realpath($::rexfile)
instead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
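
The counter logic can be modelled roughly as below. This is a sketch, not the shell script: the function names are hypothetical, and treating "after NFS_FAIL_THRESHOLD consecutive failures" as `count >= threshold` is an assumption. The cordon and reboot themselves are left to the caller.

```python
import os
import tempfile

DEFAULT_THRESHOLD = 5  # default NFS_FAIL_THRESHOLD from the commit message


def read_fail_count(path):
    """Read the persisted counter; a missing or unreadable file counts as 0."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (FileNotFoundError, ValueError):
        return 0


def write_fail_count(path, count):
    """Persist the counter so it survives between monitor runs."""
    with open(path, "w") as handle:
        handle.write("{}\n".format(count))


def record_result(path, healthy, threshold=DEFAULT_THRESHOLD):
    """Update the counter for one run and return (count, should_reboot)."""
    if healthy:
        write_fail_count(path, 0)  # any clean run or successful repair resets
        return 0, False
    count = read_fail_count(path) + 1
    write_fail_count(path, count)
    return count, count >= threshold


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as state_dir:
        counter = os.path.join(state_dir, "fail-count")
        for healthy in (False, False, True, False, False, False, False, False):
            print(healthy, record_result(counter, healthy))
```

With the 10-second timer interval mentioned in the commit above, five consecutive failures correspond to the ~50s quoted in the message.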
|
|
Add lazy umount fallback, D-state process killer, stunnel restart, and
60-second hard deadline to prevent fix_mount from looping forever when
processes are stuck in D state on a stale NFSv4-over-stunnel mount.
Recovery sequence is now:
1. mount -o remount -f (cheap, no disruption)
2. kill_pinning_processes (SIGKILL D-state procs with nfs_ wchan)
3. umount -f (force unmount)
4. umount -l (lazy detach VFS node if -f failed)
5. systemctl restart stunnel + 2s sleep (refresh TLS transport)
6. mount (fresh mount)
The 60s deadline uses bash $SECONDS so fix_mount can never outlast its
own 10-second timer interval by an unbounded amount. Deployed to all
three r-nodes (r0/r1/r2) via rex nfs_mount_monitor.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
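
A rough model of a deadline-bounded recovery sequence like the one listed above. It is a sketch under assumptions: each step is a callable that returns True once the mount is healthy again, the deadline is checked before each step starts, and the lambdas stand in for the real commands. The actual fix_mount is a bash function and may order its checks differently.

```python
import time


def run_recovery(steps, deadline_seconds=60.0, clock=time.monotonic):
    """Run (name, action) steps in order until one succeeds or the deadline passes.

    Returns ("fixed", name), ("deadline", name of the step that was skipped),
    or ("exhausted", None) if every step ran without success.
    """
    start = clock()
    for name, action in steps:
        if clock() - start >= deadline_seconds:
            return "deadline", name
        if action():
            return "fixed", name
    return "exhausted", None


if __name__ == "__main__":
    # Stand-in actions; a real version would shell out to the commands named here.
    steps = [
        ("mount -o remount -f", lambda: False),
        ("kill_pinning_processes", lambda: False),
        ("umount -f", lambda: False),
        ("umount -l", lambda: False),
        ("systemctl restart stunnel", lambda: False),
        ("mount", lambda: True),
    ]
    print(run_recovery(steps))
```

Checking the deadline before each step bounds the number of steps started, not the duration of a single step; a step that blocks still needs its own timeout.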
|