conf - Configuration files for the automation of my personal infrastructure (servers, laptops, workstations, phones)!

Age	Commit message	Author
7 hours	goprecords: bump image to 0.5.2 (HEAD, master)	Paul Buetow

19 hours	git-server: keep ArgoCD app source on codeberg (revert internal repo change)	Paul Buetow

20 hours	git-server: switch ArgoCD app to pull from internal git-server	Paul Buetow

20 hours	git-server: rename cgit ingress from cgit.f3s.buetow.org to c-git.f3s.buetow.org	Paul Buetow

8 days	f3s: add robust USB key mounting	Paul Buetow

13 days	example-apache-volume-claim: add NFS sentinel initContainer check	Paul Buetow
	Adds the standard nfs-check initContainer to verify the sentinel file exists before the main Apache container starts. Prevents silent fall-back to local XFS when NFS is unmounted on the node.
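The sentinel check described above can be sketched roughly as follows. This is an illustration in Python, not the chart's actual initContainer command; the sentinel filename `.nfs-sentinel` and the function name are hypothetical. The idea is that the sentinel file lives on the NFS export itself, so it is only visible while the share is really mounted.

```python
import os


def nfs_sentinel_present(mount_dir, sentinel_name=".nfs-sentinel"):
    """Return True only if the sentinel file exists under mount_dir.

    sentinel_name is a hypothetical placeholder. If NFS is unmounted, the
    path resolves to the empty local directory and the check fails, so the
    main container never starts against local disk by accident.
    """
    return os.path.isfile(os.path.join(mount_dir, sentinel_name))
```

An initContainer built on this idea would exit non-zero when the function returns False, which keeps the pod in its init phase until the mount is back.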
13 days	Add NFS sentinel initContainers to wallabag chart for 86	Paul Buetow

13 days	Add NFS sentinel initContainers to xplayer chart for 76	Paul Buetow

13 days	Add NFS sentinel initContainers to filebrowser chart for r5	Paul Buetow

13 days	Add NFS sentinel initContainer to goprecords chart for t5	Paul Buetow

13 days	Replace old wait-for-nfs with standard sentinel initContainer in immich-postgres for u5	Paul Buetow

13 days	Add NFS sentinel initContainers to audiobookshelf chart for q5	Paul Buetow

13 days	Add NFS sentinel initContainer to opodsync chart for 06	Paul Buetow

13 days	Add NFS sentinel initContainer to navidrome chart for z5	Paul Buetow

13 days	Add NFS sentinel initContainer to miniflux-postgres chart for y5	Paul Buetow

13 days	Add NFS sentinel initContainer to kobo-sync-server chart for x5	Paul Buetow

13 days	Add NFS sentinel initContainers to jellyfin chart for v5	Paul Buetow

13 days	Add NFS sentinel initContainer to pkgrepo chart for 16	Paul Buetow

13 days	Add NFS sentinel initContainer to registry chart for 46	Paul Buetow

13 days	Add NFS sentinel initContainers to syncthing chart for 56	Paul Buetow

13 days	Add NFS sentinel initContainers to radicale chart for 36	Paul Buetow

13 days	Add NFS sentinel check to anki chart for o5	Paul Buetow

13 days	Add keybr NFS sentinel initContainer for w5	Paul Buetow

13 days	Add apache NFS sentinel initContainer for p5	Paul Buetow

13 days	Add player NFS sentinel initContainers for task 26	Paul Buetow

13 days	Add NFS sentinel initContainer for git-server s5	Paul Buetow

13 days	Document NFS sentinel initContainer pattern for n5	Paul Buetow

2026-05-24	immich: update to v2.7.5	Paul Buetow

2026-05-16	f3s/monitoring: disable grafana, loki, tempo; reduce alloy to no-op	Paul Buetow
	Grafana's SQLite-on-NFS persistence is unreliable across restarts (the new pod can't reacquire a clean exclusive lock after any NFS bounce), and with Loki + Tempo also gone there's nothing left for it to visualize. Keeping Prometheus alone for metrics + alerting.
	Changes:
	- prometheus.yaml: add grafana.enabled=false in the kube-prometheus-stack values so the subchart no longer renders the grafana deployment/pvc.
	- loki.yaml, tempo.yaml, grafana-ingress.yaml: renamed to .disabled (same pattern as commit 03a18c6) so 'kubectl apply -f argocd-apps/' stops re-creating them; the cluster Applications were also deleted, which cascade-removes the helm resources via the resources-finalizer.
	- alloy.yaml: drop the loki.write and otelcol.* blocks (no destinations to ship to). DaemonSet stays deployed with a minimal 'logging' block so the chart can be re-enabled by restoring the blocks here.
	Prometheus TSDB was also wiped (corrupted zero-byte WAL segments from the same NFS blip that took grafana down); that was done separately and is not part of this commit.
2026-05-16	Update player image tags	Paul Buetow

2026-05-16	Update player image tags	Paul Buetow

2026-05-16	Give player deployments longer startup window	Paul Buetow

2026-05-16	Deploy xplayer and update player image	Paul Buetow

2026-05-16	f3s/prometheus: add trivy unresolved-alerts report generator	Paul Buetow
	Adds gen-trivy-unresolved-alerts.py which queries Prometheus (/api/v1/rules + /api/v1/alerts) via kubectl exec and produces TRIVY-UNRESOLVED-ALERTS.md. The generated *-ALERTS.md snapshots are gitignored — they're regenerable point-in-time inventories.
2026-05-16	f3s/registry: add coredns-custom ConfigMap for in-cluster registry DNS	Paul Buetow
	Trivy scan jobs do their own DNS lookups for image names and need registry.lan.buetow.org to resolve from inside the cluster. Adds a coredns-custom server block pointing the hostname at r0's WireGuard IP (which matches the k3s registries.yaml mirror target).
2026-05-16	nfs-mount-monitor: switch to soft NFS mount + handle stale lockfile	Paul Buetow
	A hard NFS mount that fails enters uninterruptible kernel sleep (D-state) which SIGKILL cannot wake, so the recovery script hangs forever and the lockfile stays — silently disabling all subsequent health checks. Switch the remount to explicit soft,timeo=50,retrans=3 so the kernel gives up after ~15s, and detect/remove lockfiles older than 90s left behind by a SIGKILL'd predecessor.
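The stale-lockfile handling described above might look roughly like this. It is a Python sketch for illustration only; the real monitor is a shell script, and the function name and the use of file mtime as the lock's age are assumptions. Only the 90-second threshold comes from the commit message.

```python
import os
import time

STALE_AFTER_SECONDS = 90  # threshold taken from the commit message


def remove_stale_lock(lock_path, max_age=STALE_AFTER_SECONDS, now=None):
    """Delete lock_path if it is older than max_age seconds.

    Age is measured from the file's mtime (an assumption). Returns True if
    a stale lock was removed, False if there was no lock or it is fresh.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.stat(lock_path).st_mtime
    except FileNotFoundError:
        return False  # no lock present
    if age <= max_age:
        return False  # a live run may still hold the lock
    try:
        os.unlink(lock_path)
    except FileNotFoundError:
        return False  # another process cleaned it up first
    return True
```

With the soft mount bounding each attempt to roughly 15s, a lock older than 90s can reasonably be treated as abandoned rather than held by a live run.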
2026-05-16	f3s: set strategy Recreate on single-replica stateful deployments	Paul Buetow
	Prevents NFS-lock races during rolling updates. The hostPath PVs point at an NFS-shared directory mounted on every r-node, so RWO is not actually enforced across nodes — under the default RollingUpdate strategy the new pod can start on a different node and grab the same data dir while the old pod still holds file locks, producing errors like postgres' "could not write to file postmaster.pid: Unknown error 512". Applied to: immich-postgres, audiobookshelf, anki-sync-server, registry, pkgrepo, player, wallabag, miniflux-postgres, opodsync, radicale, kobo-sync-server, keybr, filebrowser, git-server, goprecords, jellyfin. (syncthing and navidrome already had it.) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15	f3s: disable trivy-operator and tracing-demo (rename to .disabled)	Paul Buetow
	Both apps were causing high CPU pressure on r0 after a cold-start (Trivy respawning vulnerability scans, multiple replicas competing for image pulls). Disabled by renaming the ArgoCD Application manifests to .disabled so 'kubectl apply -f argocd-apps/' no longer picks them up, and the Applications themselves were deleted from the cluster (with prune=true the helm-managed resources were removed). Amp-Thread-ID: https://ampcode.com/threads/T-019e2be9-50a8-7089-b628-b6d844602c13 Co-authored-by: Amp <amp@ampcode.com>
2026-05-13	f3s/beets-art: schedule at noon Europe/Sofia instead of 03:30 UTC	Paul Buetow
	Use the CronJob timeZone field (GA since k8s 1.27, supported by the k3s 1.32 cluster) so cron interprets the schedule in local time directly, avoiding manual UTC conversion. Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea Co-authored-by: Amp <amp@ampcode.com>
2026-05-13	f3s/beets-art: fix fetchart silently disabled by string-not-list config	Paul Buetow
	beets 2.x parses `sources: coverart itunes amazon albumart` as a single key whose value is "*", rejects it with UnknownPairError, and the entire fetchart plugin fails to load. Net effect: every job ran "successfully" but fetched zero cover art (verified: 0/195 albums had artpath set; all existing cover.jpg files predated the deployment). Convert sources and cover_names to proper YAML lists so the plugin loads. Confirmed network egress is fine (CAA + iTunes return HTTP 200). Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea Co-authored-by: Amp <amp@ampcode.com>
2026-05-13	f3s/beets-art: pipe yes into embedart to bypass interactive prompt	Paul Buetow
	beet embedart (no -f) hard-codes a "Modify artwork for N albums (Y/n)?" confirmation with no flag to suppress it. The CronJob has no stdin, so the command exits with "stdin stream ended while input required" and embedart never runs. Pipe `yes` into the command; safety still relies on embedart.ifempty:no and embedart.compare_threshold:50 from config. Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea Co-authored-by: Amp <amp@ampcode.com>
2026-05-13	f3s/beets-art: fix BEETSDIR being read-only (state.pickle write failure)	Paul Buetow
	The ConfigMap mount at /etc/beets is kernel-enforced read-only, so beets could not write its incremental import state file (state.pickle), which broke incremental: yes — every nightly run would re-walk the entire library. Fix: point BEETSDIR at the writable state PVC (/state) and pass -c /etc/beets/config.yaml on every beet invocation so the ConfigMap is still the single source of truth for config. Also fix the Justfile run-now recipe to use a bash shebang so $() works. Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea Co-authored-by: Amp <amp@ampcode.com>
2026-05-13	f3s/beets-art: nightly k3s CronJob to fetch+embed cover art for Navidrome	Paul Buetow
	Adds a beets-based CronJob that runs every night on r1 (where the Navidrome music PVC lives), fetching external cover.jpg into each album folder and embedding art into audio files.
	Idempotent on re-runs:
	- import.incremental skips already-known album folders
	- fetchart skips albums that already have cover art
	- embedart with ifempty:no + compare_threshold:50 only fills missing embeds and refuses risky overwrites
	Navidrome picks new art up via its existing 1h scan; no Navidrome change required. Reuses navidrome-music-pvc directly (RWO is fine because both pods pin to r1 via nodeSelector). State (library.db, logs) lives on a small local-path PVC, regenerable by deleting the PVC.
	Files:
	- f3s/beets-art/helm-chart/{Chart.yaml,README.md,templates/*.yaml}
	- f3s/beets-art/Justfile (status, logs, run-now, suspend, resume, shell)
	- f3s/argocd-apps/services/beets-art.yaml
	Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea
	Co-authored-by: Amp <amp@ampcode.com>
2026-05-10	navidrome: pin image to 0.61.2 and set imagePullPolicy: IfNotPresent	Paul Buetow
	Pinning to a specific version avoids silent breaking upgrades and makes kubectl rollout undo meaningful. IfNotPresent skips unnecessary re-pulls on pod restarts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10	navidrome: move /data PVC from NFS to local-path on r1	Paul Buetow
	SQLite over NFS causes two problems: file-lock races on rolling restarts (fixed with Recreate strategy but underlying fragility remains), and 19s image-cache init at startup due to stunnel TLS latency on every cache read. Replace navidrome-data-pv/pvc (static hostPath over NFS at /data/nfs/k3svolumes/navidrome/data) with a dynamic local-path PVC provisioned on r1 (/var/lib/rancher/k3s/storage). Pin the deployment to r1 via nodeSelector so the local PV is always reachable. Existing DB and cache migrated: navidrome.db (23 MB), image/background/plugin caches (~118 MB) copied via a migration pod before first start. Result: startupTime=41ms (was ~20s), Image cache init=29ms (was ~19s). Music PVC stays on NFS (200 GB library, unchanged). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10	nfs-monitor: make textfile .prom world-readable for node_exporter	Paul Buetow
	node_exporter runs as uid 65534 (nobody); mktemp creates files with mode 600 (root-only). Add chmod 644 before the atomic mv so the node_exporter process can read nfs_mount_monitor.prom on its scrape. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
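The write pattern described above (temp file, chmod 644, then atomic rename) can be sketched as follows. This is a Python illustration, not the shell helper from the repository; the function name `write_prom_gauge` is hypothetical.

```python
import os
import tempfile


def write_prom_gauge(path, name, value):
    """Atomically publish a one-gauge textfile for node_exporter.

    mkstemp, like mktemp in shell, creates the file with mode 600, so the
    mode is widened to 644 *before* the rename. A reader therefore never
    sees a partial file or an unreadable one.
    """
    directory = os.path.dirname(path) or "."
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(f"# TYPE {name} gauge\n{name} {value}\n")
        os.chmod(tmp, 0o644)   # world-readable for uid 65534
        os.replace(tmp, path)  # atomic on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Doing the chmod after the rename would leave a short window in which the scrape could hit a root-only file, which is why the order matters.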
2026-05-10	nfs-monitor: fix node_exporter textfile_collector Helm chart key	Paul Buetow
	Use extraHostVolumeMounts (prometheus-node-exporter sub-chart key for host path mounts) instead of extraVolumes/extraVolumeMounts, which are for general volumes. This correctly wires /var/lib/node_exporter/textfile_collector into the container so the textfile arg takes effect. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10	nfs-monitor: add Prometheus alerts for NFS auto-repair failures	Paul Buetow
	- check-nfs-mount.sh: write nfs_mount_monitor_consecutive_failures gauge to /var/lib/node_exporter/textfile_collector/nfs_mount_monitor.prom on every run (via write_textfile_metric helper, called from write_fail_count and directly on healthy runs); atomic tmp+mv write prevents partial reads
	- Rexfile: create /var/lib/node_exporter/textfile_collector dir on r-nodes
	- prometheus.yaml (ArgoCD app): enable textfile_collector in node_exporter DaemonSet via extraArgs/extraVolumes/extraVolumeMounts; mount host path /var/lib/node_exporter/textfile_collector into container
	- persistence-values.yaml: sync node_exporter textfile_collector config
	- nfs-mount-monitor-alerts.yaml: PrometheusRule with two alerts:
	    NfsMountAutoRepairWarning (>= 3 consecutive failures, severity: warning)
	    NfsMountAutoRepairCritical (>= 5 consecutive failures, severity: critical)
	  wired into new 'nfs-alerts' Alertmanager receiver with 30m repeat_interval
	Tested: rex deploy succeeded, .prom files present on r0/r1/r2, timer clean.
	Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
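To make the two alert thresholds concrete, here is a small Python sketch of which alerts would be active for a given gauge value. It models only the `>= 3` and `>= 5` comparisons named above; any `for:` duration, inhibition rule, or routing in the real PrometheusRule is not modelled, and the function name is hypothetical.

```python
# Thresholds and alert names as given in the commit message.
ALERT_THRESHOLDS = (
    ("NfsMountAutoRepairWarning", 3),
    ("NfsMountAutoRepairCritical", 5),
)


def firing_alerts(consecutive_failures):
    """Return the names of alerts whose threshold the gauge has reached."""
    return [name for name, limit in ALERT_THRESHOLDS
            if consecutive_failures >= limit]
```

Because both expressions are plain `>=` comparisons on the same gauge, a value of 5 or more satisfies the warning and the critical condition at the same time.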
2026-05-10	nfs-mount-monitor: escalate to reboot after N consecutive fix_mount failures	Paul Buetow
	Persist a consecutive-failure counter to /var/lib/nfs-mount-monitor/fail-count. Increment on every fix_mount failure; reset to 0 on any successful repair or when all three probes pass cleanly. After NFS_FAIL_THRESHOLD (default 5, ~50s) consecutive failures the node is cordoned via kubectl and rebooted with 'systemctl reboot' so the cluster stops routing pods to a silently broken node. NFS_FAIL_THRESHOLD is configurable via /etc/default/nfs-mount-monitor (deployed as EnvironmentFile in the .service unit) without touching the script. Also fix Rexfile path resolution: __FILE__ inside a Rex task resolves to the internal Rex loader path, not the Rexfile itself; use realpath($::rexfile) instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
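The counter logic described above can be sketched like this. It is a Python illustration rather than the actual shell implementation; the function names are hypothetical, and the cordon and reboot actions are deliberately left out, so the sketch only reports whether the threshold has been reached.

```python
import os


def read_fail_count(path):
    """Return the stored counter, treating a missing or corrupt file as 0."""
    try:
        with open(path) as fh:
            return max(0, int(fh.read().strip()))
    except (FileNotFoundError, ValueError):
        return 0


def record_check(path, healthy, threshold=5):
    """Persist the consecutive-failure count and report escalation.

    A healthy run (successful repair or clean probes) resets the counter
    to 0; a failed repair increments it. Returns (count, should_escalate),
    where should_escalate is True once count reaches the threshold.
    """
    count = 0 if healthy else read_fail_count(path) + 1
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as fh:
        fh.write(f"{count}\n")
    return count, count >= threshold
```

Keeping the counter in a file rather than in memory is what lets it survive across separate timer-triggered runs of the script.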
2026-05-10	nfs-mount-monitor: strengthen fix_mount recovery sequence	Paul Buetow
	Add lazy umount fallback, D-state process killer, stunnel restart, and 60-second hard deadline to prevent fix_mount from looping forever when processes are stuck in D state on a stale NFSv4-over-stunnel mount.
	Recovery sequence is now:
	1. mount -o remount -f (cheap, no disruption)
	2. kill_pinning_processes (SIGKILL D-state procs with nfs_ wchan)
	3. umount -f (force unmount)
	4. umount -l (lazy detach VFS node if -f failed)
	5. systemctl restart stunnel + 2s sleep (refresh TLS transport)
	6. mount (fresh mount)
	The 60s deadline uses bash $SECONDS so fix_mount can never outlast its own 10-second timer interval by an unbounded amount. Deployed to all three r-nodes (r0/r1/r2) via rex nfs_mount_monitor.
	Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
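The shape of an escalating recovery sequence under a hard deadline can be sketched as below. This is a Python illustration, not the fix_mount function itself. Checking health after every step and stopping at the first success is an assumption, as are all names. The steps are passed in as callables, so nothing here touches real mounts.

```python
import time


def run_recovery(steps, is_healthy, deadline_seconds=60.0, clock=time.monotonic):
    """Run (name, action) steps in order until healthy or out of time.

    Returns ("recovered", step_name), ("deadline", step_name_not_started)
    or ("failed", None). The deadline is checked before each step starts,
    similar in spirit to comparing bash $SECONDS against a limit.
    """
    start = clock()
    for name, action in steps:
        if clock() - start >= deadline_seconds:
            return ("deadline", name)
        action()
        if is_healthy():  # assumption: probe after every step
            return ("recovered", name)
    return ("failed", None)
```

Note that a deadline checked between steps cannot interrupt a step that is itself blocked, so each individual command still needs its own bounded runtime.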