conf - Configuration files for the automation of my personal infrastructure (servers, laptops, workstations, phones)!

Age	Commit message	Author
2026-05-10	nfs-monitor: fix node_exporter textfile_collector Helm chart key	Paul Buetow
	Use extraHostVolumeMounts (prometheus-node-exporter sub-chart key for host path mounts) instead of extraVolumes/extraVolumeMounts, which are for general volumes. This correctly wires /var/lib/node_exporter/ textfile_collector into the container so the textfile arg takes effect. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10	nfs-monitor: add Prometheus alerts for NFS auto-repair failures	Paul Buetow
	- check-nfs-mount.sh: write nfs_mount_monitor_consecutive_failures gauge to /var/lib/node_exporter/textfile_collector/nfs_mount_monitor.prom on every run (via write_textfile_metric helper, called from write_fail_count and directly on healthy runs); atomic tmp+mv write prevents partial reads
	- Rexfile: create /var/lib/node_exporter/textfile_collector dir on r-nodes
	- prometheus.yaml (ArgoCD app): enable textfile_collector in node_exporter DaemonSet via extraArgs/extraVolumes/extraVolumeMounts; mount host path /var/lib/node_exporter/textfile_collector into container
	- persistence-values.yaml: sync node_exporter textfile_collector config
	- nfs-mount-monitor-alerts.yaml: PrometheusRule with two alerts:
	    NfsMountAutoRepairWarning (>= 3 consecutive failures, severity: warning)
	    NfsMountAutoRepairCritical (>= 5 consecutive failures, severity: critical)
	  wired into new 'nfs-alerts' Alertmanager receiver with 30m repeat_interval
	Tested: rex deploy succeeded, .prom files present on r0/r1/r2, timer clean.
	Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
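The "atomic tmp+mv write" mentioned in this commit can be illustrated with a minimal bash sketch. It is not the repository's actual write_textfile_metric helper, whose body is not shown in the log. The TEXTFILE_DIR override and the exact metric lines are assumptions made so that the function can be tried without root.

```shell
#!/usr/bin/env bash
# Sketch only: TEXTFILE_DIR is a hypothetical override; its default is the
# directory named in the commit message.
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}"

write_textfile_metric() {
    local value="$1"
    local target="${TEXTFILE_DIR}/nfs_mount_monitor.prom"
    local tmp

    # Create the temp file next to the target so that mv is a rename on the
    # same filesystem. Its name does not end in .prom.
    tmp="$(mktemp "${target}.XXXXXX")" || return 1

    if ! {
        printf '# TYPE nfs_mount_monitor_consecutive_failures gauge\n'
        printf 'nfs_mount_monitor_consecutive_failures %d\n' "$value"
    } > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi

    chmod 644 "$tmp"        # mktemp creates the file with mode 600
    mv -f "$tmp" "$target"  # readers see the old file or the new one
}
```

Writing to a temporary name first means a scrape that happens mid-write never reads a half-written .prom file.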
2026-05-10	nfs-mount-monitor: escalate to reboot after N consecutive fix_mount failures	Paul Buetow
	Persist a consecutive-failure counter to /var/lib/nfs-mount-monitor/fail-count. Increment on every fix_mount failure; reset to 0 on any successful repair or when all three probes pass cleanly. After NFS_FAIL_THRESHOLD (default 5, ~50s) consecutive failures the node is cordoned via kubectl and rebooted with 'systemctl reboot' so the cluster stops routing pods to a silently broken node.
	NFS_FAIL_THRESHOLD is configurable via /etc/default/nfs-mount-monitor (deployed as EnvironmentFile in the .service unit) without touching the script.
	Also fix Rexfile path resolution: __FILE__ inside a Rex task resolves to the internal Rex loader path, not the Rexfile itself; use realpath($::rexfile) instead.
	Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
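The counter-and-threshold logic described in this commit could look roughly like the following bash sketch. The function names read_fail_count, record_failure, record_success, threshold_reached and escalate are hypothetical, as is the DRY_RUN switch; only write_fail_count, the file path, NFS_FAIL_THRESHOLD and the two escalation commands come from the commit messages. Using the hostname as the Kubernetes node name is also an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: both variables can be overridden for experimentation.
FAIL_COUNT_FILE="${FAIL_COUNT_FILE:-/var/lib/nfs-mount-monitor/fail-count}"
NFS_FAIL_THRESHOLD="${NFS_FAIL_THRESHOLD:-5}"

# Print the persisted counter; a missing, empty or corrupt file counts as 0.
read_fail_count() {
    local n=""
    if [ -r "$FAIL_COUNT_FILE" ]; then
        read -r n < "$FAIL_COUNT_FILE" || true
    fi
    case "$n" in
        ''|*[!0-9]*) n=0 ;;
    esac
    echo "$((10#$n))"   # force base 10 so a leading zero is not read as octal
}

write_fail_count()  { printf '%d\n' "$1" > "$FAIL_COUNT_FILE"; }
record_failure()    { write_fail_count "$(( $(read_fail_count) + 1 ))"; }
record_success()    { write_fail_count 0; }
threshold_reached() { [ "$(read_fail_count)" -ge "$NFS_FAIL_THRESHOLD" ]; }

# Cordon the node, then reboot. DRY_RUN (hypothetical, default 1) only prints.
escalate() {
    local node="${1:-$(hostname)}"   # assumption: node name equals hostname
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "would run: kubectl cordon $node; systemctl reboot"
        return 0
    fi
    kubectl cordon "$node" || true
    systemctl reboot
}
```

A caller would run record_failure after each failed repair, follow it with `threshold_reached && escalate`, and call record_success whenever a repair succeeds or all probes pass.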
2026-05-10	nfs-mount-monitor: strengthen fix_mount recovery sequence	Paul Buetow
	Add lazy umount fallback, D-state process killer, stunnel restart, and 60-second hard deadline to prevent fix_mount from looping forever when processes are stuck in D state on a stale NFSv4-over-stunnel mount.
	Recovery sequence is now:
	  1. mount -o remount -f (cheap, no disruption)
	  2. kill_pinning_processes (SIGKILL D-state procs with nfs_ wchan)
	  3. umount -f (force unmount)
	  4. umount -l (lazy detach VFS node if -f failed)
	  5. systemctl restart stunnel + 2s sleep (refresh TLS transport)
	  6. mount (fresh mount)
	The 60s deadline uses bash $SECONDS so fix_mount can never outlast its own 10-second timer interval by an unbounded amount.
	Deployed to all three r-nodes (r0/r1/r2) via rex nfs_mount_monitor.
	Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
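The $SECONDS-based deadline can be sketched as follows. This is not the deployed fix_mount: the step_* wrappers, the run helper, the DRY_RUN switch, the FIX_DEADLINE variable and the /mnt/nfs mount point are all hypothetical. kill_pinning_processes is named in the commit, but its body is not shown, so it appears here only as a placeholder. Any early exit once the mount is healthy again is left out.

```shell
#!/usr/bin/env bash
MOUNT_POINT="${MOUNT_POINT:-/mnt/nfs}"   # hypothetical mount point
DRY_RUN="${DRY_RUN:-1}"                  # 1 = only print the commands

# Run a command, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

step_remount() { run mount -o remount -f "$MOUNT_POINT"; }
step_kill()    { run kill_pinning_processes; }  # placeholder, body not shown in the log
step_umount()  { run umount -f "$MOUNT_POINT" || run umount -l "$MOUNT_POINT"; }
step_stunnel() { run systemctl restart stunnel && run sleep 2; }
step_mount()   { run mount "$MOUNT_POINT"; }

fix_mount() {
    local limit="${FIX_DEADLINE:-60}"
    local start=$SECONDS   # bash counts seconds since shell start in SECONDS
    local step rc=0
    for step in step_remount step_kill step_umount step_stunnel step_mount; do
        # Never start another step once the time budget is used up.
        if (( SECONDS - start >= limit )); then
            echo "fix_mount: ${limit}s deadline reached before $step" >&2
            return 1
        fi
        "$step" && rc=0 || rc=$?
    done
    return "$rc"   # status of the final mount step
}
```

A check between steps bounds the number of steps started, not the runtime of a single hung command.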
2026-05-10	nfs-mount-monitor: add write-probe to detect 'reads OK, writes hang' state	Paul Buetow
	Stunnel-wrapped NFSv4 can enter a half-broken state where mountpoint(1) returns true and stat(1) completes from cache, but ALL writes hang indefinitely. This was observed on r2 on 2026-05-10 causing navidrome to be unschedulable. The existing two probes passed while writes were dead.
	Add a third probe (write-probe) after the stat probe: write the shell's PID to a per-host .healthcheck.<hostname> file and immediately remove it, wrapped in a 5-second timeout. The per-host filename prevents r0/r1/r2 from racing on the same file. 5s gives one full NFS retransmit window (timeo=10 deciseconds = 1s, retrans=2) plus margin without making the 10-second timer run too long.
	Deployed to r0/r1/r2 via rex nfs_mount_monitor; all three nodes confirmed running the new script (journalctl shows clean exits).
	Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
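A write probe of the kind described here might be sketched like this in bash. The function name write_probe and its argument are hypothetical, and the sketch assumes timeout(1) from GNU coreutils is available. Whether a writer stuck on a dead mount can actually be killed when the timeout fires depends on the mount options and kernel, which this sketch does not address.

```shell
#!/usr/bin/env bash
# Sketch only: returns 0 if a small file could be written and removed under
# the given mount point within 5 seconds, non-zero otherwise.
write_probe() {
    local mount_point="$1"
    # Per-host filename so several nodes never touch the same probe file.
    local probe="${mount_point}/.healthcheck.$(hostname)"

    # The write runs in a child shell under timeout(1), so a hanging write
    # is reported as a failure (exit status 124) instead of blocking.
    timeout 5 bash -c 'echo "$$" > "$1" && rm -f "$1"' _ "$probe"
}
```

Called as `write_probe /path/to/mount`, a non-zero status would be treated the same way as a failed mountpoint or stat probe.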
2026-05-10	f3s/r-nodes: track NFS auto-repair script and systemd units in conf repo	Paul Buetow
	Pull check-nfs-mount.sh, nfs-mount-monitor.service, and nfs-mount-monitor.timer from r0/r1/r2 (confirmed identical on all three nodes) into f3s/r-nodes/nfs-mount-monitor/. Add f3s/r-nodes/Rexfile with an idempotent nfs_mount_monitor task that pushes the files to all three r-nodes as root and reloads systemd when content changes. Wire the new Rexfile into the repo root Rexfile. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10	navidrome: use Recreate strategy	Paul Buetow
	SQLite on RWO PVC can only have one writer. RollingUpdate keeps the old pod alive until the new pod is Ready, but the new pod blocks indefinitely on SQLite open because the old pod still holds the db-shm/db-wal lock. Recreate kills the old pod first. Amp-Thread-ID: https://ampcode.com/threads/T-019e109f-fa43-7467-bb0b-1a4a2d3d0b9e Co-authored-by: Amp <amp@ampcode.com>
2026-05-10	navidrome: add startup/readiness/liveness probes	Paul Buetow
	Prevents Traefik from routing traffic to navidrome before it has finished its ~1m14s cold start (NFS cache warm-up + initial scan), which was causing 501s and 'context canceled' errors right after each cluster boot. Amp-Thread-ID: https://ampcode.com/threads/T-019e109f-fa43-7467-bb0b-1a4a2d3d0b9e Co-authored-by: Amp <amp@ampcode.com>
2026-05-09	add xplayer	Paul Buetow

2026-05-03	Update player image tag	Paul Buetow

2026-05-03	Add player f3s deployment	Paul Buetow

2026-05-03	add player.f3s.buetow.org	Paul Buetow

2026-04-17	frontends/relayd: route anki.f3s.buetow.org directly to NodePort 30800	Paul Buetow
	Bypass Traefik for anki-sync-server to fix HTTP 303 stream failures. The Anki client maps zstd response body read errors to SEE_OTHER (303). This was caused by Traefik's HTTP proxy layer interfering with the binary zstd-compressed response bodies. Route directly to the anki NodePort like Jellyfin's 30096, which avoids the double-proxy issue.
2026-04-17	f3s/anki-sync-server: add debug NodePort service on 30800	Paul Buetow
	Exposes anki-sync-server directly on all k3s nodes at NodePort 30800, bypassing Traefik. Used to isolate whether the HTTP 303 stream failures (Anki client maps zstd body read errors to SEE_OTHER) originate in the Traefik HTTP proxy layer or in the pod itself.
2026-04-16	goprecords: restore for:1h after alert test	Paul Buetow

2026-04-16	goprecords: temp set for:1m for alert test	Paul Buetow

2026-04-16	goprecords: add Prometheus scraping and stale-host alert rule	Paul Buetow
	- service.yaml: add 'metrics' port (8080) so kubernetes SD auto-discovers the /metrics endpoint alongside the existing http port (80)
	- prometheus/manifests/goprecords-alerts.yaml: GoprecordsHostNotReporting fires (warning) when a non-excluded host last reported >5 months ago
	Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16	goprecords: bump to 0.5.1	Paul Buetow
	Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16	goprecords: bump to 0.5.0	Paul Buetow
	Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16	frontends: switch goprecords upload to unified script with separate token file	Paul Buetow
	Deploy goprecords-upload-client.sh from goprecords/scripts/ instead of the inline-token template. Token is now stored in /etc/goprecords-upload.token (mode 600) and the script reads it at runtime. Old goprecords-upload.sh (token baked in, mode 500) is removed. daily.local entry updated to pass GOPRECORDS_HOST=<host> as environment variable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14	frontends: daily goprecords uptimed upload for fishfinger and blowfish.	Paul Buetow
	Add POSIX sh script template deployed to /usr/local/bin/goprecords-upload.sh, invoked from /etc/daily.local. Rex task goprecords_upload installs curl, renders per-host script from geheim secrets/etc/goprecords/<host>.token, and hooks commons. Document token layout and kubectl key creation in README. Made-with: Cursor
2026-04-14	goprecords: deploy image 0.4.1 (HTML without top nav links).	Paul Buetow
	Made-with: Cursor
2026-04-14	goprecords: deploy image 0.4.0 (foo.zone-style HTML and stats order).	Paul Buetow
	Made-with: Cursor
2026-04-14	goprecords: deploy image 0.3.1 (HTML dashboard at /).	Paul Buetow
	Bump Helm and docker-image tags for the new goprecords release. Made-with: Cursor
2026-04-14	Adjust goprecords runtime permissions for cluster storage.	Paul Buetow
	Run the goprecords pod as root and keep the hostPath PV type aligned with the existing immutable volume configuration so ArgoCD can sync cleanly while the service can create and open its auth database on the shared stats path. Made-with: Cursor
2026-04-14	Add goprecords service deployment for f3s.	Paul Buetow
	Introduce Docker build/push workflow, Helm manifests, and ArgoCD application wiring for goprecords so the cluster can deploy the new daemon API service from the private registry. Made-with: Cursor
2026-04-13	always Always	Paul Buetow

2026-04-13	add goprecords.f3s.buetow.org	Paul Buetow

2026-04-11	pihole: docker-pi dnsmasq wildcard, README for pi2/pi3, ArgoCD parity	Paul Buetow
	Add dnsmasq.d wildcard for *.f3s.lan.buetow.org → 192.168.1.138 and example compose for Pis; refresh README (DNS on pi2/pi3, etc-dnsmasq.d). Align dormant ArgoCD Helm customDnsEntries with the same wildcard. Made-with: Cursor
2026-04-10	fix	Paul Buetow

2026-04-10	add ema	Paul Buetow

2026-04-10	acme.sh: skip standby certs for server FQDNs, restart relayd if dead	Paul Buetow
	- Skip standby.blowfish.buetow.org and standby.fishfinger.buetow.org (no DNS records, no httpd/acme-client.conf entries)
	- Use 'rcctl check && reload || restart' for relayd so a dead relayd gets restarted instead of silently failing on reload
	Amp-Thread-ID: https://ampcode.com/threads/T-019d77bf-0537-74e1-a1a9-c1b47d2af392
	Co-authored-by: Amp <amp@ampcode.com>
2026-04-10	snonux.foo: route to Pi backends at /snonux, redirect www	Paul Buetow
	- relayd: route www.snonux.foo to localhost for redirect, keep bare/standby on f3s_static_proxy
	- httpd: www.snonux.foo returns 302 redirect to snonux.foo
	- gogios: monitor pi0/pi1 via wg0.wan.buetow.org instead of lan.buetow.org
	- AGENTS.md: document Pi lighttpd Host-based virtual hosting pattern
	Amp-Thread-ID: https://ampcode.com/threads/T-019d7766-909d-741c-bcb9-1e1e931f1e1b
	Co-authored-by: Amp <amp@ampcode.com>
2026-04-08	Add offline-page fallback for f3s static relay	Paul Buetow

2026-04-08	Return HTTP errors for dead f3s static backends	Paul Buetow

2026-04-08	Route f3s.buetow.org to Pi static backends	Paul Buetow

2026-04-08	add pi0 and pi1	Paul Buetow

2026-04-08	Deactivate Apache ArgoCD application	Paul Buetow
	Amp-Thread-ID: https://ampcode.com/threads/T-019d6da8-3a08-7079-bb2a-eb072c0bf17f Co-authored-by: Amp <amp@ampcode.com>
2026-04-08	h0: document PI phase 3.2 role split	Paul Buetow

2026-04-08	g0: add PI Phase 3.1 verification notes	Paul Buetow

2026-04-08	f0: document Pi-hole phase 2.2 deployment	Paul Buetow

2026-04-08	d0: document PI phase 1.2 static content sync	Paul Buetow

2026-04-08	Document PI phase 2.1 Docker CE for e0	Paul Buetow

2026-04-08	c0: document pi0/pi1 lighttpd phase 1.1	Paul Buetow

2026-04-08	b0: document PI phase 0.2 hostname verification	Paul Buetow

2026-04-08	a0: record PI Phase 0.1 baseline	Paul Buetow

2026-04-08	feat(f3s): deploy Trivy Operator for image CVE scanning (task h)	Paul Buetow
	- ArgoCD app: aquasecurity/trivy-operator in monitoring with ServiceMonitor
	- PrometheusRule for Critical/High trivy_image_vulnerabilities alerts
	- Alertmanager route/receiver for component=trivy (UI; webhook TBD)
	Made-with: Cursor
2026-04-08	garage: bind S3 and admin endpoints on IPv4	Paul Buetow
	Ensure Garage listens on WireGuard IPv4 addresses so relay hosts can reach node S3/admin ports reliably. Made-with: Cursor
2026-04-08	f3s/prometheus: add Garage admin scrape targets (task f)	Paul Buetow
	Add job_name garage for 192.168.2.130-132:3903 with os=freebsd label. Mirror config in additional-scrape-configs-secret for kube apply/ArgoCD. Made-with: Cursor
2026-04-08	garage: Garage 2.2 TOML schema and deploy permissions	Paul Buetow
	Align etc/garage.f*.toml with garage-2.2.0 (metadata_dir, data_dir, rpc_secret, rpc_bind_addr, rpc_public_addr per host, s3_api/admin, replication_factor). Bind RPC on 0.0.0.0:3901 so IPv4 LAN peers can reach nodes on FreeBSD. Install config as root:garage 640 so the rc.d garage user can read garage.toml. Made-with: Cursor