| Age | Commit message | Author |
|
Grafana's SQLite-on-NFS persistence is unreliable across restarts (the
new pod can't reacquire a clean exclusive lock after any NFS bounce),
and with Loki + Tempo also gone there's nothing left for it to
visualize. Keeping Prometheus alone for metrics + alerting.
Changes:
- prometheus.yaml: add grafana.enabled=false in the kube-prometheus-stack
  values so the subchart no longer renders the grafana deployment/pvc.
- loki.yaml, tempo.yaml, grafana-ingress.yaml: renamed to .disabled
  (same pattern as commit 03a18c6) so 'kubectl apply -f argocd-apps/'
  stops re-creating them; the cluster Applications were also deleted,
  which cascade-removes the helm resources via the resources-finalizer.
- alloy.yaml: drop the loki.write and otelcol.* blocks (no destinations
  to ship to). DaemonSet stays deployed with a minimal 'logging' block
  so the chart can be re-enabled by restoring the blocks here.
Prometheus TSDB was also wiped (corrupted zero-byte WAL segments from
the same NFS blip that took grafana down) — done separately, not part
of this commit.
|
|
Adds gen-trivy-unresolved-alerts.py which queries Prometheus
(/api/v1/rules + /api/v1/alerts) via kubectl exec and produces
TRIVY-UNRESOLVED-ALERTS.md. The generated *-ALERTS.md snapshots are
gitignored — they're regenerable point-in-time inventories.
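
A minimal sketch of the kind of rendering such a generator might do; it is not the actual gen-trivy-unresolved-alerts.py. It assumes the JSON shape returned by the Prometheus /api/v1/alerts endpoint. The function name, the "firing only" filter, the table layout, and the sample payload (including the alert names) are all hypothetical.

```python
import json


def render_alerts_md(payload: dict) -> str:
    """Render firing alerts from an /api/v1/alerts response as a Markdown table."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    alerts = payload.get("data", {}).get("alerts", [])
    # Assumption: only alerts in state "firing" are of interest.
    firing = [a for a in alerts if a.get("state") == "firing"]
    # Sort for stable, diff-friendly output.
    firing.sort(key=lambda a: (a.get("labels", {}).get("alertname", ""),
                               a.get("activeAt", "")))
    lines = ["| Alert | Severity | Active since |", "|---|---|---|"]
    for a in firing:
        labels = a.get("labels", {})
        lines.append("| {} | {} | {} |".format(
            labels.get("alertname", "?"),
            labels.get("severity", "?"),
            a.get("activeAt", "?"),
        ))
    return "\n".join(lines) + "\n"


# Hypothetical sample payload, shaped like a real API response.
SAMPLE = json.loads("""
{"status": "success", "data": {"alerts": [
  {"labels": {"alertname": "ExampleVulnHigh", "severity": "warning"},
   "state": "firing", "activeAt": "2026-01-01T00:00:00Z", "value": "1e+00"},
  {"labels": {"alertname": "ExampleVulnPending", "severity": "warning"},
   "state": "pending", "activeAt": "2026-01-02T00:00:00Z", "value": "1e+00"}
]}}
""")

if __name__ == "__main__":
    print(render_alerts_md(SAMPLE), end="")
```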
|
|
Trivy scan jobs do their own DNS lookups for image names and need
registry.lan.buetow.org to resolve from inside the cluster. Adds a
coredns-custom server block pointing the hostname at r0's WireGuard IP
(which matches the k3s registries.yaml mirror target).
|
|
Adds FreeBSD .tpl variants of the existing dserver templates and a
matching pkg-dtail-freebsd.sh packaging script, plus a pkg-dtail-rpm.sh
script and packages/files/dtail-rocky/ (systemd units, key-cache script,
dtail.json) for the Rocky Linux dtail build.
|
|
A process blocked on a failed hard NFS mount enters uninterruptible kernel
sleep (D state), which SIGKILL cannot wake, so the recovery script hangs
forever and its lockfile stays in place, silently disabling all subsequent
health checks. Switch the remount to explicit soft,timeo=50,retrans=3 so the
kernel gives up after ~15s, and detect and remove lockfiles older than 90s
left behind by a SIGKILL'd predecessor.
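
The stale-lockfile check can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the deployed shell script: the function names are hypothetical, staleness is judged by file mtime, and only the 90s threshold comes from the message above.

```python
import contextlib
import os
import time

STALE_AFTER_S = 90  # threshold taken from the commit message


def lock_is_stale(path, now=None, max_age_s=STALE_AFTER_S):
    """Return True if the lockfile exists and its mtime is older than max_age_s."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return False
    now = time.time() if now is None else now
    return (now - mtime) > max_age_s


def acquire_lock(path, now=None):
    """Drop a stale lock, then try to create the lockfile atomically."""
    if lock_is_stale(path, now=now):
        # Another run may have removed it in the meantime; that is fine.
        with contextlib.suppress(FileNotFoundError):
            os.remove(path)
    try:
        # O_EXCL makes creation fail if a live run already holds the lock.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True
```

A fresh lock blocks a second run, while a lock older than the threshold is treated as left over from a killed predecessor and replaced.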
|
|
Prevents NFS-lock races during rolling updates. The hostPath PVs point at
an NFS-shared directory mounted on every r-node, so RWO is not actually
enforced across nodes — under the default RollingUpdate strategy the new
pod can start on a different node and grab the same data dir while the
old pod still holds file locks, producing errors like postgres'
"could not write to file postmaster.pid: Unknown error 512".
Applied to: immich-postgres, audiobookshelf, anki-sync-server, registry,
pkgrepo, player, wallabag, miniflux-postgres, opodsync, radicale,
kobo-sync-server, keybr, filebrowser, git-server, goprecords, jellyfin.
(syncthing and navidrome already had it.)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
|
Both apps were causing high CPU pressure on r0 after a cold-start (Trivy
respawning vulnerability scans, multiple replicas competing for image
pulls). Disabled by renaming the ArgoCD Application manifests to
.disabled so 'kubectl apply -f argocd-apps/' no longer picks them up,
and the Applications themselves were deleted from the cluster (with
prune=true the helm-managed resources were removed).
Amp-Thread-ID: https://ampcode.com/threads/T-019e2be9-50a8-7089-b628-b6d844602c13
Co-authored-by: Amp <amp@ampcode.com>
|
|
Use the CronJob timeZone field (GA since k8s 1.27, supported by the k3s
1.32 cluster) so cron interprets the schedule in local time directly,
avoiding manual UTC conversion.
Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea
Co-authored-by: Amp <amp@ampcode.com>
|
|
beets 2.x parses `sources: coverart itunes amazon albumart` as a single
key whose value is "*", rejects it with UnknownPairError, and the entire
fetchart plugin fails to load. Net effect: every job ran "successfully"
but fetched zero cover art (verified: 0/195 albums had artpath set; all
existing cover.jpg files predated the deployment).
Convert sources and cover_names to proper YAML lists so the plugin
loads. Confirmed network egress is fine (CAA + iTunes return HTTP 200).
Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea
Co-authored-by: Amp <amp@ampcode.com>
|
|
beet embedart (no -f) hard-codes a "Modify artwork for N albums (Y/n)?"
confirmation with no flag to suppress it. The CronJob has no stdin, so
the command exits with "stdin stream ended while input required" and
embedart never runs. Pipe `yes` into the command; safety still relies
on embedart.ifempty:no and embedart.compare_threshold:50 from config.
Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea
Co-authored-by: Amp <amp@ampcode.com>
|
|
The ConfigMap mount at /etc/beets is kernel-enforced read-only, so beets
could not write its incremental import state file (state.pickle), which
broke incremental: yes — every nightly run would re-walk the entire
library.
Fix: point BEETSDIR at the writable state PVC (/state) and pass
-c /etc/beets/config.yaml on every beet invocation so the ConfigMap is
still the single source of truth for config.
Also fix the Justfile run-now recipe to use a bash shebang so $() works.
Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea
Co-authored-by: Amp <amp@ampcode.com>
|
|
Adds a beets-based CronJob that runs every night on r1 (where the
Navidrome music PVC lives), fetching external cover.jpg into each album
folder and embedding art into audio files. Idempotent on re-runs:
- import.incremental skips already-known album folders
- fetchart skips albums that already have cover art
- embedart with ifempty:no + compare_threshold:50 only fills missing
embeds and refuses risky overwrites
Navidrome picks new art up via its existing 1h scan; no Navidrome change
required. Reuses navidrome-music-pvc directly (RWO is fine because both
pods pin to r1 via nodeSelector). State (library.db, logs) lives on a
small local-path PVC, regenerable by deleting the PVC.
Files: f3s/beets-art/helm-chart/{Chart.yaml,README.md,templates/*.yaml}
f3s/beets-art/Justfile (status, logs, run-now, suspend, resume, shell)
f3s/argocd-apps/services/beets-art.yaml
Amp-Thread-ID: https://ampcode.com/threads/T-019e223a-d137-705e-879b-84130c0e78ea
Co-authored-by: Amp <amp@ampcode.com>
|
|
Pinning to a specific version avoids silent breaking upgrades and makes
kubectl rollout undo meaningful. IfNotPresent skips unnecessary re-pulls
on pod restarts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
SQLite over NFS causes two problems: file-lock races on rolling
restarts (fixed with Recreate strategy but underlying fragility
remains), and 19s image-cache init at startup due to stunnel TLS
latency on every cache read.
Replace navidrome-data-pv/pvc (static hostPath over NFS at
/data/nfs/k3svolumes/navidrome/data) with a dynamic local-path PVC
provisioned on r1 (/var/lib/rancher/k3s/storage). Pin the deployment
to r1 via nodeSelector so the local PV is always reachable.
Existing DB and cache migrated: navidrome.db (23 MB), image/background/
plugin caches (~118 MB) copied via a migration pod before first start.
Result: startupTime=41ms (was ~20s), Image cache init=29ms (was ~19s).
Music PVC stays on NFS (200 GB library, unchanged).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
node_exporter runs as uid 65534 (nobody), but mktemp creates files with
mode 600 (owner-only, here root). Add chmod 644 before the atomic mv so the
node_exporter process can read nfs_mount_monitor.prom on its scrape.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
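
The write-then-chmod-then-rename order described above can be sketched in Python as follows. This is only an illustration of the pattern; the real helper is a shell function, and the Python function signature and the ".tmp-" prefix here are assumptions.

```python
import os
import tempfile


def write_textfile_metric(path, name, value):
    """Atomically publish a single gauge in Prometheus text format."""
    directory = os.path.dirname(path) or "."
    # mkstemp creates the file with mode 0600, like mktemp(1); creating it in
    # the target directory keeps the final rename on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(f"# TYPE {name} gauge\n{name} {value}\n")
        # Widen permissions before the rename so a reader never sees an
        # unreadable file under the final name.
        os.chmod(tmp, 0o644)
        os.replace(tmp, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```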
|
|
Use extraHostVolumeMounts (prometheus-node-exporter sub-chart key for
host path mounts) instead of extraVolumes/extraVolumeMounts, which are
for general volumes. This correctly wires /var/lib/node_exporter/
textfile_collector into the container so the textfile arg takes effect.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
- check-nfs-mount.sh: write nfs_mount_monitor_consecutive_failures gauge
  to /var/lib/node_exporter/textfile_collector/nfs_mount_monitor.prom on
  every run (via write_textfile_metric helper, called from write_fail_count
  and directly on healthy runs); atomic tmp+mv write prevents partial reads
- Rexfile: create /var/lib/node_exporter/textfile_collector dir on r-nodes
- prometheus.yaml (ArgoCD app): enable textfile_collector in node_exporter
  DaemonSet via extraArgs/extraVolumes/extraVolumeMounts; mount host path
  /var/lib/node_exporter/textfile_collector into container
- persistence-values.yaml: sync node_exporter textfile_collector config
- nfs-mount-monitor-alerts.yaml: PrometheusRule with two alerts:
    NfsMountAutoRepairWarning (>= 3 consecutive failures, severity: warning)
    NfsMountAutoRepairCritical (>= 5 consecutive failures, severity: critical)
  wired into new 'nfs-alerts' Alertmanager receiver with 30m repeat_interval
Tested: rex deploy succeeded, .prom files present on r0/r1/r2, timer clean.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
Persist a consecutive-failure counter to /var/lib/nfs-mount-monitor/fail-count.
Increment on every fix_mount failure; reset to 0 on any successful repair or
when all three probes pass cleanly. After NFS_FAIL_THRESHOLD (default 5, ~50s)
consecutive failures the node is cordoned via kubectl and rebooted with
'systemctl reboot' so the cluster stops routing pods to a silently broken node.
NFS_FAIL_THRESHOLD is configurable via /etc/default/nfs-mount-monitor (deployed
as EnvironmentFile in the .service unit) without touching the script.
Also fix Rexfile path resolution: __FILE__ inside a Rex task resolves to the
internal Rex loader path, not the Rexfile itself; use realpath($::rexfile)
instead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
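
The counter logic can be sketched like this. It is a Python stand-in for the shell implementation, under these assumptions: the function names are hypothetical, the counter file holds a single integer, and an unreadable or missing file counts as 0. The cordon and reboot actions themselves are left to the caller.

```python
def read_fail_count(path):
    """Return the persisted counter, treating a missing or garbled file as 0."""
    try:
        with open(path) as f:
            return int(f.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        return 0


def record_result(path, healthy, threshold=5):
    """Persist the new counter and report whether the threshold is reached.

    healthy=True covers both a clean probe run and a successful repair;
    healthy=False is a failed repair attempt.
    """
    count = 0 if healthy else read_fail_count(path) + 1
    with open(path, "w") as f:
        f.write(f"{count}\n")
    return count, count >= threshold
```

With the default threshold, four failed runs return False for the second element and the fifth returns True; any healthy run resets the count to 0.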
|
|
Add lazy umount fallback, D-state process killer, stunnel restart, and
60-second hard deadline to prevent fix_mount from looping forever when
processes are stuck in D state on a stale NFSv4-over-stunnel mount.
Recovery sequence is now:
1. mount -o remount -f (cheap, no disruption)
2. kill_pinning_processes (SIGKILL D-state procs with nfs_ wchan)
3. umount -f (force unmount)
4. umount -l (lazy detach VFS node if -f failed)
5. systemctl restart stunnel + 2s sleep (refresh TLS transport)
6. mount (fresh mount)
The 60s deadline uses bash $SECONDS, so fix_mount may overrun its
10-second timer interval but only by a bounded amount. Deployed to all
three r-nodes (r0/r1/r2) via rex nfs_mount_monitor.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
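
A rough Python sketch of an ordered recovery sequence guarded by a deadline. It is not the deployed bash code: the function name, the step callables, the health check after each step, and the returned outcome strings are all assumptions made for illustration. Only the step ordering and the 60s budget come from the message above.

```python
import time


def run_recovery(steps, is_healthy, deadline_s=60.0, clock=time.monotonic):
    """Run (name, callable) steps in order until healthy or out of time.

    Returns the name of the step after which the mount was healthy,
    "deadline" if the time budget ran out, or "exhausted" if every step
    ran without success.
    """
    start = clock()
    for name, step in steps:
        # The deadline is evaluated between steps only, so a step that
        # itself hangs is not interrupted by this check.
        if clock() - start >= deadline_s:
            return "deadline"
        step()
        if is_healthy():
            return name
    return "exhausted"
```

Passing the clock as a parameter keeps the deadline logic testable without real waiting.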
|
|
Stunnel-wrapped NFSv4 can enter a half-broken state where mountpoint(1)
returns true and stat(1) completes from cache, but ALL writes hang
indefinitely. This was observed on r2 on 2026-05-10 causing navidrome
to be unschedulable. The existing two probes passed while writes were
dead.
Add a third probe (write-probe) after the stat probe: write the shell's
PID to a per-host .healthcheck.<hostname> file and immediately remove it,
wrapped in a 5-second timeout. The per-host filename prevents r0/r1/r2
from racing on the same file. 5s gives one full NFS retransmit window
(timeo=10 deciseconds = 1s, retrans=2) plus margin without making the
10-second timer run too long.
Deployed to r0/r1/r2 via rex nfs_mount_monitor; all three nodes
confirmed running the new script (journalctl shows clean exits).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
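
The write probe can be sketched in Python along these lines. The deployed probe is a shell snippet, so this is only an analogue: the function name is hypothetical and a child Python process stands in for the shell. The per-host .healthcheck.<hostname> filename and the 5-second timeout follow the message above.

```python
import os
import socket
import subprocess
import sys

# Child program: write our PID to the probe file, then remove it.
_PROBE_CODE = (
    "import os, sys\n"
    "p = sys.argv[1]\n"
    "with open(p, 'w') as f:\n"
    "    f.write(str(os.getpid()))\n"
    "os.remove(p)\n"
)


def write_probe(mount_dir, timeout_s=5.0):
    """Return True if a small write and delete on mount_dir finishes in time."""
    probe = os.path.join(mount_dir, ".healthcheck." + socket.gethostname())
    # Run the write in a child process so a hung write cannot block the caller.
    child = subprocess.Popen(
        [sys.executable, "-c", _PROBE_CODE, probe],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in D state will not die until
        # the kernel releases it, so we do not wait for it here.
        child.kill()
        return False
```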
|
|
Pull check-nfs-mount.sh, nfs-mount-monitor.service, and
nfs-mount-monitor.timer from r0/r1/r2 (confirmed identical on all
three nodes) into f3s/r-nodes/nfs-mount-monitor/. Add
f3s/r-nodes/Rexfile with an idempotent nfs_mount_monitor task that
pushes the files to all three r-nodes as root and reloads systemd when
content changes. Wire the new Rexfile into the repo root Rexfile.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
SQLite on RWO PVC can only have one writer. RollingUpdate keeps the old
pod alive until the new pod is Ready, but the new pod blocks
indefinitely on SQLite open because the old pod still holds the
db-shm/db-wal lock. Recreate kills the old pod first.
Amp-Thread-ID: https://ampcode.com/threads/T-019e109f-fa43-7467-bb0b-1a4a2d3d0b9e
Co-authored-by: Amp <amp@ampcode.com>
|
|
Prevents Traefik from routing traffic to navidrome before it has
finished its ~1m14s cold start (NFS cache warm-up + initial scan),
which was causing 501s and 'context canceled' errors right after
each cluster boot.
Amp-Thread-ID: https://ampcode.com/threads/T-019e109f-fa43-7467-bb0b-1a4a2d3d0b9e
Co-authored-by: Amp <amp@ampcode.com>
|
|
|
Bypass Traefik for anki-sync-server to fix HTTP 303 stream failures.
The Anki client maps zstd response body read errors to SEE_OTHER (303).
This was caused by Traefik's HTTP proxy layer interfering with the binary
zstd-compressed response bodies. Route directly to the anki NodePort like
Jellyfin's 30096, which avoids the double-proxy issue.
|
|
Exposes anki-sync-server directly on all k3s nodes at NodePort 30800,
bypassing Traefik. Used to isolate whether the HTTP 303 stream failures
(Anki client maps zstd body read errors to SEE_OTHER) originate in the
Traefik HTTP proxy layer or in the pod itself.
|
|
- service.yaml: add 'metrics' port (8080) so kubernetes SD auto-discovers
the /metrics endpoint alongside the existing http port (80)
- prometheus/manifests/goprecords-alerts.yaml: GoprecordsHostNotReporting
fires (warning) when a non-excluded host last reported >5 months ago
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
Deploy goprecords-upload-client.sh from goprecords/scripts/ instead of the
inline-token template. Token is now stored in /etc/goprecords-upload.token
(mode 600) and the script reads it at runtime. Old goprecords-upload.sh
(token baked in, mode 500) is removed. daily.local entry updated to pass
GOPRECORDS_HOST=<host> as environment variable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
Add POSIX sh script template deployed to /usr/local/bin/goprecords-upload.sh,
invoked from /etc/daily.local. Rex task goprecords_upload installs curl, renders
per-host script from geheim secrets/etc/goprecords/<host>.token, and hooks
commons. Document token layout and kubectl key creation in README.
Made-with: Cursor
|
|
Made-with: Cursor
|
|
Made-with: Cursor
|
|
Bump Helm and docker-image tags for the new goprecords release.
Made-with: Cursor
|
|
Run the goprecords pod as root and keep the hostPath PV type aligned with the existing immutable volume configuration so ArgoCD can sync cleanly while the service can create and open its auth database on the shared stats path.
Made-with: Cursor
|
|
Introduce Docker build/push workflow, Helm manifests, and ArgoCD application wiring for goprecords so the cluster can deploy the new daemon API service from the private registry.
Made-with: Cursor
|
|
Add dnsmasq.d wildcard for *.f3s.lan.buetow.org → 192.168.1.138 and
example compose for Pis; refresh README (DNS on pi2/pi3, etc-dnsmasq.d).
Align dormant ArgoCD Helm customDnsEntries with the same wildcard.
Made-with: Cursor
|