| Age | Commit message | Author |
|
Grafana's SQLite-on-NFS persistence is unreliable across restarts (the
new pod can't reacquire a clean exclusive lock after any NFS bounce),
and with Loki + Tempo also gone there's nothing left for it to
visualize. Keeping Prometheus alone for metrics + alerting.
Changes:
- prometheus.yaml: add grafana.enabled=false in the kube-prometheus-stack
values so the subchart no longer renders the grafana deployment/pvc.
- loki.yaml, tempo.yaml, grafana-ingress.yaml: renamed to .disabled
(same pattern as commit 03a18c6) so 'kubectl apply -f argocd-apps/'
stops re-creating them; the cluster Applications were also deleted,
which cascade-removes the helm resources via the resources-finalizer.
- alloy.yaml: drop the loki.write and otelcol.* blocks (no destinations
to ship to). DaemonSet stays deployed with a minimal 'logging' block
so the chart can be re-enabled by restoring the blocks here.
Prometheus TSDB was also wiped (corrupted zero-byte WAL segments from
the same NFS blip that took grafana down); that was done separately and
is not part of this commit.
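For reference, a minimal sketch of the values change described for
prometheus.yaml. It is shown as a bare kube-prometheus-stack values
fragment; how the values are embedded in the ArgoCD Application is not
shown in this commit message, so that part is left out.

```yaml
# kube-prometheus-stack values (fragment, sketch only)
grafana:
  # Stops the grafana subchart from rendering its Deployment and PVC.
  enabled: false
```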
|
|
Use extraHostVolumeMounts (prometheus-node-exporter sub-chart key for
host path mounts) instead of extraVolumes/extraVolumeMounts, which are
for general volumes. This correctly wires
/var/lib/node_exporter/textfile_collector into the container so the
textfile arg takes effect.
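A sketch of what the resulting values might look like, assuming the
node_exporter settings sit under the prometheus-node-exporter key of
kube-prometheus-stack. The volume name and the readOnly setting are
assumptions, not taken from the commit.

```yaml
# kube-prometheus-stack values (fragment, sketch only)
prometheus-node-exporter:
  extraArgs:
    # Points the textfile collector at the mounted directory.
    - --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
  extraHostVolumeMounts:
    - name: textfile-collector   # hypothetical volume name
      hostPath: /var/lib/node_exporter/textfile_collector
      mountPath: /var/lib/node_exporter/textfile_collector
      readOnly: true             # assumption: the exporter only reads .prom files
```

Helm replaces list values wholesale, so any extraArgs defaults supplied
by the parent chart would need to be repeated alongside the new flag.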
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
- check-nfs-mount.sh: write nfs_mount_monitor_consecutive_failures gauge
to /var/lib/node_exporter/textfile_collector/nfs_mount_monitor.prom on
every run (via write_textfile_metric helper, called from write_fail_count
and directly on healthy runs); atomic tmp+mv write prevents partial reads
- Rexfile: create /var/lib/node_exporter/textfile_collector dir on r-nodes
- prometheus.yaml (ArgoCD app): enable textfile_collector in node_exporter
DaemonSet via extraArgs/extraVolumes/extraVolumeMounts; mount host path
/var/lib/node_exporter/textfile_collector into container
- persistence-values.yaml: sync node_exporter textfile_collector config
- nfs-mount-monitor-alerts.yaml: PrometheusRule with two alerts:
NfsMountAutoRepairWarning (>= 3 consecutive failures, severity: warning)
NfsMountAutoRepairCritical (>= 5 consecutive failures, severity: critical)
wired into new 'nfs-alerts' Alertmanager receiver with 30m repeat_interval
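The two alerts above could be expressed roughly as follows. This is a
sketch, not the contents of nfs-mount-monitor-alerts.yaml: the
namespace, the release label, and the group name are assumptions, and
annotations are omitted.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nfs-mount-monitor-alerts
  namespace: monitoring          # assumption
  labels:
    release: prometheus          # assumption: must match the operator's rule selector
spec:
  groups:
    - name: nfs-mount-monitor    # hypothetical group name
      rules:
        - alert: NfsMountAutoRepairWarning
          expr: nfs_mount_monitor_consecutive_failures >= 3
          labels:
            severity: warning
        - alert: NfsMountAutoRepairCritical
          expr: nfs_mount_monitor_consecutive_failures >= 5
          labels:
            severity: critical
```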
Tested: rex deploy succeeded, .prom files present on r0/r1/r2, timer clean.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
- ArgoCD app: aquasecurity/trivy-operator in monitoring with ServiceMonitor
- PrometheusRule for Critical/High trivy_image_vulnerabilities alerts
- Alertmanager route/receiver for component=trivy (alerts visible in the UI only; webhook TBD)
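A sketch of an Alertmanager route of this shape, assuming the trivy
alerts carry a component="trivy" label. Both receiver names are
hypothetical. A receiver with no integrations still leaves the alerts
visible in the Alertmanager UI, which fits the "webhook TBD" state.

```yaml
# Alertmanager configuration (fragment, sketch only)
route:
  receiver: default              # hypothetical fallback receiver
  routes:
    - receiver: trivy-alerts     # hypothetical receiver name
      matchers:
        - component = "trivy"
receivers:
  - name: default
  - name: trivy-alerts           # no webhook_configs yet
```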
Made-with: Cursor
|
|
Changes all application manifests to use HTTP git backend instead of SSH:
- From: ssh://git@git-server.cicd.svc.cluster.local/repos/repos/conf.git
- To: http://git-server.cicd.svc.cluster.local/conf.git
Benefits:
- No SSH agent or key management required
- No issues with changing SSH host keys on pod restarts
- Simpler ArgoCD configuration
- HTTP git-http-backend now fully functional
Updated applications:
- monitoring: prometheus, grafana-ingress, pushgateway (3)
- services: anki-sync-server, audiobookshelf, filebrowser, immich, keybr,
kobo-sync-server, miniflux, opodsync, radicale, syncthing, tracing-demo,
wallabag, webdav (13)
- infra: registry (1)
- test: example-apache-volume-claim (1)
Total: 18 applications migrated to HTTP
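For illustration, a migrated Application might look like the sketch
below. Only the repoURL comes from this commit; the ArgoCD namespace,
project, targetRevision, and path are assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: pushgateway
  namespace: argocd              # assumption
spec:
  project: default               # assumption
  source:
    repoURL: http://git-server.cicd.svc.cluster.local/conf.git
    targetRevision: HEAD         # assumption
    path: pushgateway            # hypothetical path inside conf.git
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
```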
|
|
Updated 17 application manifests to use internal git-server:
- Monitoring: grafana-ingress, prometheus, pushgateway
- Services: anki-sync-server, audiobookshelf, filebrowser, immich, keybr,
kobo-sync-server, miniflux, opodsync, radicale, syncthing,
tracing-demo, wallabag, webdav
- Infra: registry
All applications now fetch from:
ssh://git@git-server.cicd.svc.cluster.local/repos/repos/conf.git
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
K3s embeds kube-proxy and kube-scheduler functionality into the main
k3s server process, unlike standard Kubernetes where they run as
separate components.
This change disables monitoring for these components to prevent
false-positive critical alerts:
- KubeProxyDown
- KubeSchedulerDown
These alerts were firing because kube-prometheus-stack expects
standard Kubernetes architecture with separate kube-proxy and
kube-scheduler pods/processes.
Cluster info:
- Running k3s v1.32.6+k3s1
- 3 control-plane nodes (r0, r1, r2)
- Components embedded in k3s binary
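A sketch of the corresponding kube-prometheus-stack values, assuming
the change uses the chart's per-component toggles. The commit message
does not say which keys were actually set.

```yaml
# kube-prometheus-stack values (fragment, sketch only)
kubeProxy:
  enabled: false       # no separate kube-proxy target on k3s
kubeScheduler:
  enabled: false       # no separate kube-scheduler target on k3s
```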
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Added Alertmanager configuration to:
- Route ArgoCD application alerts to dedicated 'argocd-alerts' receiver
- Group ArgoCD alerts by alertname, name (app name), and severity
- Use a shorter group_wait for ArgoCD alerts (10s vs. the 30s default)
- Repeat ArgoCD alerts every 6 hours
- Suppress Watchdog test alerts
- Configure inhibit rules to prevent alert spam
Alerts are visible in:
- Prometheus UI: http://localhost:9090/alerts
- Alertmanager UI: http://localhost:9093
- Grafana dashboard: ArgoCD Applications - Health & Sync Status
This routes critical application alerts to a dedicated receiver and
keeps them visible in the monitoring UIs listed above.
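The routing described above could be sketched as follows. The
ArgoCD matcher, the fallback and "null" receiver names, and the
inhibit rule are assumptions for illustration. Only the group_by
labels, the 10s group_wait, the 6h repeat interval, and the
'argocd-alerts' receiver name come from this commit.

```yaml
# Alertmanager configuration (fragment, sketch only)
route:
  receiver: default                  # hypothetical fallback receiver
  group_wait: 30s
  routes:
    - receiver: "null"               # Watchdog alerts are dropped here
      matchers:
        - alertname = "Watchdog"
    - receiver: argocd-alerts
      matchers:
        - alertname =~ "ArgoCD.*"    # hypothetical alert naming scheme
      group_by: [alertname, name, severity]
      group_wait: 10s
      repeat_interval: 6h
inhibit_rules:
  # Hypothetical example: a critical alert mutes its warning twin.
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: [alertname, name]
receivers:
  - name: default
  - name: "null"
  - name: argocd-alerts
```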
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
- Create subdirectories: monitoring/, services/, infra/, test/
- Move 6 monitoring apps to monitoring/
- Move 13 service apps to services/
- Move 1 infra app to infra/
- Move 1 test app to test/
- Add README.md documenting the structure and usage
This organization:
- Makes it easier to understand which apps belong to which namespace
- Allows applying apps by namespace: kubectl apply -f argocd-apps/monitoring/
- Supports namespace-scoped app-of-apps patterns
- Provides better clarity when browsing the repository
All 21 applications remain functional and validated with kubectl --dry-run.
|