path: root/f3s/argocd-apps/monitoring/prometheus.yaml
Age | Commit message | Author
2026-05-16 | f3s/monitoring: disable grafana, loki, tempo; reduce alloy to no-op | Paul Buetow

Grafana's SQLite-on-NFS persistence is unreliable across restarts (the new pod can't reacquire a clean exclusive lock after any NFS bounce), and with Loki + Tempo also gone there's nothing left for it to visualize. Keeping Prometheus alone for metrics + alerting.

Changes:
- prometheus.yaml: add grafana.enabled=false in the kube-prometheus-stack values so the subchart no longer renders the grafana deployment/pvc.
- loki.yaml, tempo.yaml, grafana-ingress.yaml: renamed to .disabled (same pattern as commit 03a18c6) so 'kubectl apply -f argocd-apps/' stops re-creating them; the cluster Applications were also deleted, which cascade-removes the helm resources via the resources-finalizer.
- alloy.yaml: drop the loki.write and otelcol.* blocks (no destinations to ship to). DaemonSet stays deployed with a minimal 'logging' block so the chart can be re-enabled by restoring the blocks here.

Prometheus TSDB was also wiped (corrupted zero-byte WAL segments from the same NFS blip that took grafana down); this was done separately and is not part of this commit.
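For illustration, the grafana.enabled=false change described above would look roughly like this in the kube-prometheus-stack values. This is a sketch only: the log does not show whether the values are inline in the ArgoCD Application or kept in a separate values file, so the placement is an assumption.

```yaml
# Sketch: top-level kube-prometheus-stack values.
# Where this block lives inside prometheus.yaml is assumed, not shown in the log.
grafana:
  enabled: false   # the Grafana subchart is no longer rendered (no deployment, no PVC)
```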
2026-05-10 | nfs-monitor: fix node_exporter textfile_collector Helm chart key | Paul Buetow

Use extraHostVolumeMounts (prometheus-node-exporter sub-chart key for host path mounts) instead of extraVolumes/extraVolumeMounts, which are for general volumes. This correctly wires /var/lib/node_exporter/textfile_collector into the container so the textfile arg takes effect.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
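A minimal sketch of what this fix might look like in the kube-prometheus-stack values, assuming the host directory is mounted at the same path inside the container. The volume name and the readOnly setting are hypothetical; only the chart key and the host path come from the commit message.

```yaml
# Sketch: values for the prometheus-node-exporter sub-chart.
prometheus-node-exporter:
  extraArgs:
    # Tell node_exporter where to read *.prom files from (in-container path, assumed).
    - --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
  extraHostVolumeMounts:
    - name: textfile-collector   # hypothetical volume name
      hostPath: /var/lib/node_exporter/textfile_collector
      mountPath: /var/lib/node_exporter/textfile_collector
      readOnly: true             # assumed; the exporter only needs to read the files
```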
2026-05-10 | nfs-monitor: add Prometheus alerts for NFS auto-repair failures | Paul Buetow

- check-nfs-mount.sh: write nfs_mount_monitor_consecutive_failures gauge to /var/lib/node_exporter/textfile_collector/nfs_mount_monitor.prom on every run (via write_textfile_metric helper, called from write_fail_count and directly on healthy runs); atomic tmp+mv write prevents partial reads
- Rexfile: create /var/lib/node_exporter/textfile_collector dir on r-nodes
- prometheus.yaml (ArgoCD app): enable textfile_collector in node_exporter DaemonSet via extraArgs/extraVolumes/extraVolumeMounts; mount host path /var/lib/node_exporter/textfile_collector into container
- persistence-values.yaml: sync node_exporter textfile_collector config
- nfs-mount-monitor-alerts.yaml: PrometheusRule with two alerts:
    NfsMountAutoRepairWarning (>= 3 consecutive failures, severity: warning)
    NfsMountAutoRepairCritical (>= 5 consecutive failures, severity: critical)
  wired into new 'nfs-alerts' Alertmanager receiver with 30m repeat_interval

Tested: rex deploy succeeded, .prom files present on r0/r1/r2, timer clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
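The two alerts listed above could be expressed as a PrometheusRule along these lines. This is a sketch, not the contents of nfs-mount-monitor-alerts.yaml: the alert names, metric name, thresholds and severities come from the commit message, while the metadata, the release label and the group name are assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nfs-mount-monitor-alerts   # assumed from the file name
  namespace: monitoring            # assumed
  labels:
    release: prometheus            # assumed; must match the operator's rule selector
spec:
  groups:
    - name: nfs-mount-monitor      # hypothetical group name
      rules:
        - alert: NfsMountAutoRepairWarning
          expr: nfs_mount_monitor_consecutive_failures >= 3
          labels:
            severity: warning
        - alert: NfsMountAutoRepairCritical
          expr: nfs_mount_monitor_consecutive_failures >= 5
          labels:
            severity: critical
```

With these expressions the warning alert keeps firing once the critical threshold is reached, so an inhibit rule or an upper bound on the warning expression may be needed if only one notification is wanted.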
2026-04-08 | feat(f3s): deploy Trivy Operator for image CVE scanning (task h) | Paul Buetow

- ArgoCD app: aquasecurity/trivy-operator in monitoring with ServiceMonitor
- PrometheusRule for Critical/High trivy_image_vulnerabilities alerts
- Alertmanager route/receiver for component=trivy (UI; webhook TBD)

Made-with: Cursor
2026-01-10 | Migrate all ArgoCD applications from SSH to HTTP git URLs | Paul Buetow

Changes all application manifests to use HTTP git backend instead of SSH:
- From: ssh://git@git-server.cicd.svc.cluster.local/repos/repos/conf.git
- To: http://git-server.cicd.svc.cluster.local/conf.git

Benefits:
- No SSH agent or key management required
- No issues with changing SSH host keys on pod restarts
- Simpler ArgoCD configuration
- HTTP git-http-backend now fully functional

Updated applications:
- monitoring: prometheus, grafana-ingress, pushgateway (3)
- services: anki-sync-server, audiobookshelf, filebrowser, immich, keybr, kobo-sync-server, miniflux, opodsync, radicale, syncthing, tracing-demo, wallabag, webdav (13)
- infra: registry (1)
- test: example-apache-volume-claim (1)

Total: 18 applications migrated to HTTP
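As a sketch of what the migration changes in each manifest, an ArgoCD Application pointing at the HTTP backend might look like the following. Only the two repository URLs come from the commit message; the application name, project, revision, path and destination are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app        # hypothetical application name
  namespace: argocd        # assumed ArgoCD namespace
spec:
  project: default         # assumed
  source:
    # before: ssh://git@git-server.cicd.svc.cluster.local/repos/repos/conf.git
    repoURL: http://git-server.cicd.svc.cluster.local/conf.git
    targetRevision: HEAD   # assumed
    path: path/to/app      # hypothetical path inside conf.git
  destination:
    server: https://kubernetes.default.svc
    namespace: services    # assumed target namespace
```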
2026-01-09 | Migrate all applications from Codeberg to self-hosted git | Paul Buetow

Updated 17 application manifests to use internal git-server:
- Monitoring: grafana-ingress, prometheus, pushgateway
- Services: anki-sync-server, audiobookshelf, filebrowser, immich, keybr, kobo-sync-server, miniflux, opodsync, radicale, syncthing, tracing-demo, wallabag, webdav
- Infra: registry

All applications now fetch from: ssh://git@git-server.cicd.svc.cluster.local/repos/repos/conf.git

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-08 | prune | Paul Buetow
2026-01-08 | Fix: disable kubeScheduler rules entirely | Paul Buetow
2026-01-08 | Disable KubeProxyDown and KubeSchedulerDown alerts for k3s | Paul Buetow
2026-01-08 | Disable kube-proxy and kube-scheduler monitoring for k3s | Paul Buetow

K3s embeds kube-proxy and kube-scheduler functionality into the main k3s server process, unlike standard Kubernetes where they run as separate components.

This change disables monitoring for these components to prevent false-positive critical alerts:
- KubeProxyDown
- KubeSchedulerDown

These alerts were firing because kube-prometheus-stack expects standard Kubernetes architecture with separate kube-proxy and kube-scheduler pods/processes.

Cluster info:
- Running k3s v1.32.6+k3s1
- 3 control-plane nodes (r0, r1, r2)
- Components embedded in k3s binary

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
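Taken together, the 2026-01-08 kube-proxy and kube-scheduler commits probably amount to values along these lines. This is a sketch based on the upstream kube-prometheus-stack value names; the keys actually set in prometheus.yaml are not shown in this log.

```yaml
# Sketch: kube-prometheus-stack values for a k3s cluster (keys assumed from the upstream chart).
kubeProxy:
  enabled: false                  # no separate kube-proxy target to scrape on k3s
kubeScheduler:
  enabled: false                  # no separate kube-scheduler target to scrape on k3s
defaultRules:
  rules:
    kubeProxy: false              # drops the KubeProxyDown rule group
    kubeSchedulerAlerting: false  # drops KubeSchedulerDown
    kubeSchedulerRecording: false # drops the scheduler recording rules
```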
2026-01-08 | Configure Alertmanager routing for ArgoCD application alerts | Paul Buetow

Added Alertmanager configuration to:
- Route ArgoCD application alerts to dedicated 'argocd-alerts' receiver
- Group ArgoCD alerts by alertname, name (app name), and severity
- Faster alert grouping for ArgoCD (10s wait vs 30s default)
- Repeat ArgoCD alerts every 6 hours
- Suppress Watchdog test alerts
- Configure inhibit rules to prevent alert spam

Alerts are visible in:
- Prometheus UI: http://localhost:9090/alerts
- Alertmanager UI: http://localhost:9093
- Grafana dashboard: ArgoCD Applications - Health & Sync Status

This ensures critical application issues are properly routed and visible in the monitoring UI for immediate action.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
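The routing described above could be sketched in the kube-prometheus-stack values as follows. The receiver name, group_by labels, 10s/30s waits and 6h repeat interval come from the commit message; the matcher that selects ArgoCD alerts and the "null" receiver used to drop Watchdog are assumptions, and the inhibit rules are left out because the log does not describe them.

```yaml
alertmanager:
  config:
    route:
      receiver: "null"                  # assumed default receiver
      group_wait: 30s                   # the default the commit compares against
      routes:
        - receiver: argocd-alerts
          matchers:
            - alertname =~ "ArgoCD.*"   # hypothetical matcher for ArgoCD alerts
          group_by: [alertname, name, severity]
          group_wait: 10s
          repeat_interval: 6h
        - receiver: "null"              # suppress the always-firing Watchdog alert
          matchers:
            - alertname = "Watchdog"
    receivers:
      - name: "null"
      - name: argocd-alerts             # no notifier configured; alerts stay visible in the UI
```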
2026-01-07 | Reorganize argocd-apps by namespace for better structure | Paul Buetow

- Create subdirectories: monitoring/, services/, infra/, test/
- Move 6 monitoring apps to monitoring/
- Move 13 service apps to services/
- Move 1 infra app to infra/
- Move 1 test app to test/
- Add README.md documenting the structure and usage

This organization:
- Makes it easier to understand which apps belong to which namespace
- Allows applying apps by namespace: kubectl apply -f argocd-apps/monitoring/
- Supports namespace-scoped app-of-apps patterns
- Provides better clarity when browsing the repository

All 21 applications remain functional and validated with kubectl --dry-run.