path: root/f3s/prometheus/manifests
Age | Commit message | Author
26 hours | Add f3 (192.168.2.133) to Prometheus node_exporter scrape targets | Paul Buetow
f3 was previously excluded from FreeBSD host monitoring. Now that node_exporter is installed and running there, include it in the node-exporter job so CPU temperature and other host metrics are collected alongside f0/f1/f2. Also update the temperature alert comment to reflect that f3 is now covered.
2 days | goprecords: bump image to 0.5.2 | Paul Buetow
2026-05-10 | nfs-monitor: add Prometheus alerts for NFS auto-repair failures | Paul Buetow
- check-nfs-mount.sh: write nfs_mount_monitor_consecutive_failures gauge to /var/lib/node_exporter/textfile_collector/nfs_mount_monitor.prom on every run (via write_textfile_metric helper, called from write_fail_count and directly on healthy runs); atomic tmp+mv write prevents partial reads
- Rexfile: create /var/lib/node_exporter/textfile_collector dir on r-nodes
- prometheus.yaml (ArgoCD app): enable textfile_collector in node_exporter DaemonSet via extraArgs/extraVolumes/extraVolumeMounts; mount host path /var/lib/node_exporter/textfile_collector into container
- persistence-values.yaml: sync node_exporter textfile_collector config
- nfs-mount-monitor-alerts.yaml: PrometheusRule with two alerts:
    NfsMountAutoRepairWarning (>= 3 consecutive failures, severity: warning)
    NfsMountAutoRepairCritical (>= 5 consecutive failures, severity: critical)
  wired into new 'nfs-alerts' Alertmanager receiver with 30m repeat_interval
Tested: rex deploy succeeded, .prom files present on r0/r1/r2, timer clean.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
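The tmp+mv write described in the nfs-monitor commit can be sketched as follows. This is an illustration only: the real helper is a shell function inside check-nfs-mount.sh, which this log does not show, so the Python port, the HELP text, the temp-file naming, and the severity() helper are all assumptions. Only the metric name, the output file name, and the thresholds (3 and 5) come from the commit message.

```python
import os
import tempfile
from typing import Optional

# Metric and file names are taken from the commit message.
METRIC_NAME = "nfs_mount_monitor_consecutive_failures"
PROM_FILE = "nfs_mount_monitor.prom"


def write_textfile_metric(directory: str, failures: int) -> str:
    """Write the gauge in Prometheus text format without exposing a partial file."""
    body = (
        # HELP wording is an assumption, not taken from the real script.
        f"# HELP {METRIC_NAME} Consecutive failed NFS mount checks.\n"
        f"# TYPE {METRIC_NAME} gauge\n"
        f"{METRIC_NAME} {failures}\n"
    )
    target = os.path.join(directory, PROM_FILE)
    # The temp file lives in the same directory so the rename stays on one
    # filesystem; its name does not end in .prom, so the textfile collector
    # skips it while it is being written.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".nfs_mount_monitor.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(body)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file with mode 0600
        os.replace(tmp_path, target)  # atomic rename on POSIX, like `mv`
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return target


def severity(failures: int) -> Optional[str]:
    """Hypothetical helper mirroring the two alert thresholds from the commit."""
    if failures >= 5:
        return "critical"
    if failures >= 3:
        return "warning"
    return None
```

A reader of the .prom file sees either the previous complete file or the new complete file, never a half-written one, which is the property the commit message refers to as preventing partial reads.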
2026-04-16 | goprecords: restore for:1h after alert test | Paul Buetow
2026-04-16 | goprecords: temp set for:1m for alert test | Paul Buetow
2026-04-16 | goprecords: add Prometheus scraping and stale-host alert rule | Paul Buetow
- service.yaml: add 'metrics' port (8080) so kubernetes SD auto-discovers the /metrics endpoint alongside the existing http port (80)
- prometheus/manifests/goprecords-alerts.yaml: GoprecordsHostNotReporting fires (warning) when a non-excluded host last reported >5 months ago
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 | feat(f3s): deploy Trivy Operator for image CVE scanning (task h) | Paul Buetow
- ArgoCD app: aquasecurity/trivy-operator in monitoring with ServiceMonitor
- PrometheusRule for Critical/High trivy_image_vulnerabilities alerts
- Alertmanager route/receiver for component=trivy (UI; webhook TBD)
Made-with: Cursor
2026-04-08 | f3s/prometheus: add Garage admin scrape targets (task f) | Paul Buetow
Add job_name garage for 192.168.2.130-132:3903 with os=freebsd label. Mirror config in additional-scrape-configs-secret for kube apply/ArgoCD.
Made-with: Cursor
2026-01-19 | resolve merge conflict in argocd dashboard | Paul Buetow
Kept the version with the additional "Unhealthy Applications" panel which provides better visibility into problematic applications.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 | fix radicale scrape config causing TargetDown alert | Paul Buetow
Radicale does not expose Prometheus metrics. The previous config tried to scrape /.web/ which returns HTML, causing parse errors. Synced with additional-scrape-configs.yaml which properly drops radicale from scraping.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 | Merge branch 'master' of codeberg.org:snonux/conf | Paul Buetow
2026-01-18 | Add unhealthy applications panel to ArgoCD dashboard | Paul Buetow
Adds a dedicated table panel showing only applications with health_status != "Healthy" for quick identification of issues.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-15 | Update monitoring and gogios configuration | Paul Buetow
- Add node resources multi-select dashboard for Prometheus
- Update gogios cron schedule and add HTML status file output
- Update Prometheus scrape configs
- Add gogios documentation
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-08 | Add NodePort service for Prometheus on port 30090 | Paul Buetow
2026-01-08 | Add Grafana dashboard for ArgoCD applications monitoring | Paul Buetow
Created comprehensive Grafana dashboard showing:
- Total applications count
- Healthy vs unhealthy applications
- Out-of-sync status
- Detailed table with all applications and their status
- Health status timeline graph
- Sync operations rate
- Active ArgoCD-related alerts
Dashboard will auto-load in Grafana via ConfigMap with label grafana_dashboard='1'
Access at: https://grafana.f3s.buetow.org → Dashboards → ArgoCD Applications
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-08 | Add comprehensive ArgoCD application monitoring and alerts | Paul Buetow
This implements monitoring for ALL services deployed via ArgoCD by leveraging ArgoCD's native Prometheus metrics instead of scraping individual services.
Changes:
- Created ArgoCD application alerts for health and sync status monitoring
- Alert when applications are unhealthy (Degraded, Missing, Unknown, Suspended)
- Alert when applications are out of sync for >10 minutes
- Alert when sync operations are failing repeatedly
- Alert when applications are stuck in Progressing state
- Added recording rules for unhealthy/out-of-sync application counts
- Added radicale health monitoring via scrape config
- Added radicale to additional-scrape-configs for direct health checks
- Monitors radicale web interface availability
Benefits:
- Single monitoring solution for all 21 ArgoCD-managed applications
- Automatic monitoring for new applications added to ArgoCD
- Early detection of configuration drift and deployment issues
- Centralized alerting with actionable remediation steps
Monitored applications include: radicale, registry, alloy, grafana, loki, prometheus, tempo, anki-sync-server, audiobookshelf, filebrowser, immich, keybr, kobo-sync-server, miniflux, opodsync, and more.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-07 | Prepare Prometheus for ArgoCD GitOps migration | Paul Buetow
- Created manifests/ directory with all additional resources
- Added sync wave annotations for proper ordering
- Created PostSync hook for Grafana pod restart
- Converted additional-scrape-configs to Kubernetes Secret
- Organized: PVs (wave 0), Secrets/ConfigMaps (wave 1), PrometheusRules (wave 3), Dashboards (wave 4), Hook (wave 10)
- Created multi-source ArgoCD Application (upstream chart + manifests)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>