| Age | Commit message | Author |
|
f3 was previously excluded from FreeBSD host monitoring. Now that
node_exporter is installed and running there, include it in the
node-exporter job so CPU temperature and other host metrics are
collected alongside f0/f1/f2. Also update the temperature alert
comment to reflect that f3 is now covered.
|
|
- check-nfs-mount.sh: write nfs_mount_monitor_consecutive_failures gauge
  to /var/lib/node_exporter/textfile_collector/nfs_mount_monitor.prom on
  every run (via write_textfile_metric helper, called from write_fail_count
  and directly on healthy runs); atomic tmp+mv write prevents partial reads
- Rexfile: create /var/lib/node_exporter/textfile_collector dir on r-nodes
- prometheus.yaml (ArgoCD app): enable textfile_collector in node_exporter
  DaemonSet via extraArgs/extraVolumes/extraVolumeMounts; mount host path
  /var/lib/node_exporter/textfile_collector into container
- persistence-values.yaml: sync node_exporter textfile_collector config
- nfs-mount-monitor-alerts.yaml: PrometheusRule with two alerts:
  - NfsMountAutoRepairWarning (>= 3 consecutive failures, severity: warning)
  - NfsMountAutoRepairCritical (>= 5 consecutive failures, severity: critical)
  wired into new 'nfs-alerts' Alertmanager receiver with 30m repeat_interval
Tested: rex deploy succeeded, .prom files present on r0/r1/r2, timer clean.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
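
The atomic tmp+mv write mentioned above can be illustrated with a minimal Python sketch. The real helper lives in check-nfs-mount.sh and is a shell function; this is only an illustration of the pattern, not the script's code. The metric name and file name come from the commit message; the HELP text, the temp-file naming, and the 0644 mode are assumptions. The temp file is created in the target directory so the rename stays on one filesystem, and it does not end in .prom, so the textfile collector ignores it until the rename.

```python
import os
import tempfile

# Gauge name and file name as given in the commit message.
METRIC = "nfs_mount_monitor_consecutive_failures"
FILENAME = "nfs_mount_monitor.prom"


def write_textfile_metric(directory: str, failures: int) -> str:
    """Write the gauge to <directory>/nfs_mount_monitor.prom via temp file + rename."""
    target = os.path.join(directory, FILENAME)
    body = (
        # The HELP text is an assumption, not taken from the real script.
        f"# HELP {METRIC} Consecutive failed NFS mount checks.\n"
        f"# TYPE {METRIC} gauge\n"
        f"{METRIC} {failures}\n"
    )
    # Create the temp file next to the target so os.replace() is a rename
    # on the same filesystem, which readers see as all-or-nothing.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".nfs_mount_monitor.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(body)
        # mkstemp creates the file with mode 0600; widen it so another
        # user (assumed: the node_exporter user) can read the result.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, target)
    finally:
        # After a successful rename the temp path is gone; this only
        # cleans up when something failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
    return target
```

Calling write_textfile_metric("/var/lib/node_exporter/textfile_collector", 0) on a healthy run would reset the gauge to zero.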
|
|
- service.yaml: add 'metrics' port (8080) so kubernetes SD auto-discovers
  the /metrics endpoint alongside the existing http port (80)
- prometheus/manifests/goprecords-alerts.yaml: GoprecordsHostNotReporting
  fires (warning) when a non-excluded host last reported >5 months ago
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
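
The GoprecordsHostNotReporting condition can be sketched in Python as follows. The actual rule is a PrometheusRule in goprecords-alerts.yaml and is not shown in this log, so everything here is illustrative: the function name, the host names in the usage note, and the approximation of "5 months" as 150 days are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "5 months" approximated as 150 days; the real rule may differ.
STALE_AFTER = timedelta(days=150)


def hosts_not_reporting(last_seen, excluded, now):
    """Return sorted names of non-excluded hosts whose last report is older than STALE_AFTER."""
    return sorted(
        host
        for host, seen in last_seen.items()
        if host not in excluded and now - seen > STALE_AFTER
    )
```

For example, with hypothetical hosts "alpha" (seen 10 days ago), "beta" (200 days ago) and "gamma" (300 days ago, but excluded), only "beta" would be flagged.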
|
|
- ArgoCD app: aquasecurity/trivy-operator in monitoring with ServiceMonitor
- PrometheusRule for Critical/High trivy_image_vulnerabilities alerts
- Alertmanager route/receiver for component=trivy (UI; webhook TBD)
Made-with: Cursor
|
|
Add job_name garage for 192.168.2.130-132:3903 with os=freebsd label.
Mirror config in additional-scrape-configs-secret for kube apply/ArgoCD.
Made-with: Cursor
|
|
Kept the version with the additional "Unhealthy Applications" panel
which provides better visibility into problematic applications.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
|
Radicale does not expose Prometheus metrics. The previous config tried
to scrape /.web/, which returns HTML and caused parse errors. Synced with
additional-scrape-configs.yaml, which correctly drops radicale from scraping.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
|
Adds a dedicated table panel showing only applications with
health_status != "Healthy" for quick identification of issues.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
|
- Add node resources multi-select dashboard for Prometheus
- Update gogios cron schedule and add HTML status file output
- Update Prometheus scrape configs
- Add gogios documentation
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Created comprehensive Grafana dashboard showing:
- Total applications count
- Healthy vs unhealthy applications
- Out-of-sync status
- Detailed table with all applications and their status
- Health status timeline graph
- Sync operations rate
- Active ArgoCD-related alerts
Dashboard will auto-load in Grafana via ConfigMap with label grafana_dashboard='1'
Access at: https://grafana.f3s.buetow.org → Dashboards → ArgoCD Applications
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
This implements monitoring for ALL services deployed via ArgoCD by leveraging ArgoCD's native Prometheus metrics instead of scraping individual services.
Changes:
- Created ArgoCD application alerts for health and sync status monitoring
  - Alert when applications are unhealthy (Degraded, Missing, Unknown, Suspended)
  - Alert when applications are out of sync for >10 minutes
  - Alert when sync operations are failing repeatedly
  - Alert when applications are stuck in Progressing state
- Added recording rules for unhealthy/out-of-sync application counts
- Added radicale health monitoring via scrape config
  - Added radicale to additional-scrape-configs for direct health checks
  - Monitors radicale web interface availability
Benefits:
- Single monitoring solution for all 21 ArgoCD-managed applications
- Automatic monitoring for new applications added to ArgoCD
- Early detection of configuration drift and deployment issues
- Centralized alerting with actionable remediation steps
Monitored applications include: radicale, registry, alloy, grafana, loki,
prometheus, tempo, anki-sync-server, audiobookshelf, filebrowser, immich,
keybr, kobo-sync-server, miniflux, opodsync, and more.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
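
The alert conditions listed above can be summarised in a small Python sketch. This is not the PrometheusRule itself, which evaluates ArgoCD's exported metrics in PromQL; it only restates the decision logic. The unhealthy states and the 10-minute out-of-sync window come from the commit message. The 15-minute Progressing threshold, the AppState type, and the alert labels are assumptions, and the "sync operations failing repeatedly" alert is left out because the message gives no threshold for it.

```python
from dataclasses import dataclass

# Health states the commit message lists as unhealthy.
UNHEALTHY = {"Degraded", "Missing", "Unknown", "Suspended"}
OUT_OF_SYNC_MINUTES = 10   # from the commit message
PROGRESSING_MINUTES = 15   # assumption: the commit message states no threshold


@dataclass
class AppState:
    """Hypothetical snapshot of one ArgoCD application."""
    name: str
    health_status: str      # e.g. "Healthy", "Progressing", "Degraded"
    sync_status: str        # e.g. "Synced", "OutOfSync"
    minutes_in_health: int  # time spent in the current health status
    minutes_in_sync: int    # time spent in the current sync status


def alerts_for(app: AppState) -> list:
    """Return the (hypothetical) alert labels that would apply to this application."""
    alerts = []
    if app.health_status in UNHEALTHY:
        alerts.append("unhealthy")
    if app.health_status == "Progressing" and app.minutes_in_health > PROGRESSING_MINUTES:
        alerts.append("stuck-progressing")
    if app.sync_status == "OutOfSync" and app.minutes_in_sync > OUT_OF_SYNC_MINUTES:
        alerts.append("out-of-sync")
    return alerts
```

Because the conditions depend only on per-application status, any application added to ArgoCD later is covered without a dedicated rule, which is the benefit the message describes.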
|
|
- Created manifests/ directory with all additional resources
- Added sync wave annotations for proper ordering
- Created PostSync hook for Grafana pod restart
- Converted additional-scrape-configs to Kubernetes Secret
- Organized: PVs (wave 0), Secrets/ConfigMaps (wave 1), PrometheusRules (wave 3), Dashboards (wave 4), Hook (wave 10)
- Created multi-source ArgoCD Application (upstream chart + manifests)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
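
The effect of the sync wave annotations can be sketched in Python. Argo CD applies resources in ascending order of their argocd.argoproj.io/sync-wave annotation; the wave numbers below are the ones listed in the commit message, while the kind labels ("Dashboard", "PostSyncHook") are shorthand rather than real Kubernetes kinds, and the alphabetical order within a wave is only for deterministic output.

```python
# Wave numbers as listed in the commit message; kind labels are shorthand.
SYNC_WAVES = {
    "PersistentVolume": 0,
    "Secret": 1,
    "ConfigMap": 1,
    "PrometheusRule": 3,
    "Dashboard": 4,
    "PostSyncHook": 10,
}


def apply_order(kinds):
    """Group kinds by sync wave, lowest wave first, sorted by name within a wave."""
    waves = {}
    for kind in kinds:
        waves.setdefault(SYNC_WAVES[kind], []).append(kind)
    return [(wave, sorted(members)) for wave, members in sorted(waves.items())]
```

With this ordering the PVs exist before the Secrets and ConfigMaps that follow, and the Grafana restart hook runs last.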
|