| Age | Commit message | Author |
|
Amp-Thread-ID: https://ampcode.com/threads/T-019b9eec-b607-7271-9b75-f05255a60742
Co-authored-by: Amp <amp@ampcode.com>
|
|
K3s embeds kube-proxy and kube-scheduler functionality into the main
k3s server process, unlike standard Kubernetes where they run as
separate components.
This change disables monitoring for these components to prevent
false-positive critical alerts:
- KubeProxyDown
- KubeSchedulerDown
These alerts were firing because kube-prometheus-stack expects
standard Kubernetes architecture with separate kube-proxy and
kube-scheduler pods/processes.
Cluster info:
- Running k3s v1.32.6+k3s1
- 3 control-plane nodes (r0, r1, r2)
- Components embedded in k3s binary
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Added enhanced port-forward targets with helpful UI information:
- 'just alerts' - Quick access to Prometheus alerts view
- 'just alertmanager' - Quick access to Alertmanager UI
- Enhanced output showing all relevant URLs
All port-forward commands now display:
- Access URLs with direct links to specific views
- Clear instructions for stopping (Ctrl+C)
Usage:
cd prometheus/
just alerts # Opens Prometheus alerts (port 9090)
just alertmanager # Opens Alertmanager (port 9093)
just port-forward-prometheus [port]
just port-forward-grafana [port]
After running, access:
- Prometheus Alerts: http://localhost:9090/alerts
- Alertmanager: http://localhost:9093
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Added Alertmanager configuration to:
- Route ArgoCD application alerts to dedicated 'argocd-alerts' receiver
- Group ArgoCD alerts by alertname, name (app name), and severity
- Group ArgoCD alerts faster (10s wait vs. the 30s default)
- Repeat ArgoCD alerts every 6 hours
- Suppress Watchdog test alerts
- Configure inhibit rules to prevent alert spam
Alerts are visible in:
- Prometheus UI: http://localhost:9090/alerts
- Alertmanager UI: http://localhost:9093
- Grafana dashboard: ArgoCD Applications - Health & Sync Status
This ensures critical application issues are properly routed and visible
in the monitoring UI for immediate action.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Created comprehensive Grafana dashboard showing:
- Total applications count
- Healthy vs unhealthy applications
- Out-of-sync status
- Detailed table with all applications and their status
- Health status timeline graph
- Sync operations rate
- Active ArgoCD-related alerts
The dashboard is auto-loaded by Grafana from a ConfigMap labeled grafana_dashboard='1'.
Access at: https://grafana.f3s.buetow.org → Dashboards → ArgoCD Applications
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
This implements monitoring for ALL services deployed via ArgoCD by leveraging ArgoCD's native Prometheus metrics instead of scraping individual services.
Changes:
- Created ArgoCD application alerts for health and sync status monitoring
- Alert when applications are unhealthy (Degraded, Missing, Unknown, Suspended)
- Alert when applications are out of sync for >10 minutes
- Alert when sync operations are failing repeatedly
- Alert when applications are stuck in Progressing state
- Added recording rules for unhealthy/out-of-sync application counts
- Added radicale health monitoring via scrape config
- Added radicale to additional-scrape-configs for direct health checks
- Monitors radicale web interface availability
Benefits:
- Single monitoring solution for all 21 ArgoCD-managed applications
- Automatic monitoring for new applications added to ArgoCD
- Early detection of configuration drift and deployment issues
- Centralized alerting with actionable remediation steps
Monitored applications include: radicale, registry, alloy, grafana, loki,
prometheus, tempo, anki-sync-server, audiobookshelf, filebrowser, immich,
keybr, kobo-sync-server, miniflux, opodsync, and more.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
- Create subdirectories: monitoring/, services/, infra/, test/
- Move 6 monitoring apps to monitoring/
- Move 13 service apps to services/
- Move 1 infra app to infra/
- Move 1 test app to test/
- Add README.md documenting the structure and usage
This organization:
- Makes it easier to understand which apps belong to which namespace
- Allows applying apps by namespace: kubectl apply -f argocd-apps/monitoring/
- Supports namespace-scoped app-of-apps patterns
- Provides better clarity when browsing the repository
All 21 applications remain functional and validated with kubectl --dry-run.
|
|
- Created ArgoCD Application for grafana-ingress
- Simple custom Helm chart exposing Grafana via Traefik
- Updated Justfile with ArgoCD commands
- Status: Synced and Healthy
- Ingress working at https://grafana.f3s.buetow.org
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
- Successfully migrated kube-prometheus-stack to ArgoCD
- Multi-source Application: upstream chart + manifests directory
- PostSync hook automatically restarts Grafana to reload datasources
- All recording rules applied (FreeBSD, OpenBSD, ZFS)
- All dashboards provisioned
- Grafana datasources configured (Prometheus, Loki, Tempo, Alertmanager)
- Updated Justfile with ArgoCD commands
- Status: Synced and Healthy
- Grafana restarted successfully by PostSync hook
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
- Created manifests/ directory with all additional resources
- Added sync wave annotations for proper ordering
- Created PostSync hook for Grafana pod restart
- Converted additional-scrape-configs to Kubernetes Secret
- Organized: PVs (wave 0), Secrets/ConfigMaps (wave 1), PrometheusRules (wave 3), Dashboards (wave 4), Hook (wave 10)
- Created multi-source ArgoCD Application (upstream chart + manifests)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
- Created two ArgoCD Application manifests (loki and alloy)
- Updated Justfile with ArgoCD commands for both apps
- Loki: log aggregation (SingleBinary mode, 10Gi storage)
- Alloy: log collection DaemonSet + OTLP receiver for traces
- Both apps are Synced and Healthy
- Alloy forwards logs to Loki and traces to Tempo
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
- Created ArgoCD Application manifest for Tempo
- Updated Justfile with ArgoCD commands (sync, argocd-status, restart)
- Tested delete/re-deploy workflow
- Verified Tempo is Synced and Healthy
- OTLP receivers enabled on ports 4317 (gRPC) and 4318 (HTTP)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
pushgateway, immich
Apps migrated in this commit:
- example-apache-volume-claim (test namespace, 2 replicas, 1 PVC)
- registry (infra namespace, Docker registry, 1 PVC)
- pushgateway (monitoring namespace, Prometheus metrics)
- immich (multi-component: server, postgres, valkey, ML)
Also:
- Deleted unused example-apache directory
- Updated all Justfiles with ArgoCD commands
- All apps synced and healthy
Progress: 16/22 active apps (73%)
Remaining apps (all in monitoring namespace):
- prometheus (kube-prometheus-stack)
- loki (umbrella chart)
- tempo
- grafana-ingress
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Apps migrated in this commit:
- anki-sync-server (custom images, 1 PVC)
- syncthing (file sync, 2 PVCs)
- audiobookshelf (3 PVCs)
- radicale (CalDAV/CardDAV)
- opodsync (podcast sync, 2-container pod)
- kobo-sync-server (eReader sync)
- filebrowser (3 PVCs)
- webdav (WebDAV server)
All apps:
- Created ArgoCD Application manifests
- Updated Justfiles with ArgoCD commands
- All synced successfully and healthy
- Zero downtime migrations
Also includes:
- Updated migration progress tracker (12/23 apps, 52%)
- Deleted freshrss directory (app no longer needed)
Progress: 12/23 apps (52%)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
- Added ArgoCD Application manifests for wallabag and keybr
- Updated Justfiles to use ArgoCD commands (sync, argocd-status, restart)
- Removed Helm commands (install, upgrade, delete)
- Tested delete/re-deploy workflow for both apps
- All resources sync successfully, zero downtime
Apps migrated: 4/23 (17%)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Testing ArgoCD auto-sync functionality by scaling the tracing-demo
frontend deployment from 1 to 2 replicas. This validates the complete
GitOps workflow: commit → push → auto-sync → deployment.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Document how gogios.json.tpl handles server-specific vs service domain checks:
- Dedicated bare hostname checks for server FQDNs
- Service domain checks with all prefix variants
- Why server hostnames must be skipped in @acme_hosts loop
- Impact of not skipping: 12 false critical alerts
Explains the same skip pattern used across httpd.conf.tpl, relayd.conf.tpl,
and gogios.json.tpl for consistent handling of server-specific hostnames.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Skip blowfish.buetow.org and fishfinger.buetow.org in the @acme_hosts loop
that creates monitoring checks for www and standby prefix variants.
These server-specific hostnames:
- Don't have DNS records for www/standby prefixes
- Already have dedicated bare hostname checks (lines 29-46)
- Should only be monitored without prefix variants
This prevents 12 false critical alerts for non-existent:
- www.blowfish.buetow.org
- standby.blowfish.buetow.org
- www.fishfinger.buetow.org
- standby.fishfinger.buetow.org
Follows same pattern as httpd.conf.tpl and relayd.conf.tpl where server
hostnames are skipped in shared configuration loops.
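The skip pattern described above can be sketched in Go. This is only an illustration of the logic, not the template code itself (the real check lives in gogios.json.tpl); the function name, the `example.org` service domain, and the exact prefix strings are assumptions made for the example.

```go
package main

import "fmt"

// checkHosts expands every host into its prefix variants, but skips
// server-specific hostnames, which have no www/standby DNS records and
// are monitored elsewhere by dedicated bare hostname checks.
func checkHosts(hosts, prefixes []string, skip map[string]bool) []string {
	var out []string
	for _, host := range hosts {
		if skip[host] {
			continue // server hostname: no prefix variants exist
		}
		for _, prefix := range prefixes {
			out = append(out, prefix+host)
		}
	}
	return out
}

func main() {
	// example.org is a hypothetical service domain.
	hosts := []string{"example.org", "blowfish.buetow.org", "fishfinger.buetow.org"}
	prefixes := []string{"www.", "standby."}
	skip := map[string]bool{
		"blowfish.buetow.org":   true,
		"fishfinger.buetow.org": true,
	}
	fmt.Println(checkHosts(hosts, prefixes, skip)) // [www.example.org standby.example.org]
}
```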
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Removed troubleshooting narrative and restructured to document the
system architecture, configuration patterns, and operational knowledge.
Now covers:
- Architecture overview and component responsibilities
- Configuration array roles (@acme_hosts, @f3s_hosts, @prefixes)
- Template processing and variable scoping
- Routing configuration logic
- TLS certificate management in multi-server deployments
- Server block patterns and duplicate prevention
- Server-specific vs. shared host configuration
- Deployment process and testing procedures
- Monitoring system (Gogios) behavior
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Documents the investigation process, root cause analysis, and key learnings
from debugging the blowfish/fishfinger 404 errors. Includes:
- Architecture overview of relayd + httpd routing
- Template variable scoping and processing
- Common pitfalls with server-specific vs shared configuration
- TLS certificate management in multi-server deployments
- Debugging methodology and verification approaches
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Added blowfish.buetow.org and fishfinger.buetow.org to @acme_hosts array
to ensure proper routing through relayd to localhost instead of falling
through to f3s cluster backends.
Changes:
- Rexfile: Add blowfish.buetow.org and fishfinger.buetow.org to @acme_hosts
- httpd.conf.tpl: Skip current server hostname in @acme_hosts loop to avoid
duplicate server blocks (already handled by dedicated "Current server's FQDN" block)
- relayd.conf.tpl: Skip both server hostnames in TLS keypair loop since each
server only has its own certificate (not the other server's cert)
This ensures relayd routes these hostnames to localhost:8080 where httpd
serves content from /htdocs/buetow.org/self including index.txt health checks.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
- Add http websockets directive to relayd.conf.tpl to allow WebSocket upgrade connections
- Fix "Socket failed to connect" error in audiobookshelf web interface
- Also add immich helm chart configuration
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Updated Prometheus documentation to reflect current configuration:
- Added web.enable-admin-api flag documentation
- Updated outOfOrderTimeWindow from 720h to 744h (31 days)
- Added Data Deletion section with cleanup script usage
- Documented manual deletion via Admin API endpoints
Provides complete guide for data cleanup after benchmark testing.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Added web.enable-admin-api flag to allow selective deletion of time series data
via the /api/v1/admin/tsdb endpoints. This enables cleanup of benchmark data
using the delete_series and clean_tombstones APIs.
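As a rough sketch, the two endpoint URLs could be built in Go as shown here. The base URL `http://localhost:9090` and the `{job="benchmark"}` matcher are assumptions for illustration; both endpoints are called with POST (or PUT), and the snippet only constructs the URLs without sending a request.

```go
package main

import (
	"fmt"
	"net/url"
)

// deleteSeriesURL builds the delete_series URL for one series matcher.
// The matcher is passed as a repeated "match[]" query parameter.
func deleteSeriesURL(base, matcher string) string {
	q := url.Values{}
	q.Add("match[]", matcher)
	return base + "/api/v1/admin/tsdb/delete_series?" + q.Encode()
}

// cleanTombstonesURL builds the URL that removes deleted data from disk.
func cleanTombstonesURL(base string) string {
	return base + "/api/v1/admin/tsdb/clean_tombstones"
}

func main() {
	base := "http://localhost:9090" // assumed port-forwarded Prometheus
	fmt.Println(deleteSeriesURL(base, `{job="benchmark"}`)) // hypothetical matcher
	fmt.Println(cleanTombstonesURL(base))
}
```

Deleting series only marks the data with tombstones; calling clean_tombstones afterwards is what frees the disk space.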
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Changes:
- outOfOrderTimeWindow: 720h → 744h (30 days → 31 days)
Rationale:
Provides a 1-day buffer for 30-day backfill operations. This avoids
edge-case rejections where the oldest samples exceed the limit because
of timing variations between data generation and ingestion.
With this configuration:
- 30-day benchmarks achieve 99.85% success rate (vs 50% with 720h)
- Only 4/2592 batches rejected (first few batches slightly over 30d)
- Allows safe backfilling of up to 30 days of historic data
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
This commit configures Prometheus to accept historic data via the Remote
Write API, enabling backfilling of test metrics for development and
troubleshooting purposes.
Changes:
- Enable Remote Write receiver (--web.enable-remote-write-receiver)
- Enable out-of-order ingestion with 30-day window (720h)
- Enable exemplar-storage and otlp-write-receiver features
- Add Epimetheus dashboard ConfigMap for Grafana provisioning
- Remove old prometheus-pusher directory (moved to separate repo)
- Document configuration, use cases, and performance considerations
Configuration allows backfilling data up to 30 days in the past, supporting
tools like Epimetheus for generating synthetic historic metrics.
Performance note: This is optimized for ad-hoc troubleshooting, not
production use. Out-of-order ingestion increases memory usage, TSDB overhead,
and may impact query performance.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Updated persistence-values.yaml to enable the Remote Write receiver
using the correct flag for Prometheus 3.x:
- Changed from enableFeatures (not supported in 3.8.1) to additionalArgs
  with web.enable-remote-write-receiver
This allows Epimetheus to push historic data with preserved timestamps
via the Prometheus Remote Write API endpoint (/api/v1/write).
Applied via: just upgrade
🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
- Merged content from 10 separate .md files into README.md
- Removed: ANSWER.md, AUTO-MODE.md, DASHBOARD.md, HISTORIC.md, LIMITATIONS.md,
QUERY_EXAMPLES.md, QUICK-START.md, SETUP-COMPLETE.md, SUMMARY.md, USAGE.md
- README.md now includes:
* Quick start guide
* All operating modes (realtime, historic, backfill, auto)
* Data formats (CSV, JSON)
* Test metrics documentation
* Grafana dashboard setup
* Example queries and curl commands
* Time range limitations
* Troubleshooting guide
* Architecture diagram
* Best practices
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Renamed all test metrics with "prometheus_pusher_test_" prefix to clearly
indicate they are generated by the prometheus-pusher testing/demo functionality.
Metric renaming:
- app_requests_total → prometheus_pusher_test_requests_total
- app_active_connections → prometheus_pusher_test_active_connections
- app_temperature_celsius → prometheus_pusher_test_temperature_celsius
- app_request_duration_seconds → prometheus_pusher_test_request_duration_seconds
- app_jobs_processed_total → prometheus_pusher_test_jobs_processed_total
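The renaming rule is a plain prefix swap. A minimal Go sketch, assuming a hypothetical helper `renameMetric` (the actual change edited the metric names directly in generator.go and remotewrite.go):

```go
package main

import (
	"fmt"
	"strings"
)

const (
	oldPrefix = "app_"
	newPrefix = "prometheus_pusher_test_"
)

// renameMetric swaps the old "app_" prefix for the new test prefix and
// leaves names without the old prefix unchanged.
func renameMetric(name string) string {
	if strings.HasPrefix(name, oldPrefix) {
		return newPrefix + strings.TrimPrefix(name, oldPrefix)
	}
	return name
}

func main() {
	fmt.Println(renameMetric("app_requests_total")) // prometheus_pusher_test_requests_total
	fmt.Println(renameMetric("up"))                 // up
}
```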
Grafana Dashboard:
- Created comprehensive dashboard with 8 panels
- Request rate and total requests visualization
- Active connections gauge (0-100 with thresholds)
- Temperature gauge (0-50°C with thresholds)
- Request duration percentiles (p50, p90, p99)
- Average request duration stat
- Jobs processed by type (bar gauge)
- Jobs status breakdown table
- Auto-refresh every 10s, 15-minute default time range
Files added:
- grafana-dashboard.json: Dashboard definition
- deploy-dashboard.sh: Automated deployment script
- DASHBOARD.md: Complete dashboard documentation
Updated:
- internal/metrics/generator.go: Renamed metric names
- internal/ingester/remotewrite.go: Updated historic metric names
- internal/ingester/remotewrite_test.go: Updated test expectations
Tests updated and passing with 63.9% coverage maintained.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
This document demonstrates actual curl commands and their outputs for all
metric types ingested by prometheus-pusher:
- Counter metrics (app_requests_total)
- Gauge metrics (app_temperature_celsius, app_active_connections)
- Histogram metrics (app_request_duration_seconds with buckets, sum, count)
- Labeled counter metrics (app_jobs_processed_total with multiple label combinations)
Includes:
- Complete curl commands
- Actual JSON responses from Prometheus API
- Explanations of each metric type
- Additional query examples (filters, ranges, aggregations)
Verifies data ingestion works correctly with real query results.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Implemented unit tests across all internal packages to achieve
63.9% test coverage, exceeding the 60% target.
Test coverage by package:
- internal/config: 100.0% (config validation, constants)
- internal/metrics: 100.0% (Sample methods, Collectors, Simulate)
- internal/parser: 92.3% (CSV/JSON parsing, format detection)
- internal/ingester: 44.9% (auto routing, time series conversion)
New test files:
- internal/config/config_test.go: Config creation and constants
- internal/metrics/sample_test.go: Sample type methods (Age, IsRecent)
- internal/metrics/generator_test.go: Collectors and simulation
- internal/parser/csv_test.go: CSV parsing with various inputs
- internal/parser/json_test.go: JSON parsing and validation
- internal/parser/parser_test.go: Parser factory and format handling
- internal/ingester/auto_test.go: Auto mode routing logic
- internal/ingester/remotewrite_test.go: Time series conversion
- internal/ingester/pushgateway_test.go: Pushgateway ingester
Tests cover:
- Happy path and error cases
- Context cancellation support
- Edge cases (empty input, invalid formats)
- Label parsing and timestamp handling
- Metric type generation (counter, gauge, histogram)
- Table-driven tests for comprehensive coverage
All 50+ tests passing ✅
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Major refactoring to improve code organization and follow Go conventions:
- Moved main entry point to cmd/prometheus-pusher/main.go
- Organized code into internal packages (config, metrics, parser, ingester, version)
- Implemented proper dependency injection (no package-level variables)
- Added context.Context to all blocking operations
- Used value semantics where feasible (Sample, Config, Ingesters)
- Proper error wrapping with %w throughout
- All functions under 50 lines, focused and single-purpose
- Consistent ordering: constants, types, constructors, public, private
- Added -version flag to display version from internal/version package
Package structure:
- cmd/prometheus-pusher: Main entry point with flag parsing and mode routing
- internal/config: Configuration types and constants
- internal/version: Version constant (0.0.0)
- internal/metrics: Sample type and Collectors for metric generation
- internal/parser: CSV/JSON parsers with context support
- internal/ingester: Pushgateway, RemoteWrite, and Auto ingesters
All modes tested and working: realtime, historic, backfill, auto
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|
Unified all functionality into one binary instead of multiple variants.
## Changes
- Removed: prometheus-pusher-auto, prometheus-pusher-historic
- Single binary: prometheus-pusher (supports all modes)
- Updated all documentation to reference single binary
- Updated run.sh to use unified binary
## Usage
One binary, four modes:
```bash
# Realtime mode (default)
./prometheus-pusher -mode=realtime -continuous
# Historic mode (single datapoint)
./prometheus-pusher -mode=historic -hours-ago=24
# Backfill mode (range of datapoints)
./prometheus-pusher -mode=backfill -start-hours=48 -end-hours=0 -interval=1
# Auto mode (automatic timestamp detection)
./prometheus-pusher -mode=auto -file=data.csv
```
All features accessible from one unified tool!
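A minimal sketch of how a single binary might select among the four modes with Go's standard flag package. The function name `parseMode` is hypothetical, and only the -mode flag is modeled here; the other flags shown above (-file, -hours-ago, and so on) are left out.

```go
package main

import (
	"flag"
	"fmt"
)

// parseMode reads the -mode flag and validates it against the four
// supported modes. Realtime is the default, as in the usage above.
func parseMode(args []string) (string, error) {
	fs := flag.NewFlagSet("prometheus-pusher", flag.ContinueOnError)
	mode := fs.String("mode", "realtime", "operating mode: realtime, historic, backfill, or auto")
	if err := fs.Parse(args); err != nil {
		return "", fmt.Errorf("parse flags: %w", err)
	}
	switch *mode {
	case "realtime", "historic", "backfill", "auto":
		return *mode, nil
	default:
		return "", fmt.Errorf("unknown mode %q", *mode)
	}
}

func main() {
	mode, err := parseMode([]string{"-mode=auto"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("selected mode:", mode) // selected mode: auto
}
```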
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
|