From 0c451d4dc6509b71443645d97bf91dd9cd2e2773 Mon Sep 17 00:00:00 2001 From: Paul Buetow Date: Thu, 15 Jan 2026 21:45:31 +0200 Subject: cleanup --- f3s/ARGO-ROLLOUTS-SUMMARY.md | 248 -------------- f3s/README-ROLLOUTS.md | 226 ------------- f3s/ROLLOUTS-CHECKLIST.md | 222 ------------- f3s/ROLLOUTS-FILE-TREE.txt | 183 ----------- f3s/ROLLOUTS-SETUP.md | 373 --------------------- f3s/tracing-demo/ARGO-ROLLOUTS-SUMMARY.md | 248 ++++++++++++++ f3s/tracing-demo/README-ROLLOUTS.md | 226 +++++++++++++ f3s/tracing-demo/ROLLOUTS-CHECKLIST.md | 222 +++++++++++++ f3s/tracing-demo/ROLLOUTS-FILE-TREE.txt | 183 +++++++++++ f3s/tracing-demo/ROLLOUTS-SETUP.md | 373 +++++++++++++++++++++ f3s/wireguardroaming-plan.md | 528 ------------------------------ 11 files changed, 1252 insertions(+), 1780 deletions(-) delete mode 100644 f3s/ARGO-ROLLOUTS-SUMMARY.md delete mode 100644 f3s/README-ROLLOUTS.md delete mode 100644 f3s/ROLLOUTS-CHECKLIST.md delete mode 100644 f3s/ROLLOUTS-FILE-TREE.txt delete mode 100644 f3s/ROLLOUTS-SETUP.md create mode 100644 f3s/tracing-demo/ARGO-ROLLOUTS-SUMMARY.md create mode 100644 f3s/tracing-demo/README-ROLLOUTS.md create mode 100644 f3s/tracing-demo/ROLLOUTS-CHECKLIST.md create mode 100644 f3s/tracing-demo/ROLLOUTS-FILE-TREE.txt create mode 100644 f3s/tracing-demo/ROLLOUTS-SETUP.md delete mode 100644 f3s/wireguardroaming-plan.md diff --git a/f3s/ARGO-ROLLOUTS-SUMMARY.md b/f3s/ARGO-ROLLOUTS-SUMMARY.md deleted file mode 100644 index 80adc23..0000000 --- a/f3s/ARGO-ROLLOUTS-SUMMARY.md +++ /dev/null @@ -1,248 +0,0 @@ -# Argo Rollouts Implementation Summary - -## What Was Created - -### 1. 
Argo Rollouts Controller Installation -**Location**: `/home/paul/git/conf/f3s/argo-rollouts/` - -Files: -- `Justfile` - Installation automation -- `values.yaml` - Helm configuration -- `README.md` - Installation guide - -Deployment: -```bash -cd /home/paul/git/conf/f3s/argo-rollouts -just install -``` - -Also registered in ArgoCD: `/home/paul/git/conf/f3s/argocd-apps/cicd/argo-rollouts.yaml` - -### 2. Frontend Rollout Manifest -**Location**: `/home/paul/git/conf/f3s/tracing-demo/helm-chart/templates/frontend-rollout.yaml` - -**Replaces**: `frontend-deployment.yaml` (kept for reference) - -**Strategy**: Canary with 1-minute observation window -``` -Step 1: 33% traffic to new version (1 new pod, 3 old pods) -Step 2: Pause 1 minute (observation period) -Step 3: 100% traffic to new version (auto-promote) -``` - -**Why Frontend?** -- Has 2 replicas (good for canary demo) -- User-facing (can observe behavior easily) -- Generates traces (can monitor impact) -- Non-critical for cluster health - -### 3. Demo Documentation - -**`/home/paul/git/conf/f3s/tracing-demo/ROLLOUTS-DEMO.md`** -- Comprehensive walkthrough -- Real-time monitoring commands -- Troubleshooting guide -- Advanced patterns - -**`/home/paul/git/conf/f3s/ROLLOUTS-SETUP.md`** -- Quick setup instructions -- 5 demo scenarios (basic, manual, abort, prometheus, gitops) -- Expected output and timings -- Monitoring dashboard examples - -**`/home/paul/git/conf/f3s/tracing-demo/rollout-demo.sh`** -- Automated demo starter script -- Checks prerequisites -- Provides instructions - -### 4. 
Enhanced Justfile Commands -**Location**: `/home/paul/git/conf/f3s/tracing-demo/Justfile` - -New commands: -```bash -just rollout-watch # Watch progress in real-time -just rollout-status # Check current status -just rollout-info # Detailed information -just rollout-promote # Skip waiting, promote to 100% -just rollout-abort # Abort current rollout -just rollout-history # View past rollouts -just rollout-demo # Start demo script -``` - -### 5. Updated ArgoCD Application -**Location**: `/home/paul/git/conf/f3s/argocd-apps/services/tracing-demo.yaml` - -Added sync option: `RespectIgnoreDifferences=true` to gracefully handle migration from Deployment to Rollout. - -## Architecture - -``` -┌─────────────────────────────────────────┐ -│ Kubernetes Cluster │ -├─────────────────────────────────────────┤ -│ │ -│ ┌──────────────────┐ │ -│ │ ArgoCD (cicd) │ │ -│ └────────┬─────────┘ │ -│ │ │ -│ └──→ Git Repository │ -│ (conf.git) │ -│ │ -│ ┌──────────────────────────────────┐ │ -│ │ Argo Rollouts Controller (cicd) │ │ -│ │ - Manages Rollout resources │ │ -│ │ - Orchestrates canary │ │ -│ │ - Monitors replica sets │ │ -│ └──────────────────────────────────┘ │ -│ ▲ │ -│ │ watches │ -│ │ │ -│ ┌────────────────────────────────────┐ │ -│ │ tracing-demo-frontend Rollout │ │ -│ │ ┌──────────────┐ ┌──────────────┐│ │ -│ │ │ Stable RS │ │ Canary RS ││ │ -│ │ │ 3 replicas │ │ 1 replica ││ │ -│ │ └──────────────┘ └──────────────┘│ │ -│ │ │ │ -│ │ Endpoints: frontend-service │ │ -│ │ - Selects both RS (proportional) │ │ -│ │ - Routes traffic to 67%/33% │ │ -│ └────────────────────────────────────┘ │ -│ │ -│ ┌──────────────────┐ │ -│ │ Middleware │ ┌──────────────┐│ -│ │ Backend │ │ Deployment ││ -│ │ (unchanged) │ │ (unchanged) ││ -│ └──────────────────┘ └──────────────┘│ -│ │ -└─────────────────────────────────────────┘ - Monitoring (Prometheus/Grafana) -``` - -## Key Differences: Deployment vs Rollout - -| Aspect | Deployment | Rollout | -|--------|------------|---------| -| **Update 
Strategy** | RollingUpdate (all or nothing) | Canary, Blue-Green, A/B | -| **Traffic Split** | No built-in support | Native pod-level splitting | -| **Pause/Resume** | No | Yes (at canary steps) | -| **Automatic Rollback** | No (manual `rollout undo`) | Yes (if health checks fail) | -| **Visibility** | kubectl rollout status | kubectl argo rollouts get --watch | -| **Observability** | Basic pod counts | Detailed step information | - -## How It Works - -### Normal Deployment (Traditional) -``` -kubectl apply → All pods immediately scale up/down -Old pods: 2 → 0 -New pods: 0 → 2 -Users affected: ~5 seconds of traffic loss risk -``` - -### Canary Rollout (New) -``` -Git commit → ArgoCD detects → Argo Rollouts orchestrates - -Step 1 (50% traffic): - Stable: 2 pods → 1 pod (old version) - Canary: 0 pods → 1 pod (new version) - Users see: 50% old, 50% new for 0-2 minutes - -Step 2 (Pause): - Stable: 1 pod (old) - Canary: 1 pod (new) - Observe metrics, logs, error rates for 2 minutes - -Step 3 (100% traffic): - Stable: 1 → 0 pods (old version terminated) - Canary: 1 → 2 pods (new version scales up) - Users see: 100% new version - - Complete: Canary promoted to stable -``` - -## Demo Quick Start - -### 1. Install Everything -```bash -cd /home/paul/git/conf/f3s -# Sync with ArgoCD (auto or manual) -argocd app sync argo-rollouts -argocd app sync tracing-demo -``` - -### 2. Verify Setup -```bash -cd /home/paul/git/conf/f3s/tracing-demo -just rollout-status -# Should show: Rollout is healthy -``` - -### 3. Run Demo -```bash -# Terminal 1: Watch rollout -just rollout-watch - -# Terminal 2: Trigger rollout (modify git or patch) -kubectl patch rollout tracing-demo-frontend -n services \ - --type='json' \ - -p='[{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"registry.lan.buetow.org:30001/tracing-demo-frontend:latest"}]' -``` - -### 4. 
Observe -- See canary step progress in Terminal 1 -- Optional: `just load-test` to generate traffic during rollout -- After ~4 minutes: Rollout complete, 100% traffic to new version - -## Files Summary - -| Path | Purpose | -|------|---------| -| `argo-rollouts/Justfile` | Install/upgrade/check Argo Rollouts | -| `argo-rollouts/values.yaml` | Helm configuration for controller | -| `argo-rollouts/README.md` | Installation and basic usage | -| `tracing-demo/helm-chart/templates/frontend-rollout.yaml` | Canary rollout definition | -| `tracing-demo/Justfile` | Added `just rollout-*` commands | -| `tracing-demo/ROLLOUTS-DEMO.md` | Detailed walkthrough | -| `tracing-demo/rollout-demo.sh` | Demo starter script | -| `argocd-apps/cicd/argo-rollouts.yaml` | ArgoCD Application for controller | -| `argocd-apps/services/tracing-demo.yaml` | Updated to work with Rollout | -| `ROLLOUTS-SETUP.md` | Complete setup guide with scenarios | -| `ARGO-ROLLOUTS-SUMMARY.md` | This file | - -## Next Steps - -1. **Install controller**: `cd argo-rollouts && just install` -2. **Wait for ArgoCD sync** or manually sync `argo-rollouts` and `tracing-demo` apps -3. **Verify**: `just rollout-status` shows healthy -4. **Run demo**: `just rollout-watch` + trigger in another terminal -5. 
**Explore**: Try abort, promote, or different canary durations - -## Important Notes - -- **No service mesh required**: Uses native Kubernetes service-based routing -- **Traffic splitting**: Proportional to pod counts (1 old, 1 new = 50/50) -- **Auto-promotion**: After 2 minutes, canary automatically promotes to 100% -- **Graceful**: ArgoCD correctly handles transition from Deployment → Rollout -- **Reversible**: Can abort and keep old version running - -## Limitations & Future Work - -**Current (Basic Canary)**: -- Simple replica-based traffic splitting -- No header-based routing -- No advanced health checks - -**To Add** (Optional): -- **Istio integration**: For precise % traffic splitting, header-based routing -- **Flagger**: Automated canary analysis with Prometheus thresholds -- **Linkerd**: For distributed tracing and observability -- **Longer observation**: Change `pause: duration: 2m` to `5m` or `10m` - -## Questions? - -See: -- `/home/paul/git/conf/f3s/ROLLOUTS-SETUP.md` - Complete setup & scenarios -- `/home/paul/git/conf/f3s/tracing-demo/ROLLOUTS-DEMO.md` - Detailed walkthrough -- `/home/paul/git/conf/f3s/argo-rollouts/README.md` - Controller-specific info diff --git a/f3s/README-ROLLOUTS.md b/f3s/README-ROLLOUTS.md deleted file mode 100644 index b038bf9..0000000 --- a/f3s/README-ROLLOUTS.md +++ /dev/null @@ -1,226 +0,0 @@ -# Argo Rollouts - Quick Reference - -Progressive delivery (canary deployments) for the f3s cluster. - -## TL;DR - Get Started in 5 Minutes - -```bash -# 1. Install controller -cd /home/paul/git/conf/f3s/argo-rollouts -just install - -# 2. Wait for ArgoCD sync (or force) -argocd app sync argo-rollouts -argocd app sync tracing-demo - -# 3. Verify setup -cd /home/paul/git/conf/f3s/tracing-demo -just rollout-status - -# 4. Run a demo (Terminal 1) -just rollout-watch - -# 5. 
Trigger in another terminal (Terminal 2) -kubectl patch rollout tracing-demo-frontend -n services \ - --type='json' \ - -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]' - -# 6. Watch progress in Terminal 1 (~90 seconds total) -``` - -Expected flow: -- 0-15 sec: **33% traffic** to canary (1 new pod, 3 old pods) -- 15-60 sec: **Monitor** (paused, observing canary health) -- 60+ sec: **Auto-promote to 100%** (scales all 3 pods to new version) -- ~90 sec: **Complete** (all 3 pods running new version) - -## Files Created - -### Setup & Installation -- `argo-rollouts/Justfile` - Install/manage controller -- `argo-rollouts/values.yaml` - Helm config -- `argocd-apps/cicd/argo-rollouts.yaml` - ArgoCD app - -### Demo App Configuration -- `tracing-demo/helm-chart/templates/frontend-rollout.yaml` - Canary definition -- `tracing-demo/Justfile` - New `just rollout-*` commands -- `tracing-demo/rollout-demo.sh` - Demo automation script - -### Documentation -- `ARGO-ROLLOUTS-SUMMARY.md` - **START HERE** - Full overview -- `ROLLOUTS-SETUP.md` - **DETAILED GUIDE** - 5 demo scenarios -- `ROLLOUTS-CHECKLIST.md` - **DEPLOYMENT CHECKLIST** - Step-by-step -- `tracing-demo/ROLLOUTS-DEMO.md` - Technical walkthrough -- `README-ROLLOUTS.md` - This file - -## Why Canary Deployments? 
- -**Old way (Deployment)**: -- 2 old pods → removed -- 2 new pods → created -- ~5 seconds of potential traffic loss -- No way to validate before 100% rollout - -**New way (Rollout with Canary)**: -- 3 old pods → 3 old + 1 new (33% traffic to canary) -- Observe for 1 minute -- If healthy → automatically promote all 3 pods to new version -- If unhealthy → abort, revert to 3 old pods -- Zero downtime, validated before full rollout - -## Common Commands - -```bash -cd /home/paul/git/conf/f3s/tracing-demo - -# Watch rollout progress (real-time) -just rollout-watch - -# Check current status -just rollout-status - -# Detailed info -just rollout-info - -# Abort and rollback (prevents auto-promotion) -just rollout-abort - -# View history -just rollout-history - -# Generate load during rollout -just load-test -``` - -## What Happens During Canary - -### Step 1: 33% Traffic (0-15 seconds) -``` -Frontend Service -├── Stable ReplicaSet (old version): 3 pods → receives 67% traffic -└── Canary ReplicaSet (new version): 1 pod → receives 33% traffic -``` - -Monitor during this phase: -- Error rates -- Response latency -- Logs and traces -- Prometheus metrics - -### Step 2: Pause (15-60 seconds) -``` -Service pauses traffic shift, monitoring canary health: -- Auto-promotion after 1 minute if healthy -- Or abort: kubectl argo rollouts abort ... 
to stop -``` - -### Step 3: 100% Traffic (60+ seconds) -``` -Frontend Service -├── Stable ReplicaSet (new version): 3 pods → receives 100% traffic -└── Canary ReplicaSet (old version): 0 pods → terminated -``` - -## Architecture - -``` -Git Commit (new image) - ↓ -Git Server (conf.git) - ↓ -ArgoCD detects change - ↓ -Updates Rollout resource - ↓ -Argo Rollouts Controller - ↓ - ├─→ Scales Canary ReplicaSet (1 new pod) - ├─→ Frontend Service routes 33/67 traffic - ├─→ Monitors health/metrics for 1 minute - └─→ Auto-promotes if healthy - ├─→ If healthy: Scale to 3 new, remove old - └─→ If abort: Remove canary, keep 3 old -``` - -## Demo Scenarios - -See `ROLLOUTS-SETUP.md` for complete walkthrough of: - -1. **Basic Canary** - Watch 50% → 100% progression -2. **Manual Promotion** - Skip waiting with `just rollout-promote` -3. **Abort/Rollback** - Fail canary and revert -4. **Prometheus Monitoring** - Track metrics during rollout -5. **GitOps Flow** - Commit code, watch auto-rollout - -## Monitoring - -### Command-line -```bash -# Real-time watch -kubectl argo rollouts get rollout tracing-demo-frontend -n services --watch - -# Check metrics -kubectl top pods -n services -l app=tracing-demo-frontend -``` - -### Grafana -https://grafana.f3s.buetow.org - -1. Explore → Tempo -2. Query: `{ resource.service.name = "frontend" }` -3. 
See traces from old and new versions - -### Prometheus -```bash -# Port-forward -kubectl port-forward -n monitoring svc/prometheus 9090:9090 -# Open http://localhost:9090 - -# Query pod status -kube_pod_status_phase{namespace="services", pod=~".*frontend.*"} -``` - -## Troubleshooting - -**Controller not running?** -```bash -kubectl get pods -n cicd -l app.kubernetes.io/name=argo-rollouts -kubectl logs -n cicd -l app.kubernetes.io/name=argo-rollouts -``` - -**Rollout stuck?** -```bash -kubectl describe rollout tracing-demo-frontend -n services -kubectl get pods -n services -l app=tracing-demo-frontend -``` - -**Need plugin?** -```bash -curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64 -sudo install -m 755 kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts -``` - -## Next Steps - -1. Complete setup using `ROLLOUTS-CHECKLIST.md` -2. Run demo scenarios from `ROLLOUTS-SETUP.md` -3. Share with team -4. Optional: Add Istio for advanced traffic routing -5. Optional: Deploy Flagger for automated analysis -6. 
Migrate other services to Rollout - -## Key Resources - -| File | Purpose | -|------|---------| -| `ARGO-ROLLOUTS-SUMMARY.md` | Architecture & what was created | -| `ROLLOUTS-SETUP.md` | Complete setup & 5 demo scenarios | -| `ROLLOUTS-CHECKLIST.md` | Step-by-step deployment | -| `tracing-demo/ROLLOUTS-DEMO.md` | Technical details & troubleshooting | -| `argo-rollouts/README.md` | Controller installation guide | - -## Support - -- Argo Rollouts Docs: https://argoproj.github.io/argo-rollouts/ -- Canary Strategy: https://argoproj.github.io/argo-rollouts/features/canary/ -- Kubectl Plugin: https://argoproj.github.io/argo-rollouts/getting-started/#using-kubectl-with-argo-rollouts diff --git a/f3s/ROLLOUTS-CHECKLIST.md b/f3s/ROLLOUTS-CHECKLIST.md deleted file mode 100644 index b475f2d..0000000 --- a/f3s/ROLLOUTS-CHECKLIST.md +++ /dev/null @@ -1,222 +0,0 @@ -# Argo Rollouts Deployment Checklist - -Quick checklist for deploying and testing Argo Rollouts with canary demo. - -## Installation - -- [ ] Read `ARGO-ROLLOUTS-SUMMARY.md` - understand what was created -- [ ] Ensure kubectl access to f3s cluster -- [ ] Ensure ArgoCD is running -- [ ] Navigate to `/home/paul/git/conf/f3s/argo-rollouts` -- [ ] Run `just install` -- [ ] Verify controller: `kubectl get pods -n cicd -l app.kubernetes.io/name=argo-rollouts` -- [ ] Verify CRD: `kubectl get crd | grep rollout` -- [ ] (Optional) Install plugin: - ```bash - curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64 - chmod +x kubectl-argo-rollouts-linux-amd64 - sudo install -m 755 kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts - kubectl argo rollouts version - ``` - -## ArgoCD Integration - -- [ ] Push changes to git-server: - ```bash - cd /home/paul/git/conf/f3s - git add -A && git commit -m "feat: add Argo Rollouts" - git push r0 master - ``` -- [ ] Verify ArgoCD app: - ```bash - kubectl get application argo-rollouts -n cicd - argocd app get 
argo-rollouts - ``` -- [ ] Verify tracing-demo app: - ```bash - kubectl get application tracing-demo -n cicd - argocd app get tracing-demo - ``` - -## Rollout Verification - -- [ ] Check rollout exists: `kubectl get rollout tracing-demo-frontend -n services` -- [ ] Verify status: `kubectl describe rollout tracing-demo-frontend -n services` -- [ ] Expected: `Status: Healthy` with `3/3 replicas` in stable state -- [ ] Check pods: `kubectl get pods -n services -l app=tracing-demo-frontend` -- [ ] All 3 pods should be `Running` - -## Demo: Basic Canary Rollout - -**Expected: 0-15s: canary starting, 15-60s: observing, 60-90s: promoting** - -### Terminal 1: Watch Rollout -```bash -cd /home/paul/git/conf/f3s/tracing-demo -just rollout-watch -``` -- [ ] Command runs and connects to cluster -- [ ] Waiting for rollout to start - -### Terminal 2: Trigger Rollout -```bash -kubectl patch rollout tracing-demo-frontend -n services \ - --type='json' \ - -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]' -``` -- [ ] Patch command successful -- [ ] Terminal 1 shows change immediately - -### Terminal 1: Observe Progress -- [ ] See `Step: 0/3, SetWeight: 33` -- [ ] 1 canary pod becoming ready -- [ ] 3 stable pods still running -- [ ] After ~15 sec: canary pod ready -- [ ] After ~60 sec: auto-promotion starts -- [ ] After ~90 sec: all 3 pods running new version -- [ ] Status shows `Healthy` - -## Demo: Abort/Rollback - -**Expected: Stop rollout and keep old version running** - -### Terminal 1: Watch Rollout -```bash -just rollout-watch -``` - -### Terminal 2: Trigger Rollout -```bash -kubectl patch rollout tracing-demo-frontend -n services \ - --type='json' \ - -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V2","value":"'$(date +%s)'"}}]' -``` - -### Terminal 3: Abort at Canary Step (after 20 seconds) -```bash -cd /home/paul/git/conf/f3s/tracing-demo -just rollout-abort -``` 
-- [ ] Abort command accepted -- [ ] Terminal 1 shows `Status: Aborted` -- [ ] Canary pods terminate -- [ ] Old 3 pods continue running -- [ ] Verify with: `just rollout-status` - -## Demo: Load Testing - -**Expected: Generate traffic while rollout happens** - -### Terminal 1: Watch Rollout -```bash -just rollout-watch -``` - -### Terminal 2: Start Load Test -```bash -just load-test & -``` -- [ ] Requests being sent - -### Terminal 3: Trigger Rollout -```bash -kubectl patch rollout tracing-demo-frontend -n services \ - --type='json' \ - -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V3","value":"'$(date +%s)'"}}]' -``` -- [ ] Rollout progresses with active traffic -- [ ] Both old and new pods serve requests during canary phase - -## Monitoring - -- [ ] Check status: `kubectl argo rollouts status tracing-demo-frontend -n services` -- [ ] Detailed info: `kubectl argo rollouts describe rollout tracing-demo-frontend -n services` -- [ ] Pod details: `kubectl get pods -n services -l app=tracing-demo-frontend -o wide` -- [ ] View logs: `just logs-frontend` -- [ ] View history: `just rollout-history` - -## Grafana (Optional) - -- [ ] Open Grafana: https://grafana.f3s.buetow.org -- [ ] Navigate to Explore → Tempo datasource -- [ ] Query: `{ resource.service.name = "frontend" }` -- [ ] See traces from old and new versions during rollout - -## Integration with Git (GitOps) - -- [ ] Edit rollout config: - ```bash - nano /home/paul/git/conf/f3s/tracing-demo/helm-chart/templates/frontend-rollout.yaml - ``` -- [ ] Change any settings (e.g., duration, setWeight) -- [ ] Commit and push: - ```bash - git add -A && git commit -m "chore: adjust canary settings" - git push r0 master - ``` -- [ ] ArgoCD auto-syncs within 3 minutes (or force): - ```bash - kubectl annotate application tracing-demo -n cicd argocd.argoproj.io/refresh=normal --overwrite - ``` -- [ ] New settings take effect on next rollout trigger - -## Post-Demo - -- [ ] Abort any 
stuck rollouts: `just rollout-abort` -- [ ] Verify stable state: `just rollout-status` shows `Healthy` -- [ ] Review documentation: - - [ ] `ARGO-ROLLOUTS-SUMMARY.md` - architecture - - [ ] `ROLLOUTS-SETUP.md` - detailed scenarios - - [ ] `README-ROLLOUTS.md` - quick reference - - [ ] `tracing-demo/ROLLOUTS-DEMO.md` - technical details - -## Troubleshooting - -### Controller not running -```bash -kubectl get pods -n cicd -l app.kubernetes.io/name=argo-rollouts -kubectl logs -n cicd -l app.kubernetes.io/name=argo-rollouts -``` -- [ ] Pod running and ready - -### Rollout not deployed -```bash -kubectl get rollout tracing-demo-frontend -n services -kubectl describe rollout tracing-demo-frontend -n services -``` -- [ ] Check events section for errors - -### Canary pods in ImagePullBackoff -- [ ] Use env var patch instead (don't change image tag): - ```bash - kubectl patch rollout tracing-demo-frontend -n services \ - --type='json' \ - -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]' - ``` - -### Rollout stuck in Progressing -```bash -kubectl describe rollout tracing-demo-frontend -n services -kubectl get pods -n services -l app=tracing-demo-frontend -``` -- [ ] Check pod readiness probes -- [ ] Check pod resource requests/limits -- [ ] Check controller logs - -## Next Steps - -- [ ] Run through all demo scenarios multiple times -- [ ] Modify rollout settings and observe behavior -- [ ] Monitor with Prometheus/Grafana -- [ ] Extend to other services (middleware, backend) -- [ ] Optional: Install Istio for advanced traffic routing -- [ ] Optional: Deploy Flagger for automated analysis - ---- - -**Setup Complete When:** -- ✅ Controller running in `cicd` namespace -- ✅ Rollout deployed in `services` namespace -- ✅ One full demo executed (0-90 seconds) -- ✅ Can abort and retry -- ✅ Team trained on canary deployments diff --git a/f3s/ROLLOUTS-FILE-TREE.txt b/f3s/ROLLOUTS-FILE-TREE.txt deleted file mode 
100644 index 6c85754..0000000 --- a/f3s/ROLLOUTS-FILE-TREE.txt +++ /dev/null @@ -1,183 +0,0 @@ -/home/paul/git/conf/f3s/ -├── README-ROLLOUTS.md ← ENTRY POINT (quick reference) -├── ARGO-ROLLOUTS-SUMMARY.md ← Full architecture & overview -├── ROLLOUTS-SETUP.md ← Detailed setup + 5 scenarios -├── ROLLOUTS-CHECKLIST.md ← Step-by-step deployment -├── ROLLOUTS-FILE-TREE.txt ← This file -│ -├── argo-rollouts/ ← NEW: Argo Rollouts Controller -│ ├── Justfile ← Install/upgrade/uninstall -│ ├── values.yaml ← Helm configuration -│ └── README.md ← Controller-specific guide -│ -├── argocd-apps/ -│ ├── cicd/ -│ │ ├── git-server.yaml -│ │ └── argo-rollouts.yaml ← NEW: Controller app -│ │ -│ └── services/ -│ ├── tracing-demo.yaml ← UPDATED: Deployment → Rollout -│ └── ... (other apps) -│ -├── tracing-demo/ -│ ├── README.md -│ ├── Justfile ← UPDATED: Added rollout commands -│ ├── ROLLOUTS-DEMO.md ← NEW: Technical walkthrough -│ ├── rollout-demo.sh ← NEW: Demo automation -│ │ -│ └── helm-chart/ -│ ├── Chart.yaml -│ └── templates/ -│ ├── frontend-rollout.yaml ← NEW: Canary rollout definition -│ ├── frontend-deployment.yaml ← KEPT: For reference -│ ├── middleware-deployment.yaml ← (unchanged) -│ ├── backend-deployment.yaml ← (unchanged) -│ ├── frontend-service.yaml -│ ├── middleware-service.yaml -│ ├── backend-service.yaml -│ └── ingress.yaml -│ -└── ... 
(other apps unchanged) - - -═══════════════════════════════════════════════════════════════════════════ - -INSTALLATION SUMMARY -═══════════════════════════════════════════════════════════════════════════ - -Step 1: Install Controller - cd /home/paul/git/conf/f3s/argo-rollouts - just install - -Step 2: Verify ArgoCD - argocd app sync argo-rollouts - argocd app sync tracing-demo - -Step 3: Watch Demo - cd /home/paul/git/conf/f3s/tracing-demo - just rollout-watch - -Step 4: Trigger Rollout (in another terminal) - kubectl patch rollout tracing-demo-frontend -n services \ - --type='json' \ - -p='[{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"registry.lan.buetow.org:30001/tracing-demo-frontend:latest"}]' - -═══════════════════════════════════════════════════════════════════════════ - -DOCUMENTATION ROADMAP -═══════════════════════════════════════════════════════════════════════════ - -NEW TO ARGO ROLLOUTS? - 1. Read: README-ROLLOUTS.md (3 min) - 2. Read: ARGO-ROLLOUTS-SUMMARY.md (10 min) - 3. Follow: ROLLOUTS-CHECKLIST.md (step-by-step) - -WANT DETAILED GUIDE? - → ROLLOUTS-SETUP.md - - Complete setup instructions - - 5 demo scenarios with expected output - - Monitoring dashboards - - Advanced patterns - -DOING THE DEPLOYMENT? - → ROLLOUTS-CHECKLIST.md - - Pre-deployment checks - - Installation steps - - Verification - - Troubleshooting - -TROUBLESHOOTING? 
- → ROLLOUTS-SETUP.md → Troubleshooting section - → argo-rollouts/README.md - → tracing-demo/ROLLOUTS-DEMO.md - -═══════════════════════════════════════════════════════════════════════════ - -KEY FILES EXPLAINED -═══════════════════════════════════════════════════════════════════════════ - -argo-rollouts/Justfile - - Automates installation of Argo Rollouts controller - - Commands: install, upgrade, uninstall, status, logs - - Deploys to: cicd namespace - -argo-rollouts/values.yaml - - Helm chart configuration for Argo Rollouts - - Sets resource limits, metrics, replicas - -argocd-apps/cicd/argo-rollouts.yaml - - ArgoCD Application resource - - Manages controller installation via GitOps - - Auto-syncs when argo-rollouts/ changes in git - -tracing-demo/helm-chart/templates/frontend-rollout.yaml - - Replaces frontend-deployment.yaml - - Defines canary strategy: - * Step 1: 50% traffic - * Step 2: 2-minute pause - * Step 3: 100% promotion - - Keeps same pods, volumes, env vars as Deployment - -tracing-demo/Justfile (updated) - - New commands for rollout management - - just rollout-watch - - just rollout-status - - just rollout-promote - - just rollout-abort - - just rollout-history - -tracing-demo/rollout-demo.sh - - Automation script for demo - - Checks prerequisites - - Guides through demo workflow - - Can be extended for CI/CD - -═══════════════════════════════════════════════════════════════════════════ - -WHAT CHANGED IN EXISTING FILES -═══════════════════════════════════════════════════════════════════════════ - -tracing-demo/Justfile - [+] 8 new rollout commands - [-] No breaking changes to existing commands - -tracing-demo/helm-chart/templates/frontend-deployment.yaml - [~] Still exists (for reference, not deployed) - [→] Replaced by frontend-rollout.yaml in deployment - -argocd-apps/services/tracing-demo.yaml - [+] RespectIgnoreDifferences=true sync option - [-] No other changes (points to same Helm chart) - 
-═══════════════════════════════════════════════════════════════════════════ - -WHAT DID NOT CHANGE -═══════════════════════════════════════════════════════════════════════════ - -✓ Middleware & Backend services remain Deployments -✓ All service definitions (frontend, middleware, backend services) -✓ Ingress configuration -✓ All other apps (audiobookshelf, miniflux, etc.) -✓ ArgoCD configuration & installation -✓ Prometheus/Grafana setup - -═══════════════════════════════════════════════════════════════════════════ - -HOW TO NAVIGATE THIS -═══════════════════════════════════════════════════════════════════════════ - -If you want to... See... -──────────────────────────────────────────────────────────────────────────── -Understand what was created ARGO-ROLLOUTS-SUMMARY.md -Get started quickly README-ROLLOUTS.md -Deploy step-by-step ROLLOUTS-CHECKLIST.md -See detailed scenarios & examples ROLLOUTS-SETUP.md -Troubleshoot issues ROLLOUTS-SETUP.md (Troubleshooting section) -Learn technical details tracing-demo/ROLLOUTS-DEMO.md -Install the controller argo-rollouts/Justfile + argo-rollouts/README.md -See the rollout definition tracing-demo/helm-chart/templates/frontend-rollout.yaml -Run a demo tracing-demo/rollout-demo.sh or just rollout-watch -Monitor during rollout Prometheus/Grafana (see ROLLOUTS-SETUP.md) -Integrate with CI/CD See ROLLOUTS-SETUP.md section "GitOps Flow" - -═══════════════════════════════════════════════════════════════════════════ diff --git a/f3s/ROLLOUTS-SETUP.md b/f3s/ROLLOUTS-SETUP.md deleted file mode 100644 index 0ea965c..0000000 --- a/f3s/ROLLOUTS-SETUP.md +++ /dev/null @@ -1,373 +0,0 @@ -# Argo Rollouts Setup and Demo Guide - -Complete setup and demonstration of Argo Rollouts with the tracing-demo application. Canary strategy: 33% traffic (1 pod) for 1 minute, then auto-promote to 100%. - -## Quick Setup - -### 1. 
Install Argo Rollouts Controller - -```bash -cd /home/paul/git/conf/f3s/argo-rollouts -just install -``` - -Verify installation: -```bash -kubectl get pods -n cicd -l app.kubernetes.io/name=argo-rollouts -kubectl get crd | grep rollout -``` - -### 2. Install kubectl Plugin (Optional but Recommended) - -```bash -curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64 -chmod +x kubectl-argo-rollouts-linux-amd64 -sudo install -m 755 kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts -``` - -Verify: -```bash -kubectl argo rollouts version -``` - -### 3. Sync ArgoCD with New Applications - -```bash -argocd app sync argo-rollouts -argocd app sync tracing-demo -``` - -### 4. Verify Rollout is Deployed - -```bash -kubectl get rollout tracing-demo-frontend -n services -kubectl describe rollout tracing-demo-frontend -n services -``` - -Expected status: `Healthy` with `3/3 replicas` in stable state. - -## Quick Demo (90 seconds) - -### Terminal 1 - Watch Progress - -```bash -cd /home/paul/git/conf/f3s/tracing-demo -just rollout-watch -``` - -Or use the kubectl command directly: -```bash -kubectl argo rollouts get rollout tracing-demo-frontend -n services --watch -``` - -### Terminal 2 - Trigger Rollout - -Wait 10 seconds for Terminal 1 to start watching, then trigger: - -```bash -kubectl patch rollout tracing-demo-frontend -n services \ - --type='json' \ - -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]' -``` - -### Watch the Timeline - -**Terminal 1 will show:** - -``` -Step: 0/3 -SetWeight: 33 -Canary: 1 pod (new version) - starting -Stable: 3 pods (old version) - handling requests -``` - -→ After 15 seconds, canary pod becomes ready: - -``` -Step: 1/3 -SetWeight: 33 -Canary: 1 pod (new version) - ready, receiving 33% traffic -Stable: 3 pods (old version) - receiving 67% traffic -``` - -→ After ~60 seconds, auto-promotion begins: 
- -``` -Step: 2/3 -SetWeight: 100 -Canary scaling → Stable -``` - -→ After ~90 seconds, complete: - -``` -Status: Healthy -Replicas: 3/3 all running new version -``` - -## Demo Scenarios - -### Scenario 1: Observe the Full Rollout - -Just follow the "Quick Demo" above. Watch all three steps progress automatically over 90 seconds. - -### Scenario 2: Abort Rollout (Simulate Failure) - -**Terminal 1**: Watch the rollout -```bash -just rollout-watch -``` - -**Terminal 2**: Trigger rollout -```bash -kubectl patch rollout tracing-demo-frontend -n services \ - --type='json' \ - -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]' -``` - -**Terminal 3 (while at step 1)**: Abort the rollout -```bash -cd /home/paul/git/conf/f3s/tracing-demo -just rollout-abort -``` - -Result: -- Canary pods terminate -- Old 3 pods continue running -- Status shows "Aborted" - -Verify: -```bash -just rollout-status -``` - -### Scenario 3: Load Testing During Rollout - -**Terminal 1**: Watch rollout -```bash -just rollout-watch -``` - -**Terminal 2**: Start load test -```bash -just load-test & -``` - -**Terminal 3**: Trigger rollout -```bash -kubectl patch rollout tracing-demo-frontend -n services \ - --type='json' \ - -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]' -``` - -Load test will hit both old and new pods during the 1-minute canary window. - -### Scenario 4: Check Logs During Rollout - -**Terminal 1**: Watch rollout -```bash -just rollout-watch -``` - -**Terminal 2**: Trigger rollout -```bash -kubectl patch rollout tracing-demo-frontend -n services \ - --type='json' \ - -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]' -``` - -**Terminal 3**: Watch logs -```bash -kubectl logs -n services -l app=tracing-demo-frontend -f --tail=20 -``` - -See logs from both old and new pods. 
-
-### Scenario 5: Monitor via Grafana Tempo (Distributed Tracing)
-
-**Terminal 1**: Watch rollout
-```bash
-just rollout-watch
-```
-
-**Terminal 2**: Trigger rollout
-```bash
-kubectl patch rollout tracing-demo-frontend -n services \
-  --type='json' \
-  -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]'
-```
-
-**Terminal 3**: Open Grafana
-1. Navigate to https://grafana.f3s.buetow.org
-2. Go to Explore → Select "Tempo" datasource
-3. Query: `{ resource.service.name = "frontend" }`
-4. See traces from both old and new versions during canary phase
-
-## Timeline Breakdown
-
-| Time | Event | Status |
-|------|-------|--------|
-| 0s | Trigger rollout | Rollout starts |
-| 0-5s | Canary pod created | `Step 0/3: SetWeight 33` |
-| 5-15s | Canary pod becoming ready | Still not ready |
-| 15s | Canary pod ready | `Step 1/3: SetWeight 33, canary ready` |
-| 15-60s | Observing canary | Requests split 67/33 (old/new) |
-| 60s | Auto-promotion triggered | `Step 2/3: SetWeight 100` |
-| 60-70s | Scaling new pods | Canary → Stable |
-| 70-80s | Terminating old pods | Old pods scaling down |
-| ~90s | Complete | `Status: Healthy, 3/3 replicas` |
-
-## Monitoring During Rollout
-
-### kubectl Commands
-
-Real-time status:
-```bash
-kubectl argo rollouts get rollout tracing-demo-frontend -n services --watch
-```
-
-Check specific details:
-```bash
-kubectl argo rollouts describe rollout tracing-demo-frontend -n services
-kubectl argo rollouts history tracing-demo-frontend -n services
-```
-
-Pod status:
-```bash
-kubectl get pods -n services -l app=tracing-demo-frontend -o wide
-```
-
-### Prometheus Metrics
-
-```bash
-# Port-forward Prometheus
-kubectl port-forward -n monitoring svc/prometheus 9090:9090
-```
-
-Then query:
-```promql
-# Pod counts during rollout
-kube_replicaset_replicas{replicaset=~"tracing-demo-frontend.*"}
-
-# Pod status
-kube_pod_status_phase{namespace="services", pod=~"tracing-demo-frontend.*"}
-
-# Pod age (shows which are old vs new)
-time() - kube_pod_created{namespace="services", pod=~"tracing-demo-frontend.*"}
-```
-
-### Grafana Dashboards
-
-1. Open Grafana: https://grafana.f3s.buetow.org
-2. Explore → Tempo datasource
-3. Query: `{ resource.service.name = "frontend" }`
-4. See traces from old and new versions
-5. Notice latency/error differences during rollout
-
-## Rollout Configuration
-
-Located in: `/home/paul/git/conf/f3s/tracing-demo/helm-chart/templates/frontend-rollout.yaml`
-
-Key settings:
-```yaml
-replicas: 3  # 3 pods total
-strategy:
-  canary:
-    steps:
-    - setWeight: 33      # Send 1 pod (33%) to canary
-    - pause:
-        duration: 1m     # Wait 1 minute, then auto-promote
-    - setWeight: 100     # Promote all to new version
-```
-
-To modify pause duration:
-```bash
-# Edit the file
-nano /home/paul/git/conf/f3s/tracing-demo/helm-chart/templates/frontend-rollout.yaml
-
-# Change duration: 1m to duration: 5m (for example)
-# Then commit and push
-git add -A && git commit -m "chore: extend canary pause to 5 minutes"
-git push r0 master
-```
-
-ArgoCD will auto-sync the new rollout configuration.
-
-## Troubleshooting
-
-### Rollout shows "ErrImagePull" on canary pod
-
-This happens if using an image tag that doesn't exist. The env var patch approach forces a rollout without changing the image, so use:
-
-```bash
-kubectl patch rollout tracing-demo-frontend -n services \
-  --type='json' \
-  -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]'
-```
-
-### Rollout stuck in "Progressing"
-
-Check pod status:
-```bash
-kubectl describe rollout tracing-demo-frontend -n services
-kubectl get pods -n services -l app=tracing-demo-frontend
-```
-
-Check controller logs:
-```bash
-kubectl logs -n cicd -l app.kubernetes.io/name=argo-rollouts --tail=50
-```
-
-### Controller not running
-
-```bash
-kubectl get pods -n cicd -l app.kubernetes.io/name=argo-rollouts
-kubectl logs -n cicd -l app.kubernetes.io/name=argo-rollouts
-```
-
-### Auto-promotion not happening
-
-Verify pause duration is set:
-```bash
-kubectl get rollout tracing-demo-frontend -n services -o yaml | grep -A 5 "pause:"
-```
-
-## Advanced: Modify Canary Parameters
-
-### Increase observation time to 5 minutes
-
-```bash
-# Edit rollout YAML
-nano /home/paul/git/conf/f3s/tracing-demo/helm-chart/templates/frontend-rollout.yaml
-
-# Change:
-# - pause:
-#     duration: 1m
-# To:
-# - pause:
-#     duration: 5m
-
-git add -A && git commit -m "chore: extend canary pause to 5 minutes"
-git push r0 master
-```
-
-### Reduce traffic weight to canary (more conservative)
-
-```yaml
-steps:
-- setWeight: 10      # Only 10% traffic (0.3 pods worth)
-- pause:
-    duration: 2m     # Observe longer
-- setWeight: 100
-```
-
-### Add health check analysis (requires Flagger or ArgoCD Analysis)
-
-For automated rollback based on error rate thresholds, see `/home/paul/git/conf/f3s/ROLLOUTS-SETUP.md` → "Advanced: Custom Analysis" section.
-
-## References
-
-- [Argo Rollouts Canary Strategy](https://argoproj.github.io/argo-rollouts/features/canary/)
-- [Argo Rollouts Best Practices](https://argoproj.github.io/argo-rollouts/best-practices/)
-- [kubectl-argo-rollouts Plugin](https://argoproj.github.io/argo-rollouts/getting-started/#using-kubectl-with-argo-rollouts)
-- [Flagger for Automated Analysis](https://flagger.app/)
diff --git a/f3s/tracing-demo/ARGO-ROLLOUTS-SUMMARY.md b/f3s/tracing-demo/ARGO-ROLLOUTS-SUMMARY.md
new file mode 100644
index 0000000..80adc23
--- /dev/null
+++ b/f3s/tracing-demo/ARGO-ROLLOUTS-SUMMARY.md
@@ -0,0 +1,248 @@
+# Argo Rollouts Implementation Summary
+
+## What Was Created
+
+### 1. Argo Rollouts Controller Installation
+**Location**: `/home/paul/git/conf/f3s/argo-rollouts/`
+
+Files:
+- `Justfile` - Installation automation
+- `values.yaml` - Helm configuration
+- `README.md` - Installation guide
+
+Deployment:
+```bash
+cd /home/paul/git/conf/f3s/argo-rollouts
+just install
+```
+
+Also registered in ArgoCD: `/home/paul/git/conf/f3s/argocd-apps/cicd/argo-rollouts.yaml`
+
+### 2. Frontend Rollout Manifest
+**Location**: `/home/paul/git/conf/f3s/tracing-demo/helm-chart/templates/frontend-rollout.yaml`
+
+**Replaces**: `frontend-deployment.yaml` (kept for reference)
+
+**Strategy**: Canary with 1-minute observation window
+```
+Step 1: 33% traffic to new version (1 new pod, 3 old pods)
+Step 2: Pause 1 minute (observation period)
+Step 3: 100% traffic to new version (auto-promote)
+```
+
+**Why Frontend?**
+- Has 2 replicas (good for canary demo)
+- User-facing (can observe behavior easily)
+- Generates traces (can monitor impact)
+- Non-critical for cluster health
+
+### 3. Demo Documentation
+
+**`/home/paul/git/conf/f3s/tracing-demo/ROLLOUTS-DEMO.md`**
+- Comprehensive walkthrough
+- Real-time monitoring commands
+- Troubleshooting guide
+- Advanced patterns
+
+**`/home/paul/git/conf/f3s/ROLLOUTS-SETUP.md`**
+- Quick setup instructions
+- 5 demo scenarios (basic, manual, abort, prometheus, gitops)
+- Expected output and timings
+- Monitoring dashboard examples
+
+**`/home/paul/git/conf/f3s/tracing-demo/rollout-demo.sh`**
+- Automated demo starter script
+- Checks prerequisites
+- Provides instructions
+
+### 4. Enhanced Justfile Commands
+**Location**: `/home/paul/git/conf/f3s/tracing-demo/Justfile`
+
+New commands:
+```bash
+just rollout-watch      # Watch progress in real-time
+just rollout-status     # Check current status
+just rollout-info       # Detailed information
+just rollout-promote    # Skip waiting, promote to 100%
+just rollout-abort      # Abort current rollout
+just rollout-history    # View past rollouts
+just rollout-demo       # Start demo script
+```
+
+### 5. Updated ArgoCD Application
+**Location**: `/home/paul/git/conf/f3s/argocd-apps/services/tracing-demo.yaml`
+
+Added sync option: `RespectIgnoreDifferences=true` to gracefully handle migration from Deployment to Rollout.
+
+## Architecture
+
+```
+┌─────────────────────────────────────────┐
+│         Kubernetes Cluster              │
+├─────────────────────────────────────────┤
+│                                         │
+│  ┌──────────────────┐                   │
+│  │  ArgoCD (cicd)   │                   │
+│  └────────┬─────────┘                   │
+│           │                             │
+│           └──→ Git Repository           │
+│                (conf.git)               │
+│                                         │
+│  ┌──────────────────────────────────┐   │
+│  │ Argo Rollouts Controller (cicd)  │   │
+│  │ - Manages Rollout resources      │   │
+│  │ - Orchestrates canary            │   │
+│  │ - Monitors replica sets          │   │
+│  └──────────────────────────────────┘   │
+│           ▲                             │
+│           │ watches                     │
+│           │                             │
+│  ┌────────────────────────────────────┐ │
+│  │ tracing-demo-frontend Rollout      │ │
+│  │ ┌──────────────┐   ┌──────────────┐│ │
+│  │ │ Stable RS    │   │ Canary RS    ││ │
+│  │ │ 3 replicas   │   │ 1 replica    ││ │
+│  │ └──────────────┘   └──────────────┘│ │
+│  │                                    │ │
+│  │ Endpoints: frontend-service        │ │
+│  │ - Selects both RS (proportional)   │ │
+│  │ - Routes traffic to 67%/33%        │ │
+│  └────────────────────────────────────┘ │
+│                                         │
+│  ┌──────────────────┐                   │
+│  │  Middleware      │   ┌──────────────┐│
+│  │  Backend         │   │ Deployment   ││
+│  │  (unchanged)     │   │ (unchanged)  ││
+│  └──────────────────┘   └──────────────┘│
+│                                         │
+└─────────────────────────────────────────┘
+     Monitoring (Prometheus/Grafana)
+```
+
+## Key Differences: Deployment vs Rollout
+
+| Aspect | Deployment | Rollout |
+|--------|------------|---------|
+| **Update Strategy** | RollingUpdate (all or nothing) | Canary, Blue-Green, A/B |
+| **Traffic Split** | No built-in support | Native pod-level splitting |
+| **Pause/Resume** | No | Yes (at canary steps) |
+| **Automatic Rollback** | No (manual `rollout undo`) | Yes (if health checks fail) |
+| **Visibility** | kubectl rollout status | kubectl argo rollouts get --watch |
+| **Observability** | Basic pod counts | Detailed step information |
+
+## How It Works
+
+### Normal Deployment (Traditional)
+```
+kubectl apply → All pods immediately scale up/down
+Old pods: 2 → 0
+New pods: 0 → 2
+Users affected: ~5 seconds of traffic loss risk
+```
+
+### Canary Rollout (New)
+```
+Git commit → ArgoCD detects → Argo Rollouts orchestrates
+
+Step 1 (50% traffic):
+  Stable: 2 pods → 1 pod (old version)
+  Canary: 0 pods → 1 pod (new version)
+  Users see: 50% old, 50% new for 0-2 minutes
+
+Step 2 (Pause):
+  Stable: 1 pod (old)
+  Canary: 1 pod (new)
+  Observe metrics, logs, error rates for 2 minutes
+
+Step 3 (100% traffic):
+  Stable: 1 → 0 pods (old version terminated)
+  Canary: 1 → 2 pods (new version scales up)
+  Users see: 100% new version
+
+  Complete: Canary promoted to stable
+```
+
+## Demo Quick Start
+
+### 1. Install Everything
+```bash
+cd /home/paul/git/conf/f3s
+# Sync with ArgoCD (auto or manual)
+argocd app sync argo-rollouts
+argocd app sync tracing-demo
+```
+
+### 2. Verify Setup
+```bash
+cd /home/paul/git/conf/f3s/tracing-demo
+just rollout-status
+# Should show: Rollout is healthy
+```
+
+### 3. Run Demo
+```bash
+# Terminal 1: Watch rollout
+just rollout-watch
+
+# Terminal 2: Trigger rollout (modify git or patch)
+kubectl patch rollout tracing-demo-frontend -n services \
+  --type='json' \
+  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"registry.lan.buetow.org:30001/tracing-demo-frontend:latest"}]'
+```
+
+### 4. Observe
+- See canary step progress in Terminal 1
+- Optional: `just load-test` to generate traffic during rollout
+- After ~4 minutes: Rollout complete, 100% traffic to new version
+
+## Files Summary
+
+| Path | Purpose |
+|------|---------|
+| `argo-rollouts/Justfile` | Install/upgrade/check Argo Rollouts |
+| `argo-rollouts/values.yaml` | Helm configuration for controller |
+| `argo-rollouts/README.md` | Installation and basic usage |
+| `tracing-demo/helm-chart/templates/frontend-rollout.yaml` | Canary rollout definition |
+| `tracing-demo/Justfile` | Added `just rollout-*` commands |
+| `tracing-demo/ROLLOUTS-DEMO.md` | Detailed walkthrough |
+| `tracing-demo/rollout-demo.sh` | Demo starter script |
+| `argocd-apps/cicd/argo-rollouts.yaml` | ArgoCD Application for controller |
+| `argocd-apps/services/tracing-demo.yaml` | Updated to work with Rollout |
+| `ROLLOUTS-SETUP.md` | Complete setup guide with scenarios |
+| `ARGO-ROLLOUTS-SUMMARY.md` | This file |
+
+## Next Steps
+
+1. **Install controller**: `cd argo-rollouts && just install`
+2. **Wait for ArgoCD sync** or manually sync `argo-rollouts` and `tracing-demo` apps
+3. **Verify**: `just rollout-status` shows healthy
+4. **Run demo**: `just rollout-watch` + trigger in another terminal
+5. **Explore**: Try abort, promote, or different canary durations
+
+## Important Notes
+
+- **No service mesh required**: Uses native Kubernetes service-based routing
+- **Traffic splitting**: Proportional to pod counts (1 old, 1 new = 50/50)
+- **Auto-promotion**: After 2 minutes, canary automatically promotes to 100%
+- **Graceful**: ArgoCD correctly handles transition from Deployment → Rollout
+- **Reversible**: Can abort and keep old version running
+
+## Limitations & Future Work
+
+**Current (Basic Canary)**:
+- Simple replica-based traffic splitting
+- No header-based routing
+- No advanced health checks
+
+**To Add** (Optional):
+- **Istio integration**: For precise % traffic splitting, header-based routing
+- **Flagger**: Automated canary analysis with Prometheus thresholds
+- **Linkerd**: For distributed tracing and observability
+- **Longer observation**: Change `pause: duration: 2m` to `5m` or `10m`
+
+## Questions?
+
+See:
+- `/home/paul/git/conf/f3s/ROLLOUTS-SETUP.md` - Complete setup & scenarios
+- `/home/paul/git/conf/f3s/tracing-demo/ROLLOUTS-DEMO.md` - Detailed walkthrough
+- `/home/paul/git/conf/f3s/argo-rollouts/README.md` - Controller-specific info
diff --git a/f3s/tracing-demo/README-ROLLOUTS.md b/f3s/tracing-demo/README-ROLLOUTS.md
new file mode 100644
index 0000000..b038bf9
--- /dev/null
+++ b/f3s/tracing-demo/README-ROLLOUTS.md
@@ -0,0 +1,226 @@
+# Argo Rollouts - Quick Reference
+
+Progressive delivery (canary deployments) for the f3s cluster.
+
+## TL;DR - Get Started in 5 Minutes
+
+```bash
+# 1. Install controller
+cd /home/paul/git/conf/f3s/argo-rollouts
+just install
+
+# 2. Wait for ArgoCD sync (or force)
+argocd app sync argo-rollouts
+argocd app sync tracing-demo
+
+# 3. Verify setup
+cd /home/paul/git/conf/f3s/tracing-demo
+just rollout-status
+
+# 4. Run a demo (Terminal 1)
+just rollout-watch
+
+# 5. Trigger in another terminal (Terminal 2)
+kubectl patch rollout tracing-demo-frontend -n services \
+  --type='json' \
+  -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]'
+
+# 6. Watch progress in Terminal 1 (~90 seconds total)
+```
+
+Expected flow:
+- 0-15 sec: **33% traffic** to canary (1 new pod, 3 old pods)
+- 15-60 sec: **Monitor** (paused, observing canary health)
+- 60+ sec: **Auto-promote to 100%** (scales all 3 pods to new version)
+- ~90 sec: **Complete** (all 3 pods running new version)
+
+## Files Created
+
+### Setup & Installation
+- `argo-rollouts/Justfile` - Install/manage controller
+- `argo-rollouts/values.yaml` - Helm config
+- `argocd-apps/cicd/argo-rollouts.yaml` - ArgoCD app
+
+### Demo App Configuration
+- `tracing-demo/helm-chart/templates/frontend-rollout.yaml` - Canary definition
+- `tracing-demo/Justfile` - New `just rollout-*` commands
+- `tracing-demo/rollout-demo.sh` - Demo automation script
+
+### Documentation
+- `ARGO-ROLLOUTS-SUMMARY.md` - **START HERE** - Full overview
+- `ROLLOUTS-SETUP.md` - **DETAILED GUIDE** - 5 demo scenarios
+- `ROLLOUTS-CHECKLIST.md` - **DEPLOYMENT CHECKLIST** - Step-by-step
+- `tracing-demo/ROLLOUTS-DEMO.md` - Technical walkthrough
+- `README-ROLLOUTS.md` - This file
+
+## Why Canary Deployments?
+
+**Old way (Deployment)**:
+- 2 old pods → removed
+- 2 new pods → created
+- ~5 seconds of potential traffic loss
+- No way to validate before 100% rollout
+
+**New way (Rollout with Canary)**:
+- 3 old pods → 3 old + 1 new (33% traffic to canary)
+- Observe for 1 minute
+- If healthy → automatically promote all 3 pods to new version
+- If unhealthy → abort, revert to 3 old pods
+- Zero downtime, validated before full rollout
+
+## Common Commands
+
+```bash
+cd /home/paul/git/conf/f3s/tracing-demo
+
+# Watch rollout progress (real-time)
+just rollout-watch
+
+# Check current status
+just rollout-status
+
+# Detailed info
+just rollout-info
+
+# Abort and rollback (prevents auto-promotion)
+just rollout-abort
+
+# View history
+just rollout-history
+
+# Generate load during rollout
+just load-test
+```
+
+## What Happens During Canary
+
+### Step 1: 33% Traffic (0-15 seconds)
+```
+Frontend Service
+├── Stable ReplicaSet (old version): 3 pods → receives 67% traffic
+└── Canary ReplicaSet (new version): 1 pod → receives 33% traffic
+```
+
+Monitor during this phase:
+- Error rates
+- Response latency
+- Logs and traces
+- Prometheus metrics
+
+### Step 2: Pause (15-60 seconds)
+```
+Service pauses traffic shift, monitoring canary health:
+- Auto-promotion after 1 minute if healthy
+- Or abort: kubectl argo rollouts abort ... to stop
+```
+
+### Step 3: 100% Traffic (60+ seconds)
+```
+Frontend Service
+├── Stable ReplicaSet (new version): 3 pods → receives 100% traffic
+└── Canary ReplicaSet (old version): 0 pods → terminated
+```
+
+## Architecture
+
+```
+Git Commit (new image)
+    ↓
+Git Server (conf.git)
+    ↓
+ArgoCD detects change
+    ↓
+Updates Rollout resource
+    ↓
+Argo Rollouts Controller
+    ↓
+    ├─→ Scales Canary ReplicaSet (1 new pod)
+    ├─→ Frontend Service routes 33/67 traffic
+    ├─→ Monitors health/metrics for 1 minute
+    └─→ Auto-promotes if healthy
+        ├─→ If healthy: Scale to 3 new, remove old
+        └─→ If abort: Remove canary, keep 3 old
+```
+
+## Demo Scenarios
+
+See `ROLLOUTS-SETUP.md` for complete walkthrough of:
+
+1. **Basic Canary** - Watch 50% → 100% progression
+2. **Manual Promotion** - Skip waiting with `just rollout-promote`
+3. **Abort/Rollback** - Fail canary and revert
+4. **Prometheus Monitoring** - Track metrics during rollout
+5. **GitOps Flow** - Commit code, watch auto-rollout
+
+## Monitoring
+
+### Command-line
+```bash
+# Real-time watch
+kubectl argo rollouts get rollout tracing-demo-frontend -n services --watch
+
+# Check metrics
+kubectl top pods -n services -l app=tracing-demo-frontend
+```
+
+### Grafana
+https://grafana.f3s.buetow.org
+
+1. Explore → Tempo
+2. Query: `{ resource.service.name = "frontend" }`
+3. See traces from old and new versions
+
+### Prometheus
+```bash
+# Port-forward
+kubectl port-forward -n monitoring svc/prometheus 9090:9090
+# Open http://localhost:9090
+
+# Query pod status
+kube_pod_status_phase{namespace="services", pod=~".*frontend.*"}
+```
+
+## Troubleshooting
+
+**Controller not running?**
+```bash
+kubectl get pods -n cicd -l app.kubernetes.io/name=argo-rollouts
+kubectl logs -n cicd -l app.kubernetes.io/name=argo-rollouts
+```
+
+**Rollout stuck?**
+```bash
+kubectl describe rollout tracing-demo-frontend -n services
+kubectl get pods -n services -l app=tracing-demo-frontend
+```
+
+**Need plugin?**
+```bash
+curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
+sudo install -m 755 kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
+```
+
+## Next Steps
+
+1. Complete setup using `ROLLOUTS-CHECKLIST.md`
+2. Run demo scenarios from `ROLLOUTS-SETUP.md`
+3. Share with team
+4. Optional: Add Istio for advanced traffic routing
+5. Optional: Deploy Flagger for automated analysis
+6. Migrate other services to Rollout
+
+## Key Resources
+
+| File | Purpose |
+|------|---------|
+| `ARGO-ROLLOUTS-SUMMARY.md` | Architecture & what was created |
+| `ROLLOUTS-SETUP.md` | Complete setup & 5 demo scenarios |
+| `ROLLOUTS-CHECKLIST.md` | Step-by-step deployment |
+| `tracing-demo/ROLLOUTS-DEMO.md` | Technical details & troubleshooting |
+| `argo-rollouts/README.md` | Controller installation guide |
+
+## Support
+
+- Argo Rollouts Docs: https://argoproj.github.io/argo-rollouts/
+- Canary Strategy: https://argoproj.github.io/argo-rollouts/features/canary/
+- Kubectl Plugin: https://argoproj.github.io/argo-rollouts/getting-started/#using-kubectl-with-argo-rollouts
diff --git a/f3s/tracing-demo/ROLLOUTS-CHECKLIST.md b/f3s/tracing-demo/ROLLOUTS-CHECKLIST.md
new file mode 100644
index 0000000..b475f2d
--- /dev/null
+++ b/f3s/tracing-demo/ROLLOUTS-CHECKLIST.md
@@ -0,0 +1,222 @@
+# Argo Rollouts Deployment Checklist
+
+Quick checklist for deploying and testing Argo Rollouts with canary demo.
+
+## Installation
+
+- [ ] Read `ARGO-ROLLOUTS-SUMMARY.md` - understand what was created
+- [ ] Ensure kubectl access to f3s cluster
+- [ ] Ensure ArgoCD is running
+- [ ] Navigate to `/home/paul/git/conf/f3s/argo-rollouts`
+- [ ] Run `just install`
+- [ ] Verify controller: `kubectl get pods -n cicd -l app.kubernetes.io/name=argo-rollouts`
+- [ ] Verify CRD: `kubectl get crd | grep rollout`
+- [ ] (Optional) Install plugin:
+  ```bash
+  curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
+  chmod +x kubectl-argo-rollouts-linux-amd64
+  sudo install -m 755 kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
+  kubectl argo rollouts version
+  ```
+
+## ArgoCD Integration
+
+- [ ] Push changes to git-server:
+  ```bash
+  cd /home/paul/git/conf/f3s
+  git add -A && git commit -m "feat: add Argo Rollouts"
+  git push r0 master
+  ```
+- [ ] Verify ArgoCD app:
+  ```bash
+  kubectl get application argo-rollouts -n cicd
+  argocd app get argo-rollouts
+  ```
+- [ ] Verify tracing-demo app:
+  ```bash
+  kubectl get application tracing-demo -n cicd
+  argocd app get tracing-demo
+  ```
+
+## Rollout Verification
+
+- [ ] Check rollout exists: `kubectl get rollout tracing-demo-frontend -n services`
+- [ ] Verify status: `kubectl describe rollout tracing-demo-frontend -n services`
+- [ ] Expected: `Status: Healthy` with `3/3 replicas` in stable state
+- [ ] Check pods: `kubectl get pods -n services -l app=tracing-demo-frontend`
+- [ ] All 3 pods should be `Running`
+
+## Demo: Basic Canary Rollout
+
+**Expected: 0-15s: canary starting, 15-60s: observing, 60-90s: promoting**
+
+### Terminal 1: Watch Rollout
+```bash
+cd /home/paul/git/conf/f3s/tracing-demo
+just rollout-watch
+```
+- [ ] Command runs and connects to cluster
+- [ ] Waiting for rollout to start
+
+### Terminal 2: Trigger Rollout
+```bash
+kubectl patch rollout tracing-demo-frontend -n services \
+  --type='json' \
+  -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]'
+```
+- [ ] Patch command successful
+- [ ] Terminal 1 shows change immediately
+
+### Terminal 1: Observe Progress
+- [ ] See `Step: 0/3, SetWeight: 33`
+- [ ] 1 canary pod becoming ready
+- [ ] 3 stable pods still running
+- [ ] After ~15 sec: canary pod ready
+- [ ] After ~60 sec: auto-promotion starts
+- [ ] After ~90 sec: all 3 pods running new version
+- [ ] Status shows `Healthy`
+
+## Demo: Abort/Rollback
+
+**Expected: Stop rollout and keep old version running**
+
+### Terminal 1: Watch Rollout
+```bash
+just rollout-watch
+```
+
+### Terminal 2: Trigger Rollout
+```bash
+kubectl patch rollout tracing-demo-frontend -n services \
+  --type='json' \
+  -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V2","value":"'$(date +%s)'"}}]'
+```
+
+### Terminal 3: Abort at Canary Step (after 20 seconds)
+```bash
+cd /home/paul/git/conf/f3s/tracing-demo
+just rollout-abort
+```
+- [ ] Abort command accepted
+- [ ] Terminal 1 shows `Status: Aborted`
+- [ ] Canary pods terminate
+- [ ] Old 3 pods continue running
+- [ ] Verify with: `just rollout-status`
+
+## Demo: Load Testing
+
+**Expected: Generate traffic while rollout happens**
+
+### Terminal 1: Watch Rollout
+```bash
+just rollout-watch
+```
+
+### Terminal 2: Start Load Test
+```bash
+just load-test &
+```
+- [ ] Requests being sent
+
+### Terminal 3: Trigger Rollout
+```bash
+kubectl patch rollout tracing-demo-frontend -n services \
+  --type='json' \
+  -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V3","value":"'$(date +%s)'"}}]'
+```
+- [ ] Rollout progresses with active traffic
+- [ ] Both old and new pods serve requests during canary phase
+
+## Monitoring
+
+- [ ] Check status: `kubectl argo rollouts status tracing-demo-frontend -n services`
+- [ ] Detailed info: `kubectl argo rollouts describe rollout tracing-demo-frontend -n services`
+- [ ] Pod details: `kubectl get pods -n services -l app=tracing-demo-frontend -o wide`
+- [ ] View logs: `just logs-frontend`
+- [ ] View history: `just rollout-history`
+
+## Grafana (Optional)
+
+- [ ] Open Grafana: https://grafana.f3s.buetow.org
+- [ ] Navigate to Explore → Tempo datasource
+- [ ] Query: `{ resource.service.name = "frontend" }`
+- [ ] See traces from old and new versions during rollout
+
+## Integration with Git (GitOps)
+
+- [ ] Edit rollout config:
+  ```bash
+  nano /home/paul/git/conf/f3s/tracing-demo/helm-chart/templates/frontend-rollout.yaml
+  ```
+- [ ] Change any settings (e.g., duration, setWeight)
+- [ ] Commit and push:
+  ```bash
+  git add -A && git commit -m "chore: adjust canary settings"
+  git push r0 master
+  ```
+- [ ] ArgoCD auto-syncs within 3 minutes (or force):
+  ```bash
+  kubectl annotate application tracing-demo -n cicd argocd.argoproj.io/refresh=normal --overwrite
+  ```
+- [ ] New settings take effect on next rollout trigger
+
+## Post-Demo
+
+- [ ] Abort any stuck rollouts: `just rollout-abort`
+- [ ] Verify stable state: `just rollout-status` shows `Healthy`
+- [ ] Review documentation:
+  - [ ] `ARGO-ROLLOUTS-SUMMARY.md` - architecture
+  - [ ] `ROLLOUTS-SETUP.md` - detailed scenarios
+  - [ ] `README-ROLLOUTS.md` - quick reference
+  - [ ] `tracing-demo/ROLLOUTS-DEMO.md` - technical details
+
+## Troubleshooting
+
+### Controller not running
+```bash
+kubectl get pods -n cicd -l app.kubernetes.io/name=argo-rollouts
+kubectl logs -n cicd -l app.kubernetes.io/name=argo-rollouts
+```
+- [ ] Pod running and ready
+
+### Rollout not deployed
+```bash
+kubectl get rollout tracing-demo-frontend -n services
+kubectl describe rollout tracing-demo-frontend -n services
+```
+- [ ] Check events section for errors
+
+### Canary pods in ImagePullBackoff
+- [ ] Use env var patch instead (don't change image tag):
+  ```bash
+  kubectl patch rollout tracing-demo-frontend -n services \
+    --type='json' \
+    -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]'
+  ```
+
+### Rollout stuck in Progressing
+```bash
+kubectl describe rollout tracing-demo-frontend -n services
+kubectl get pods -n services -l app=tracing-demo-frontend
+```
+- [ ] Check pod readiness probes
+- [ ] Check pod resource requests/limits
+- [ ] Check controller logs
+
+## Next Steps
+
+- [ ] Run through all demo scenarios multiple times
+- [ ] Modify rollout settings and observe behavior
+- [ ] Monitor with Prometheus/Grafana
+- [ ] Extend to other services (middleware, backend)
+- [ ] Optional: Install Istio for advanced traffic routing
+- [ ] Optional: Deploy Flagger for automated analysis
+
+---
+
+**Setup Complete When:**
+- ✅ Controller running in `cicd` namespace
+- ✅ Rollout deployed in `services` namespace
+- ✅ One full demo executed (0-90 seconds)
+- ✅ Can abort and retry
+- ✅ Team trained on canary deployments
diff --git a/f3s/tracing-demo/ROLLOUTS-FILE-TREE.txt b/f3s/tracing-demo/ROLLOUTS-FILE-TREE.txt
new file mode 100644
index 0000000..6c85754
--- /dev/null
+++ b/f3s/tracing-demo/ROLLOUTS-FILE-TREE.txt
@@ -0,0 +1,183 @@
+/home/paul/git/conf/f3s/
+├── README-ROLLOUTS.md                    ← ENTRY POINT (quick reference)
+├── ARGO-ROLLOUTS-SUMMARY.md              ← Full architecture & overview
+├── ROLLOUTS-SETUP.md                     ← Detailed setup + 5 scenarios
+├── ROLLOUTS-CHECKLIST.md                 ← Step-by-step deployment
+├── ROLLOUTS-FILE-TREE.txt                ← This file
+│
+├── argo-rollouts/                        ← NEW: Argo Rollouts Controller
+│   ├── Justfile                          ← Install/upgrade/uninstall
+│   ├── values.yaml                       ← Helm configuration
+│   └── README.md                         ← Controller-specific guide
+│
+├── argocd-apps/
+│   ├── cicd/
+│   │   ├── git-server.yaml
+│   │   └── argo-rollouts.yaml            ← NEW: Controller app
+│   │
+│   └── services/
+│       ├── tracing-demo.yaml             ← UPDATED: Deployment → Rollout
+│       └── ... (other apps)
+│
+├── tracing-demo/
+│   ├── README.md
+│   ├── Justfile                          ← UPDATED: Added rollout commands
+│   ├── ROLLOUTS-DEMO.md                  ← NEW: Technical walkthrough
+│   ├── rollout-demo.sh                   ← NEW: Demo automation
+│   │
+│   └── helm-chart/
+│       ├── Chart.yaml
+│       └── templates/
+│           ├── frontend-rollout.yaml     ← NEW: Canary rollout definition
+│           ├── frontend-deployment.yaml  ← KEPT: For reference
+│           ├── middleware-deployment.yaml ← (unchanged)
+│           ├── backend-deployment.yaml   ← (unchanged)
+│           ├── frontend-service.yaml
+│           ├── middleware-service.yaml
+│           ├── backend-service.yaml
+│           └── ingress.yaml
+│
+└── ... (other apps unchanged)
+
+
+═══════════════════════════════════════════════════════════════════════════
+
+INSTALLATION SUMMARY
+═══════════════════════════════════════════════════════════════════════════
+
+Step 1: Install Controller
+  cd /home/paul/git/conf/f3s/argo-rollouts
+  just install
+
+Step 2: Verify ArgoCD
+  argocd app sync argo-rollouts
+  argocd app sync tracing-demo
+
+Step 3: Watch Demo
+  cd /home/paul/git/conf/f3s/tracing-demo
+  just rollout-watch
+
+Step 4: Trigger Rollout (in another terminal)
+  kubectl patch rollout tracing-demo-frontend -n services \
+    --type='json' \
+    -p='[{"op":"replace","path":"/spec/template/spec/containers/0/image","value":"registry.lan.buetow.org:30001/tracing-demo-frontend:latest"}]'
+
+═══════════════════════════════════════════════════════════════════════════
+
+DOCUMENTATION ROADMAP
+═══════════════════════════════════════════════════════════════════════════
+
+NEW TO ARGO ROLLOUTS?
+  1. Read: README-ROLLOUTS.md (3 min)
+  2. Read: ARGO-ROLLOUTS-SUMMARY.md (10 min)
+  3. Follow: ROLLOUTS-CHECKLIST.md (step-by-step)
+
+WANT DETAILED GUIDE?
+  → ROLLOUTS-SETUP.md
+    - Complete setup instructions
+    - 5 demo scenarios with expected output
+    - Monitoring dashboards
+    - Advanced patterns
+
+DOING THE DEPLOYMENT?
+  → ROLLOUTS-CHECKLIST.md
+    - Pre-deployment checks
+    - Installation steps
+    - Verification
+    - Troubleshooting
+
+TROUBLESHOOTING?
+  → ROLLOUTS-SETUP.md → Troubleshooting section
+  → argo-rollouts/README.md
+  → tracing-demo/ROLLOUTS-DEMO.md
+
+═══════════════════════════════════════════════════════════════════════════
+
+KEY FILES EXPLAINED
+═══════════════════════════════════════════════════════════════════════════
+
+argo-rollouts/Justfile
+  - Automates installation of Argo Rollouts controller
+  - Commands: install, upgrade, uninstall, status, logs
+  - Deploys to: cicd namespace
+
+argo-rollouts/values.yaml
+  - Helm chart configuration for Argo Rollouts
+  - Sets resource limits, metrics, replicas
+
+argocd-apps/cicd/argo-rollouts.yaml
+  - ArgoCD Application resource
+  - Manages controller installation via GitOps
+  - Auto-syncs when argo-rollouts/ changes in git
+
+tracing-demo/helm-chart/templates/frontend-rollout.yaml
+  - Replaces frontend-deployment.yaml
+  - Defines canary strategy:
+    * Step 1: 50% traffic
+    * Step 2: 2-minute pause
+    * Step 3: 100% promotion
+  - Keeps same pods, volumes, env vars as Deployment
+
+tracing-demo/Justfile (updated)
+  - New commands for rollout management
+  - just rollout-watch
+  - just rollout-status
+  - just rollout-promote
+  - just rollout-abort
+  - just rollout-history
+
+tracing-demo/rollout-demo.sh
+  - Automation script for demo
+  - Checks prerequisites
+  - Guides through demo workflow
+  - Can be extended for CI/CD
+
+═══════════════════════════════════════════════════════════════════════════
+
+WHAT CHANGED IN EXISTING FILES
+═══════════════════════════════════════════════════════════════════════════
+
+tracing-demo/Justfile
+  [+] 8 new rollout commands
+  [-] No breaking changes to existing commands
+
+tracing-demo/helm-chart/templates/frontend-deployment.yaml
+  [~] Still exists (for reference, not deployed)
+  [→] Replaced by frontend-rollout.yaml in deployment
+
+argocd-apps/services/tracing-demo.yaml
+  [+] RespectIgnoreDifferences=true sync option
+  [-] No other changes (points to same Helm chart)
+
+═══════════════════════════════════════════════════════════════════════════ + +WHAT DID NOT CHANGE +═══════════════════════════════════════════════════════════════════════════ + +✓ Middleware & Backend services remain Deployments +✓ All service definitions (frontend, middleware, backend services) +✓ Ingress configuration +✓ All other apps (audiobookshelf, miniflux, etc.) +✓ ArgoCD configuration & installation +✓ Prometheus/Grafana setup + +═══════════════════════════════════════════════════════════════════════════ + +HOW TO NAVIGATE THIS +═══════════════════════════════════════════════════════════════════════════ + +If you want to... See... +──────────────────────────────────────────────────────────────────────────── +Understand what was created ARGO-ROLLOUTS-SUMMARY.md +Get started quickly README-ROLLOUTS.md +Deploy step-by-step ROLLOUTS-CHECKLIST.md +See detailed scenarios & examples ROLLOUTS-SETUP.md +Troubleshoot issues ROLLOUTS-SETUP.md (Troubleshooting section) +Learn technical details tracing-demo/ROLLOUTS-DEMO.md +Install the controller argo-rollouts/Justfile + argo-rollouts/README.md +See the rollout definition tracing-demo/helm-chart/templates/frontend-rollout.yaml +Run a demo tracing-demo/rollout-demo.sh or just rollout-watch +Monitor during rollout Prometheus/Grafana (see ROLLOUTS-SETUP.md) +Integrate with CI/CD See ROLLOUTS-SETUP.md section "GitOps Flow" + +═══════════════════════════════════════════════════════════════════════════ diff --git a/f3s/tracing-demo/ROLLOUTS-SETUP.md b/f3s/tracing-demo/ROLLOUTS-SETUP.md new file mode 100644 index 0000000..0ea965c --- /dev/null +++ b/f3s/tracing-demo/ROLLOUTS-SETUP.md @@ -0,0 +1,373 @@ +# Argo Rollouts Setup and Demo Guide + +Complete setup and demonstration of Argo Rollouts with the tracing-demo application. Canary strategy: 33% traffic (1 pod) for 1 minute, then auto-promote to 100%. + +## Quick Setup + +### 1. 
Install Argo Rollouts Controller + +```bash +cd /home/paul/git/conf/f3s/argo-rollouts +just install +``` + +Verify installation: +```bash +kubectl get pods -n cicd -l app.kubernetes.io/name=argo-rollouts +kubectl get crd | grep rollout +``` + +### 2. Install kubectl Plugin (Optional but Recommended) + +```bash +curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64 +chmod +x kubectl-argo-rollouts-linux-amd64 +sudo install -m 755 kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts +``` + +Verify: +```bash +kubectl argo rollouts version +``` + +### 3. Sync ArgoCD with New Applications + +```bash +argocd app sync argo-rollouts +argocd app sync tracing-demo +``` + +### 4. Verify Rollout is Deployed + +```bash +kubectl get rollout tracing-demo-frontend -n services +kubectl describe rollout tracing-demo-frontend -n services +``` + +Expected status: `Healthy` with `3/3 replicas` in stable state. + +## Quick Demo (90 seconds) + +### Terminal 1 - Watch Progress + +```bash +cd /home/paul/git/conf/f3s/tracing-demo +just rollout-watch +``` + +Or use the kubectl command directly: +```bash +kubectl argo rollouts get rollout tracing-demo-frontend -n services --watch +``` + +### Terminal 2 - Trigger Rollout + +Wait 10 seconds for Terminal 1 to start watching, then trigger: + +```bash +kubectl patch rollout tracing-demo-frontend -n services \ + --type='json' \ + -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]' +``` + +### Watch the Timeline + +**Terminal 1 will show:** + +``` +Step: 0/3 +SetWeight: 33 +Canary: 1 pod (new version) - starting +Stable: 3 pods (old version) - handling requests +``` + +→ After 15 seconds, canary pod becomes ready: + +``` +Step: 1/3 +SetWeight: 33 +Canary: 1 pod (new version) - ready, receiving 33% traffic +Stable: 3 pods (old version) - receiving 67% traffic +``` + +→ After ~60 seconds, auto-promotion begins: 
+ +``` +Step: 2/3 +SetWeight: 100 +Canary scaling → Stable +``` + +→ After ~90 seconds, complete: + +``` +Status: Healthy +Replicas: 3/3 all running new version +``` + +## Demo Scenarios + +### Scenario 1: Observe the Full Rollout + +Just follow the "Quick Demo" above. Watch all three steps progress automatically over 90 seconds. + +### Scenario 2: Abort Rollout (Simulate Failure) + +**Terminal 1**: Watch the rollout +```bash +just rollout-watch +``` + +**Terminal 2**: Trigger rollout +```bash +kubectl patch rollout tracing-demo-frontend -n services \ + --type='json' \ + -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]' +``` + +**Terminal 3 (while at step 1)**: Abort the rollout +```bash +cd /home/paul/git/conf/f3s/tracing-demo +just rollout-abort +``` + +Result: +- Canary pods terminate +- Old 3 pods continue running +- Status shows "Aborted" + +Verify: +```bash +just rollout-status +``` + +### Scenario 3: Load Testing During Rollout + +**Terminal 1**: Watch rollout +```bash +just rollout-watch +``` + +**Terminal 2**: Start load test +```bash +just load-test & +``` + +**Terminal 3**: Trigger rollout +```bash +kubectl patch rollout tracing-demo-frontend -n services \ + --type='json' \ + -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]' +``` + +Load test will hit both old and new pods during the 1-minute canary window. + +### Scenario 4: Check Logs During Rollout + +**Terminal 1**: Watch rollout +```bash +just rollout-watch +``` + +**Terminal 2**: Trigger rollout +```bash +kubectl patch rollout tracing-demo-frontend -n services \ + --type='json' \ + -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]' +``` + +**Terminal 3**: Watch logs +```bash +kubectl logs -n services -l app=tracing-demo-frontend -f --tail=20 +``` + +See logs from both old and new pods. 
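The trigger used in the scenarios above is a JSON Patch (RFC 6902) with a single `add` operation. It appends a throwaway environment variable, which changes the pod template and therefore starts a new rollout. A minimal Python sketch of the same payload follows. The helper name `build_trigger_patch` is made up for illustration, and the sketch assumes the first container already defines an `env` list, because appending via `/env/-` only works when that array exists.

```python
import json
import time


def build_trigger_patch(container_index=0, now=None):
    """Return the JSON Patch string passed to `kubectl patch --type=json -p`."""
    # Use the current Unix time unless a fixed value is supplied (handy for tests).
    stamp = int(time.time()) if now is None else int(now)
    operations = [
        {
            "op": "add",
            # "-" means "append to the end of the existing env array".
            "path": f"/spec/template/spec/containers/{container_index}/env/-",
            "value": {"name": "ROLLOUT_V", "value": str(stamp)},
        }
    ]
    # Compact separators mirror the one-line form used in the kubectl command.
    return json.dumps(operations, separators=(",", ":"))


if __name__ == "__main__":
    print(build_trigger_patch())
```

The printed string can be passed as the `-p` argument of the `kubectl patch rollout ... --type='json'` command shown in the scenarios.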
+ +### Scenario 5: Monitor via Grafana Tempo (Distributed Tracing) + +**Terminal 1**: Watch rollout +```bash +just rollout-watch +``` + +**Terminal 2**: Trigger rollout +```bash +kubectl patch rollout tracing-demo-frontend -n services \ + --type='json' \ + -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]' +``` + +**Terminal 3**: Open Grafana +1. Navigate to https://grafana.f3s.buetow.org +2. Go to Explore → Select "Tempo" datasource +3. Query: `{ resource.service.name = "frontend" }` +4. See traces from both old and new versions during canary phase + +## Timeline Breakdown + +| Time | Event | Status | +|------|-------|--------| +| 0s | Trigger rollout | Rollout starts | +| 0-5s | Canary pod created | `Step 0/3: SetWeight 33` | +| 5-15s | Canary pod becoming ready | Still not ready | +| 15s | Canary pod ready | `Step 1/3: SetWeight 33, canary ready` | +| 15-60s | Observing canary | Requests split 67/33 (old/new) | +| 60s | Auto-promotion triggered | `Step 2/3: SetWeight 100` | +| 60-70s | Scaling new pods | Canary → Stable | +| 70-80s | Terminating old pods | Old pods scaling down | +| ~90s | Complete | `Status: Healthy, 3/3 replicas` | + +## Monitoring During Rollout + +### kubectl Commands + +Real-time status: +```bash +kubectl argo rollouts get rollout tracing-demo-frontend -n services --watch +``` + +Check specific details: +```bash +kubectl describe rollout tracing-demo-frontend -n services +kubectl argo rollouts get rollout tracing-demo-frontend -n services # one-shot view, lists revisions +``` + +Pod status: +```bash +kubectl get pods -n services -l app=tracing-demo-frontend -o wide +``` + +### Prometheus Metrics + +```bash +# Port-forward Prometheus +kubectl port-forward -n monitoring svc/prometheus 9090:9090 +``` + +Then query: +```promql +# Pod counts during rollout +kube_replicaset_replicas{replicaset=~"tracing-demo-frontend.*"} + +# Pod status +kube_pod_status_phase{namespace="services", 
pod=~"tracing-demo-frontend.*"} + +# Pod age (shows which are old vs new) +time() - kube_pod_created{namespace="services", pod=~"tracing-demo-frontend.*"} +``` + +### Grafana Dashboards + +1. Open Grafana: https://grafana.f3s.buetow.org +2. Explore → Tempo datasource +3. Query: `{ resource.service.name = "frontend" }` +4. See traces from old and new versions +5. Notice latency/error differences during rollout + +## Rollout Configuration + +Located in: `/home/paul/git/conf/f3s/tracing-demo/helm-chart/templates/frontend-rollout.yaml` + +Key settings: +```yaml +replicas: 3 # 3 pods total +strategy: + canary: + steps: + - setWeight: 33 # Send 1 pod (33%) to canary + - pause: + duration: 1m # Wait 1 minute, then auto-promote + - setWeight: 100 # Promote all to new version +``` + +To modify pause duration: +```bash +# Edit the file +nano /home/paul/git/conf/f3s/tracing-demo/helm-chart/templates/frontend-rollout.yaml + +# Change duration: 1m to duration: 5m (for example) +# Then commit and push +git add -A && git commit -m "chore: extend canary pause to 5 minutes" +git push r0 master +``` + +ArgoCD will auto-sync the new rollout configuration. + +## Troubleshooting + +### Rollout shows "ErrImagePull" on canary pod + +This happens if using an image tag that doesn't exist. 
The env var patch approach forces a rollout without changing the image, so use: + +```bash +kubectl patch rollout tracing-demo-frontend -n services \ + --type='json' \ + -p='[{"op":"add","path":"/spec/template/spec/containers/0/env/-","value":{"name":"ROLLOUT_V","value":"'$(date +%s)'"}}]' +``` + +### Rollout stuck in "Progressing" + +Check pod status: +```bash +kubectl describe rollout tracing-demo-frontend -n services +kubectl get pods -n services -l app=tracing-demo-frontend +``` + +Check controller logs: +```bash +kubectl logs -n cicd -l app.kubernetes.io/name=argo-rollouts --tail=50 +``` + +### Controller not running + +```bash +kubectl get pods -n cicd -l app.kubernetes.io/name=argo-rollouts +kubectl logs -n cicd -l app.kubernetes.io/name=argo-rollouts +``` + +### Auto-promotion not happening + +Verify pause duration is set: +```bash +kubectl get rollout tracing-demo-frontend -n services -o yaml | grep -A 5 "pause:" +``` + +## Advanced: Modify Canary Parameters + +### Increase observation time to 5 minutes + +```bash +# Edit rollout YAML +nano /home/paul/git/conf/f3s/tracing-demo/helm-chart/templates/frontend-rollout.yaml + +# Change: +# - pause: +# duration: 1m +# To: +# - pause: +# duration: 5m + +git add -A && git commit -m "chore: extend canary pause to 5 minutes" +git push r0 master +``` + +### Reduce traffic weight to canary (more conservative) + +```yaml +steps: +- setWeight: 10 # Only 10% traffic (0.3 pods worth) +- pause: + duration: 2m # Observe longer +- setWeight: 100 +``` + +### Add health check analysis (requires an Argo Rollouts AnalysisTemplate or Flagger) + +For automated rollback based on error rate thresholds, see `/home/paul/git/conf/f3s/ROLLOUTS-SETUP.md` → "Advanced: Custom Analysis" section. 
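To see why `setWeight: 33` and `setWeight: 10` would both yield a single canary pod with `replicas: 3`, here is a rough Python sketch of the pod arithmetic. It assumes no traffic router is configured, so the weight is approximated by pod counts, and that the controller rounds both the canary and the stable count up. That rounding rule is an assumption about controller behaviour rather than something stated in this guide, and `maxSurge`/`maxUnavailable` limits are ignored. The name `canary_split` is hypothetical.

```python
def ceil_div(numerator, denominator):
    """Integer ceiling division, avoiding float rounding surprises."""
    return -(-numerator // denominator)


def canary_split(replicas, weight):
    """Return (canary_pods, stable_pods) for a canary weight given in percent."""
    if not 0 <= weight <= 100:
        raise ValueError("weight must be between 0 and 100")
    # Assumption: both sides are rounded up, so small weights still get one pod.
    canary = ceil_div(replicas * weight, 100)
    stable = ceil_div(replicas * (100 - weight), 100)
    return canary, stable


if __name__ == "__main__":
    for weight in (10, 33, 100):
        canary, stable = canary_split(3, weight)
        print(f"setWeight {weight}: {canary} canary, {stable} stable")
```

Under these assumptions the 33% step runs 1 new pod next to 3 old ones, which matches the watch output in the Quick Demo. With plain Service load balancing the canary would then see roughly one request in four rather than an exact 33%.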
+ +## References + +- [Argo Rollouts Canary Strategy](https://argoproj.github.io/argo-rollouts/features/canary/) +- [Argo Rollouts Best Practices](https://argoproj.github.io/argo-rollouts/best-practices/) +- [kubectl-argo-rollouts Plugin](https://argoproj.github.io/argo-rollouts/getting-started/#using-kubectl-with-argo-rollouts) +- [Flagger for Automated Analysis](https://flagger.app/) diff --git a/f3s/wireguardroaming-plan.md b/f3s/wireguardroaming-plan.md deleted file mode 100644 index 20240b6..0000000 --- a/f3s/wireguardroaming-plan.md +++ /dev/null @@ -1,528 +0,0 @@ -# Plan: Add Fedora Laptop (earth) and Android Phone (pixel7pro) as WireGuard VPN Clients - -## Overview -Add two new roaming clients (earth - Fedora laptop, pixel7pro - Android phone) to the existing WireGuard full-mesh VPN, connecting them to all 8 existing hosts (f0-f2, r0-r2, blowfish, fishfinger). - -## Background -- Current VPN: Full mesh of 8 hosts (3 FreeBSD, 3 Rocky Linux, 2 OpenBSD) -- VPN network: 192.168.2.0/24 -- Mesh generator: `/home/paul/git/wireguardmeshgenerator/` -- Generator creates configs and deploys via SSH to remote hosts -- Reference: https://foo.zone/gemfeed/2025-05-11-f3s-kubernetes-with-freebsd-part-5.html - -## Challenge: Roaming Client Support -The current generator doesn't properly handle roaming clients (devices behind NAT that need PersistentKeepalive to all peers). The existing logic only sets PersistentKeepalive for LAN-to-internet connections, but roaming clients need it for ALL connections to maintain NAT traversal. - -## Implementation Steps - -### 1. 
Modify WireGuard Mesh Generator to Support Roaming Clients - -**File:** `/home/paul/git/wireguardmeshgenerator/wireguardmeshgenerator.rb` - -**Changes needed in the `WireguardConfig#peers` method (lines 149-163):** - -Current logic: -```ruby -keepalive = in_lan && !peer_in_lan -``` - -New logic should detect roaming clients (hosts with neither 'lan' nor 'internet' keys) and enable keepalive for all their peer connections: - -```ruby -# Detect if current host is a roaming client (no lan or internet section) -is_roaming = !hosts[myself].key?('lan') && !hosts[myself].key?('internet') - -# Set keepalive: LAN hosts connecting to internet hosts, OR roaming clients connecting to anyone -keepalive = is_roaming || (in_lan && !peer_in_lan) -``` - -**Alternative simpler approach:** -For roaming clients, set keepalive for all peers since they're always behind NAT: -```ruby -# Check if current host is roaming (no fixed location) -is_roaming = !hosts[myself].key?('lan') && !hosts[myself].key?('internet') -keepalive = is_roaming || (in_lan && !peer_in_lan) -``` - -### 2. Add Laptop and Phone to YAML Configuration - -**File:** `/home/paul/git/wireguardmeshgenerator/wireguardmeshgenerator.yaml` - -Add two new host entries (after the existing 8 hosts): - -```yaml - earth: - os: Linux - wg0: - domain: 'wg0.wan.buetow.org' - ip: '192.168.2.200' - # Note: No 'lan' or 'internet' section = roaming client - # Note: No 'ssh' section = manual installation - - pixel7pro: - os: Android - wg0: - domain: 'wg0.wan.buetow.org' - ip: '192.168.2.201' - # Note: No 'lan' or 'internet' section = roaming client - # Note: No 'ssh' section = manual installation -``` - -**Key design decisions:** -- IP addresses: 192.168.2.200 (earth), 192.168.2.201 (pixel7pro) -- No `lan` or `internet` sections → identified as roaming clients -- No `ssh` section → configs will be manually installed (not via `rake install`) -- `os` field for documentation purposes - -### 3. 
Update All Hosts to Include New Clients - -**File:** `/home/paul/git/wireguardmeshgenerator/wireguardmeshgenerator.yaml` - -Each existing host (f0-f2, r0-r2, blowfish, fishfinger) will automatically include earth and pixel7pro as peers when configs are regenerated. No changes needed to existing host definitions. - -### 4. Generate New Configurations - -**Command:** -```bash -cd /home/paul/git/wireguardmeshgenerator -rake generate -``` - -This generates new `wg0.conf` files in `dist/` for all 10 hosts (8 existing + 2 new). - -**Expected output:** -``` -dist/ -├── blowfish/etc/wireguard/wg0.conf -├── earth/etc/wireguard/wg0.conf ← NEW -├── f0/etc/wireguard/wg0.conf -├── f1/etc/wireguard/wg0.conf -├── f2/etc/wireguard/wg0.conf -├── fishfinger/etc/wireguard/wg0.conf -├── pixel7pro/etc/wireguard/wg0.conf ← NEW -├── r0/etc/wireguard/wg0.conf -├── r1/etc/wireguard/wg0.conf -└── r2/etc/wireguard/wg0.conf -``` - -### 5. Deploy Updated Configs to Existing Hosts - -**Command:** -```bash -cd /home/paul/git/wireguardmeshgenerator -rake install -``` - -OR selectively update only existing hosts: -```bash -ruby wireguardmeshgenerator.rb --install --hosts=f0,f1,f2,r0,r1,r2,blowfish,fishfinger -``` - -This updates all 8 existing hosts to include earth and pixel7pro in their peer lists. The script will SSH to each host, upload the config, and reload WireGuard. - -### 6. Update /etc/hosts on All Participating Hosts - -Add DNS entries for the new VPN clients to all hosts in the mesh for easier access. 
- -**On each of the 8 existing hosts (f0-f2, r0-r2, blowfish, fishfinger):** - -Add these lines to `/etc/hosts`: -``` -192.168.2.200 earth.wg0.wan.buetow.org earth -192.168.2.201 pixel7pro.wg0.wan.buetow.org pixel7pro -``` - -**Manual approach:** -```bash -# On each host (f0, f1, f2, r0, r1, r2, blowfish, fishfinger) -echo "192.168.2.200 earth.wg0.wan.buetow.org earth" | sudo tee -a /etc/hosts -echo "192.168.2.201 pixel7pro.wg0.wan.buetow.org pixel7pro" | sudo tee -a /etc/hosts -``` - -**On earth (laptop), add entries for all mesh hosts:** -```bash -# Add to /etc/hosts -echo "# WireGuard mesh hosts" | sudo tee -a /etc/hosts -echo "192.168.2.130 f0.wg0.wan.buetow.org f0" | sudo tee -a /etc/hosts -echo "192.168.2.131 f1.wg0.wan.buetow.org f1" | sudo tee -a /etc/hosts -echo "192.168.2.132 f2.wg0.wan.buetow.org f2" | sudo tee -a /etc/hosts -echo "192.168.2.120 r0.wg0.wan.buetow.org r0" | sudo tee -a /etc/hosts -echo "192.168.2.121 r1.wg0.wan.buetow.org r1" | sudo tee -a /etc/hosts -echo "192.168.2.122 r2.wg0.wan.buetow.org r2" | sudo tee -a /etc/hosts -echo "192.168.2.110 blowfish.wg0.wan.buetow.org blowfish" | sudo tee -a /etc/hosts -echo "192.168.2.111 fishfinger.wg0.wan.buetow.org fishfinger" | sudo tee -a /etc/hosts -echo "192.168.2.201 pixel7pro.wg0.wan.buetow.org pixel7pro" | sudo tee -a /etc/hosts -``` - -**Note:** The WireGuard mesh generator doesn't automatically manage /etc/hosts, so this is a manual step. - -### 7. 
Install WireGuard on Fedora Laptop (earth) - -**Commands on earth:** -```bash -# Install WireGuard -sudo dnf install wireguard-tools - -# Copy generated config -sudo cp /home/paul/git/wireguardmeshgenerator/dist/earth/etc/wireguard/wg0.conf /etc/wireguard/ - -# Set proper permissions -sudo chmod 600 /etc/wireguard/wg0.conf - -# Enable and start -sudo systemctl enable --now wg-quick@wg0.service - -# Verify connection -sudo wg show -``` - -**Expected result:** -- Interface wg0 up with IP 192.168.2.200 -- Handshakes established with all 8 peers -- Can ping other hosts (e.g., `ping 192.168.2.130` for f0) - -### 8. Install WireGuard on Android Phone (pixel7pro) - -**Client:** Official WireGuard Android client from Google Play Store - -**Steps:** -1. Install the official WireGuard app from Google Play Store (https://play.google.com/store/apps/details?id=com.wireguard.android) -2. Transfer config file: - - Copy `/home/paul/git/wireguardmeshgenerator/dist/pixel7pro/etc/wireguard/wg0.conf` to phone - - OR generate QR code: `qrencode -t ansiutf8 < dist/pixel7pro/etc/wireguard/wg0.conf` -3. Import config into WireGuard app (either via file import or QR code scan) -4. 
Activate the tunnel - -**Expected result:** -- Tunnel shows as active in WireGuard app -- Status shows connected peers -- Can access VPN network (test with ping or accessing internal services) - -## Critical Files - -### To Modify -- `/home/paul/git/wireguardmeshgenerator/wireguardmeshgenerator.rb` (lines ~149-163) -- `/home/paul/git/wireguardmeshgenerator/wireguardmeshgenerator.yaml` (add laptop and phone entries) - -### Generated (Review) -- `/home/paul/git/wireguardmeshgenerator/dist/earth/etc/wireguard/wg0.conf` -- `/home/paul/git/wireguardmeshgenerator/dist/pixel7pro/etc/wireguard/wg0.conf` - -### Keys Generated -- `/home/paul/git/wireguardmeshgenerator/keys/earth/pub.key` -- `/home/paul/git/wireguardmeshgenerator/keys/earth/priv.key` -- `/home/paul/git/wireguardmeshgenerator/keys/pixel7pro/pub.key` -- `/home/paul/git/wireguardmeshgenerator/keys/pixel7pro/priv.key` -- `/home/paul/git/wireguardmeshgenerator/keys/psk/earth_*.key` (8 preshared keys) -- `/home/paul/git/wireguardmeshgenerator/keys/psk/pixel7pro_*.key` (8 preshared keys) - -## Verification - -### On earth (Laptop) -```bash -# Check interface status -sudo wg show - -# Verify connectivity to all hosts -for host in 130 131 132 120 121 122 110 111; do - ping -c1 192.168.2.$host && echo "✓ 192.168.2.$host reachable" -done - -# Test access to services (e.g., Prometheus) -curl http://192.168.2.130:9100/metrics # f0 node-exporter - -# Test hostname resolution -ping -c1 f0 -ping -c1 blowfish -``` - -### On pixel7pro (Phone) -- WireGuard app shows active tunnel -- Status shows recent handshakes with all 8 peers -- Can access internal services (test with browser to 192.168.2.120:30090 for Prometheus) - -### On Existing Hosts -```bash -# On any existing host (e.g., SSH to f0) -sudo wg show - -# Should see two new peers: -# - earth (192.168.2.200) -# - pixel7pro (192.168.2.201) - -# Test hostname resolution -ping -c1 earth -ping -c1 pixel7pro -``` - -## Configuration Details - -### earth (Laptop) Config 
Structure -``` -[Interface] -Address = 192.168.2.200 -PrivateKey = -ListenPort = 56709 - -[Peer] # f0 -PublicKey = -PresharedKey = -AllowedIPs = 192.168.2.130/32 -Endpoint = 192.168.1.130:56709 -PersistentKeepalive = 25 ← NEW: Enabled for roaming client - -[Peer] # f1 -... -(continues for all 8 peers) -``` - -### pixel7pro (Phone) Config Structure -Identical to earth, but with: -- Interface Address: 192.168.2.201 -- Different private key -- Different preshared keys - -## Notes - -1. **Roaming vs Fixed Clients:** - - Roaming clients have no `lan` or `internet` section in YAML - - They get PersistentKeepalive to ALL peers - - They have no incoming Endpoint (behind NAT) - -2. **Security:** - - Each peer relationship uses a unique preshared key - - Private keys are never transmitted - - Configs contain sensitive keys - protect them - -3. **Connection Behavior:** - - earth/pixel7pro will initiate connections to all peers - - If on same LAN as f0-f2/r0-r2, will use LAN IPs (192.168.1.x) - - If remote, will connect to blowfish/fishfinger public IPs, and LAN hosts will be unreachable (behind NAT) - -4. **Multiple Gateway Strategy:** - - With full mesh, earth/pixel7pro can reach services through any reachable peer - - If blowfish is down, can route through fishfinger - - If both internet gateways are down, no access (expected for roaming clients) - -5. 
**Git Repository:** - - The f3s repository doesn't contain WireGuard configs (managed separately) - - Changes are in wireguardmeshgenerator repo only - - Consider committing updated YAML and script to version control - -## Quick Reference Commands - -```bash -# Generate configs -cd /home/paul/git/wireguardmeshgenerator && rake generate - -# Deploy to all existing hosts -rake install - -# Or deploy to specific hosts -ruby wireguardmeshgenerator.rb --install --hosts=f0,f1,f2,r0,r1,r2,blowfish,fishfinger - -# Update /etc/hosts on existing hosts (run on each host) -echo "192.168.2.200 earth.wg0.wan.buetow.org earth" | sudo tee -a /etc/hosts -echo "192.168.2.201 pixel7pro.wg0.wan.buetow.org pixel7pro" | sudo tee -a /etc/hosts - -# Install on earth (laptop) -sudo cp dist/earth/etc/wireguard/wg0.conf /etc/wireguard/ -sudo chmod 600 /etc/wireguard/wg0.conf -sudo systemctl enable --now wg-quick@wg0.service - -# Add mesh hosts to earth's /etc/hosts - -# Check status -sudo wg show -``` - -## Failover Limitation and Solutions - -### The Problem - -WireGuard **does not support automatic failover** by design. When both peers (blowfish and fishfinger) are configured with `AllowedIPs = 0.0.0.0/0`, the following behavior occurs: - -1. The client establishes connection to one peer (typically the first to respond) -2. The client remains "sticky" to that peer as long as packets can be sent -3. Even when the active peer goes down, the client does not immediately switch to the backup peer -4. 
Detection of peer failure can take several minutes due to: - - PersistentKeepalive interval (25 seconds) - - Network timeout detection - - Lack of active health monitoring in WireGuard protocol - -**Test results:** -- Stopped WireGuard on fishfinger (doas ifconfig wg0 down) -- Phone continued showing fishfinger's IP (Netherlands) -- Blowfish showed old handshake (17+ minutes) -- No automatic failover occurred - -### Why WireGuard Doesn't Have Failover - -WireGuard's design philosophy prioritizes simplicity and security over complex features. The protocol intentionally avoids implementing: -- Active peer health monitoring -- Automatic peer selection logic -- Load balancing or failover mechanisms - -The official stance: failover should be handled at higher layers (routing protocols, external monitoring, load balancers). - -### Possible Solutions - -#### Option 1: Manual Failover (Simplest) -**Current state - accept the limitation:** -- Keep both peers configured in the client -- User manually disconnects and reconnects to trigger new peer selection -- Or switch between two saved configs (one with fishfinger primary, one with blowfish primary) - -**Pros:** -- Simple, no code changes needed -- Reliable once user intervenes - -**Cons:** -- Requires manual intervention -- Downtime until user notices and acts - -#### Option 2: Single Primary Peer (Recommended for reliability) -**Configure only one peer as primary:** -- Edit pixel7pro config to include only fishfinger (or only blowfish) -- Keep backup config file for manual switchover if needed -- User loads backup config if primary gateway fails - -**Implementation:** -```yaml -# In wireguardmeshgenerator.yaml, add to pixel7pro: -exclude_peers: - - blowfish # To use only fishfinger - # OR - - fishfinger # To use only blowfish -``` - -**Pros:** -- Clear primary/backup designation -- No routing conflicts -- Faster to troubleshoot - -**Cons:** -- Still requires manual intervention for failover -- Only one gateway used at a 
time - -#### Option 3: Split AllowedIPs (Partial redundancy) -**Divide IP space between peers:** -``` -[Peer] # blowfish -AllowedIPs = 0.0.0.0/1, 128.0.0.0/2, 192.0.0.0/3, ...::/0 - -[Peer] # fishfinger -AllowedIPs = 128.0.0.0/1 -``` - -**Pros:** -- Both peers actively used -- Provides load distribution -- Partial redundancy (if one fails, half of internet still works) - -**Cons:** -- Complex routing setup -- Not true failover (loses half of routes if one peer fails) -- DNS may fail if routed through dead peer - -#### Option 4: External Monitoring (Complex) -**Use external script/app to monitor and switch:** -- Background app on Android monitors peer health -- Automatically reconfigures WireGuard when failure detected -- Requires custom app development - -**Pros:** -- Truly automatic failover - -**Cons:** -- Complex implementation -- Requires additional software -- May drain battery -- Not officially supported - -### Recommended Approach - -For phone (pixel7pro): **Accept manual failover** with the current dual-peer configuration. - -**Reasoning:** -- Phone usage is typically interactive - user will notice connectivity issues quickly -- User can manually disconnect/reconnect WireGuard to trigger failover -- Keeps both gateways as options without complex scripts -- Simple and reliable - -For automation-critical use cases (servers, IoT): Use **Option 2** with monitoring that sends alerts, allowing quick manual intervention. 
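For Option 3, the split only behaves as intended if the two peers' `AllowedIPs` ranges do not overlap. A small Python sketch using the standard `ipaddress` module can check that. The two halves used here (`0.0.0.0/1` and `128.0.0.0/1`) are an illustrative IPv4-only assumption, not the generated configuration, and `find_overlaps` is a hypothetical helper.

```python
import ipaddress
from itertools import combinations


def find_overlaps(allowed_ips_by_peer):
    """Return (peer_a, net_a, peer_b, net_b) tuples for overlapping ranges."""
    entries = [
        (peer, ipaddress.ip_network(cidr))
        for peer, cidrs in allowed_ips_by_peer.items()
        for cidr in cidrs
    ]
    clashes = []
    for (peer_a, net_a), (peer_b, net_b) in combinations(entries, 2):
        # Only compare ranges of different peers and of the same IP version.
        if peer_a != peer_b and net_a.version == net_b.version and net_a.overlaps(net_b):
            clashes.append((peer_a, str(net_a), peer_b, str(net_b)))
    return clashes


# Split the IPv4 default route into two halves, one per gateway.
halves = [str(net) for net in ipaddress.ip_network("0.0.0.0/0").subnets(prefixlen_diff=1)]
split = {"blowfish": [halves[0]], "fishfinger": [halves[1]]}

if __name__ == "__main__":
    print(halves)
    print(find_overlaps(split))
    print(find_overlaps({"blowfish": ["128.0.0.0/2"], "fishfinger": ["128.0.0.0/1"]}))
```

The last call mirrors the Option 3 example, where `128.0.0.0/2` under blowfish sits inside fishfinger's `128.0.0.0/1`; a clean two-way split lists no range that falls within the other peer's half.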
- -### Current Configuration Status - -**pixel7pro config** (/home/paul/git/wireguardmeshgenerator/dist/pixel7pro/etc/wireguard/wg0.conf): -- Two peers: blowfish (23.88.35.144) and fishfinger (46.23.94.99) -- Both with AllowedIPs = 0.0.0.0/0, ::/0 -- Both with PersistentKeepalive = 25 -- Both with DNS = 1.1.1.1, 8.8.8.8 - -**Observed behavior:** -- Client prefers fishfinger (first peer listed in some WireGuard client implementations) -- Both peers maintain handshakes, but only one actively routes traffic -- No automatic switchover when active peer fails - -## Completion Status - -### Completed Tasks - -* ✓ **Automatic failover investigation**: Documented as limitation. WireGuard does not support automatic failover by design. Manual reconnection required when active gateway fails. - -* ✓ **OpenBSD NAT rules deployed via IaC**: - - Created `/home/paul/git/conf/frontends/etc/pf.conf.tpl` with WireGuard NAT rules - - Added Rex task 'pf' to deploy pf.conf to both frontends - - Deployed successfully to blowfish and fishfinger - - Both gateways now have consistent firewall rules managed via IaC - - Committed to conf repo (commit 99a91d4) - -* ✓ **WireGuard tunnel works on earth**: - - Config installed at /etc/wireguard/wg0.conf - - Manual start verified (handshakes established with both gateways) - - Auto-start disabled (systemctl disable wg-quick@wg0.service) - - Currently stopped as requested - -* ✓ **Committed and pushed changes**: - - conf repo: pf.conf.tpl, Rexfile, wireguardroaming-plan.md (commit 99a91d4) - - wireguardmeshgenerator repo: Gemfile, wireguardmeshgenerator.rb, wireguardmeshgenerator.yaml (commit a6984e1) - - Both repos pushed to remote - -* ✓ **Blog post updated**: - - Updated header with "last updated Sun 11 Jan 21:33:40 EET 2026" - - Added "Update: Roaming Client Support Added" section after TOC - - Added comprehensive "Adding Roaming Clients" section before conclusion - - Updated introduction paragraph to mention roaming clients - - Committed to 
foo.zone-content/gemtext (commits dc65c06f, e5a0cf29) - -* ✓ **Mesh network graph updated**: - - Created Python script to generate updated visualization - - New graph includes earth and pixel7pro as purple roaming clients - - Shows blue dashed lines from clients to gateways only - - Preserves original full mesh (gray lines) for infrastructure hosts - - Color-coded by OS: FreeBSD (red), Rocky Linux (teal), OpenBSD (yellow) - - Saved as wireguard-full-mesh-with-roaming.svg - - Added graph reference to blog post - - Committed to foo.zone-content/gemtext (commit e5a0cf29) - -## Implementation Complete - -All tasks for adding WireGuard roaming client support have been completed: - -1. ✅ Modified wireguardmeshgenerator.rb for roaming client detection -2. ✅ Added earth and pixel7pro to YAML configuration -3. ✅ Generated and deployed configs to all hosts -4. ✅ Configured OpenBSD NAT rules via IaC (PF firewall) -5. ✅ Verified WireGuard works on earth (manual start) -6. ✅ Tested pixel7pro Android connectivity -7. ✅ Documented automatic failover limitation -8. ✅ Committed and pushed all code changes -9. ✅ Updated blog post with comprehensive documentation -10. ✅ Generated updated mesh network visualization - -**Repositories updated:** -- `conf` (commit 99a91d4): pf.conf.tpl, Rexfile, wireguardroaming-plan.md -- `wireguardmeshgenerator` (commit a6984e1): Ruby code, YAML config, Gemfile -- `foo.zone` (commits dc65c06f, e5a0cf29): blog post, mesh graph - -- cgit v1.2.3