Age | Commit message | Author
13 days | fix(loop-scheduler): replace ctx.waitForIdle() with isIdle() polling in agent_end handler [main] | Paul Buetow
The agent_end event handler receives ExtensionContext, not ExtensionCommandContext, so ctx.waitForIdle() is not available. Replace with a polling loop using ctx.isIdle() to wait for the run to finish before draining pending jobs, preventing stuck follow-up messages.
13 days | fix(loop-scheduler): await waitForIdle in agent_end before draining | Paul Buetow
Inside an agent_end listener, agent.state.isStreaming is still true: finishRun() only clears it in the finally block of runWithLifecycle, after all agent_end listeners settle. So when we dispatched a pending job from agent_end and called pi.sendUserMessage(..., { deliverAs: 'followUp' }), the message was routed into agent.followUpQueue. The agent loop had already passed its getFollowUpMessages() check, so it exited without draining the queue. The message sat there as a stuck 'Follow-up: ...' in pi's UI, agentBusy stayed true forever, and every subsequent pending loop was blocked because no further agent_end fired.
Await ctx.waitForIdle() in the agent_end handler before resetting agentBusy and calling drainPendingJobs. By then finishRun() has cleared isStreaming, so sendUserMessage starts a fresh run instead of enqueueing into a dead followUp queue, and pending loops drain serially as designed.
Amp-Thread-ID: https://ampcode.com/threads/T-019e5de9-a0c3-7559-9cf0-f81ce751e763
Co-authored-by: Amp <amp@ampcode.com>
13 days | fix(watch): auto-recover when default VM is dead or replaced | Paul Buetow
- Add per-VM 10s fetch timeout so one dead VM cannot stall the dashboard
- Make fallback logic check VM state (public_ip + ACTIVE status) instead of just file existence, so a stale/deleted VM1 state does not block watch
- Auto-replace cached SSH host keys when a VM is recreated instead of failing
- Suppress Ruby thread exception noise on killed SSH threads
Fixes 'just watch' showing blank screen when VM1 is deleted but has a stale state file, and SSH host-key mismatch on VM recreation.
13 days | update hyperstack2 VM state and config after recreation | Paul Buetow
14 days | fix(provisioning): recover from vLLM readiness timeout and increase poll window | Paul Buetow
When create timed out during vLLM readiness polling (common for large models like Qwen3.6-27B-FP8), rerunning create would stop and restart the already-running container, restarting the whole startup sequence.
Now the vLLM install script checks if the container is already running and serving the correct model before touching it. If it detects a healthy container, it skips the stop/pull/start cycle entirely.
Also increases the readiness timeout from 20 min (240x5s) to 30 min (360x5s) to accommodate cold starts with model download and CUDA graph capture on large models.
14 days | feat(tooling): add ollama fish abbreviations for kimi, qwen, glm, minimax | Paul Buetow
14 days | feat(pi): add ollama provider with kimi-k2.6:cloud, qwen3.5:cloud, glm-5.1:cloud, minimax-m2.7:cloud | Paul Buetow
14 days | chore(config): revert vm2 default to n3-A100x1; simplify justfile | Paul Buetow
14 days | chore(tooling): add justfile for common VM lifecycle, observability, and debugging commands | Paul Buetow
2026-05-24 | chore(vm2): H100 provisioning, L40 plan, and H100-specific vLLM tuning | Paul Buetow
2026-05-24 | fix(cli): watch/status/test auto-detect active VMs when default VM1 is not provisioned | Paul Buetow
2026-05-24 | chore(config): remove gpt-oss-120b references since qwen3.6 is better | Paul Buetow
2026-05-24 | fix(watcher): show actionable error when VM not provisioned or SSH fails | Paul Buetow
2026-05-24 | replace qwen3-coder-next with qwen3.6-27b across configs, docs, and tooling | Paul Buetow
2026-05-24 | feat(watch): retry SSH connection failures with exponential backoff | Paul Buetow
Remove the vm_api_reachable? filter from run_watch so VMs that are currently booting are not silently dropped from the dashboard. Add exponential-backoff retry logic (up to 4 attempts, sleeping 2s, 4s, 8s, 16s) inside VllmWatcher#fetch_vm_stats for transient SSH/WireGuard errors such as connection refused, host unreachable, and exit 255. This lets watch automatically recover while a VM is still starting up instead of failing immediately.
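The retry schedule described in this commit could look roughly like the following Ruby sketch. It is not the actual VllmWatcher#fetch_vm_stats code: `with_backoff`, `TransientSSHError`, and the injectable `sleeper` are hypothetical names, and the sketch assumes one initial try followed by one retry per listed delay.

```ruby
# Hypothetical error class standing in for transient SSH/WireGuard failures
# (connection refused, host unreachable, exit 255).
class TransientSSHError < StandardError; end

# Delays taken from the commit message: 2s, 4s, 8s, 16s.
BACKOFF_DELAYS = [2, 4, 8, 16].freeze

# Runs the block, retrying on TransientSSHError after each delay.
# Assumption: one initial try plus one retry per delay; once the delays
# are exhausted the last error is re-raised to the caller.
def with_backoff(delays: BACKOFF_DELAYS, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    yield
  rescue TransientSSHError
    raise if attempt >= delays.length
    sleeper.call(delays[attempt])
    attempt += 1
    retry
  end
end
```

Passing the sleeper as a lambda keeps the schedule testable without actually waiting; a caller would simply wrap the SSH call in `with_backoff { ... }`.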
2026-05-24 | chore: add pi/prompt-history.json to .gitignore | Paul Buetow
2026-05-24 | fix(loop-scheduler): always pass deliverAs followUp for scheduled messages | Paul Buetow
The runtime now requires a streamingBehavior (steer/followUp) to queue a message when the agent is already processing. Previously only Gemma 4 models passed { deliverAs: 'followUp' }, causing all other models to throw 'Agent is already processing' and leaving the job stuck in pending. Scheduled and watch messages are independent prompts, so followUp is the correct behavior for all models.
2026-05-24 | chore(gitignore): ignore hyperstack state temp files | Paul Buetow
Add patterns for .hyperstack-*-state.json and .hyperstack-*-state.json.known_hosts to keep ephemeral VM state and WireGuard artifacts out of version control.
2026-05-24 | docs: refresh README, hypr.fish, AGENTS.md for consolidated --vm CLI | Paul Buetow
2026-05-24 | feat(cli): replace --config with --vm 1|2|both, remove create-both/delete-both | Paul Buetow
- Drop single-VM default hyperstack-vm.toml and @config_path/@config_explicit machinery
- Add global --vm flag (default: 1) mapping to hyperstack-vm1.toml and/or hyperstack-vm2.toml
- Fold create-both and delete-both into create/delete --vm both
- Teach status, watch, test, model to accept --vm (default: 1)
- Update help text and README/AGENTS/fish abbreviations accordingly
2026-05-24 | docs: remove single-VM and ComfyUI/photo references | Paul Buetow
2026-05-24 | cleanup: remove ComfyUI and photo-related code from lib/hyperstack | Paul Buetow
2026-05-24 | chore: remove photo/ComfyUI top-level files | Paul Buetow
Delete hyperstack-vm-photo.toml, photo-enhance.rb, photo-enhance-review.md, smart_photo_node.py, workflows/photo-enhance.json (and empty workflows/ dir), and __pycache__/smart_photo_node.cpython-314.pyc (and empty __pycache__/ dir). No .hyperstack-vm-photo-state.json* state files were present.
ComfyUI references in lib/hyperstack/*.rb intentionally left for task T2.
2026-05-24 | fix(loop-scheduler): reset agentBusy when drainPending detects idle context | Paul Buetow
The agentBusy flag could get stuck if an agent_start event fired but no matching agent_end followed (e.g. crash or forced shutdown). The scheduler would then show 'pending' forever even though the agent was completely idle.
Now drainPendingJobs() and drainPendingWatchJobs() ask the ExtensionContext's isIdle() as a ground-truth fallback whenever agentBusy is true. If the context reports idle, we reset agentBusy = false and proceed to dispatch pending jobs instead of bailing out.
2026-05-24 | fix(provisioning): narrow ComfyUI install chmod to models_dir and output_dir | Paul Buetow
The comfyui_install_script previously ran
    chmod -R 0777 File.dirname(models_dir)
which chmods the *parent* directory (e.g. /ephemeral). If models_dir is configured directly under /ephemeral that gives world-write access to all sibling directories (vLLM hug cache, Ollama models, etc.).
Now chmod only the two directories that actually need it: models_dir and output_dir.
2026-05-24 | fix(wireguard): handle leading whitespace in /etc/hosts lines | Paul Buetow
In prune_host_line, body.split(/\s+/) on a line with leading whitespace produced tokens starting with an empty string, which was then shifted off as the IP field (''). This caused the rewritten /etc/hosts entry to lose its IP silently. Fix by stripping the body before splitting: body.strip.split(/\s+/).
Refs: hc
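The split behavior behind this fix can be reproduced in a few lines of Ruby. The sample line and the variable names `before` and `after` are illustrative only and are not taken from prune_host_line.

```ruby
# Illustrative /etc/hosts body with leading whitespace (sample data).
body = "  10.0.0.1 hyperstack1.wg1"

# Splitting on a regex keeps the empty field before the first separator,
# so the first token is "" rather than the IP.
before = body.split(/\s+/)

# Stripping first removes the leading whitespace, so the IP comes first.
after = body.strip.split(/\s+/)

puts before.inspect
puts after.inspect
```

With the sample line, `before.first` is the empty string while `after.first` is the IP, which is the difference the commit relies on.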
2026-05-24 | fix(cli): avoid false VM2 abort when VM1 fails after WG step succeeded | Paul Buetow
In run_create_both, VM1's thread rescue unconditionally set vm1_wg_state[:error], even when the WireGuard step had already signaled success (vm1_wg_state[:done] = true). If VM2 was waiting on the condition variable at that moment, it would raise 'VM1 WireGuard setup failed' and abort needlessly. Now the rescue only sets :error when :done is still false, so a downstream VM1 failure (e.g. vLLM install) no longer leaks to VM2.
Resolves agent task ic.
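The guard described here might be sketched as below. This is not the run_create_both code: `record_vm1_failure` is a hypothetical helper, and the mutex and condition variable arguments are assumptions about how the shared vm1_wg_state hash is protected.

```ruby
# Hypothetical helper: publish a VM1 failure to the waiting VM2 thread,
# but only if the WireGuard step has not already signaled success.
def record_vm1_failure(state, mutex, cond, error)
  mutex.synchronize do
    unless state[:done]
      state[:error] = error
      cond.broadcast # wake any thread waiting on the WireGuard result
    end
  end
end

mutex = Mutex.new
cond = ConditionVariable.new

# WireGuard already succeeded: a later vLLM failure must not leak to VM2.
after_wg = { done: true }
record_vm1_failure(after_wg, mutex, cond, RuntimeError.new("vLLM install failed"))

# WireGuard never finished: the failure is recorded for VM2 to see.
before_wg = { done: false }
record_vm1_failure(before_wg, mutex, cond, RuntimeError.new("WireGuard setup failed"))
```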
2026-05-24 | fix(watcher): remove Timeout.timeout to prevent orphaning SSH child processes | Paul Buetow
Replace Timeout.timeout(15) around Open3.capture3 with SSH-level keepalive options (ServerAliveInterval=5, ServerAliveCountMax=3). Ruby's Timeout raises in a background thread but leaves the ssh process running; SSH's own timeouts self-terminate cleanly.
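A minimal sketch of the keepalive approach, assuming the watcher invokes the `ssh` binary through Open3.capture3. The helper name `ssh_stats_command`, the host, and the remote command are hypothetical; only the two option values come from the commit message.

```ruby
require "open3"

# Keepalive settings from the commit message: ssh disconnects by itself
# after about 3 unanswered probes sent 5 seconds apart.
SSH_KEEPALIVE_OPTS = ["-o", "ServerAliveInterval=5", "-o", "ServerAliveCountMax=3"].freeze

# Hypothetical helper: builds the argv for the ssh invocation.
def ssh_stats_command(host, remote_cmd)
  ["ssh", *SSH_KEEPALIVE_OPTS, host, remote_cmd]
end

# Usage (not executed here); no Timeout.timeout wrapper is involved:
# stdout, stderr, status = Open3.capture3(*ssh_stats_command("hyperstack1.wg1", "uptime"))
```

Because the timeout lives inside the ssh process, there is no Ruby-side interruption that could leave the child running.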
2026-05-24 | cleanup | Paul Buetow
2026-05-24 | feat: improve task plan mode widget display and update settings version | Paul Buetow
2026-05-24 | fix(cli): synchronize access to errors hash in run_create_both | Paul Buetow
2026-05-24 | fix(provisioning): chown models_dir itself, not its parent | Paul Buetow
2026-05-24 | fix(config): memoize detected_operator_cidr failure to avoid repeated probes | Paul Buetow
When all public IP probes fail (network down, DNS broken), detect_public_operator_cidr raises HyperstackVM::Error. The old code did not cache this failure, so every call to resolved_allowed_cidrs re-ran all probes, compounding slowness.
Add a rescue block in detected_operator_cidr that stores the exception in @detected_operator_cidr_error and re-raises it. On subsequent calls the cached error is re-raised immediately, preventing redundant probe retries.
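The failure memoization could be sketched like this. The class `CidrDetector`, the `ProbeError` stand-in for HyperstackVM::Error, the injected probe block, and the `probe_runs` counter are all hypothetical; only the method and instance variable names follow the commit message.

```ruby
# Stand-in for HyperstackVM::Error in this sketch.
class ProbeError < StandardError; end

class CidrDetector
  attr_reader :probe_runs

  # The probe block is a hypothetical replacement for the real
  # public-IP probes; probe_runs counts how often it actually ran.
  def initialize(&probe)
    @probe = probe
    @probe_runs = 0
  end

  def detected_operator_cidr
    # A cached failure short-circuits before any probe runs again.
    raise @detected_operator_cidr_error if @detected_operator_cidr_error

    @detected_operator_cidr ||= begin
      @probe_runs += 1
      @probe.call
    end
  rescue ProbeError => e
    # Remember the failure so later calls fail fast, then re-raise it.
    @detected_operator_cidr_error = e
    raise
  end
end
```

Caching the exception object itself means later callers see the same error message as the first failure, without another round of probes.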
2026-05-24 | fix(smart_photo_node): strip .orient.<ext> suffix for all image formats | Paul Buetow
2026-05-24 | fix(manager): only delete state file when VM deletion is confirmed | Paul Buetow
Ensure Manager#delete does not wipe the state file on generic/transient API failures. The rescue now checks whether the error message indicates the VM is already gone (404, not_found, does not exist) before removing state. This prevents orphaned billable VMs after exhausted retries or transient network errors.
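One way to picture the check is the sketch below. It is not Manager#delete itself: `delete_vm`, `vm_already_gone?`, and the injected `api_delete` and `remove_state` callables are hypothetical, and the pattern simply encodes the three markers named in the commit message.

```ruby
# Markers from the commit message that indicate the VM no longer exists.
GONE_PATTERN = /404|not_found|does not exist/i

def vm_already_gone?(error)
  error.message.match?(GONE_PATTERN)
end

# Hypothetical delete flow with injected callables for the API call and
# the state-file removal.
def delete_vm(api_delete:, remove_state:)
  api_delete.call
  remove_state.call
  :deleted
rescue StandardError => e
  # Only wipe local state when the API says the VM is really gone;
  # transient failures keep the state file and propagate the error.
  raise unless vm_already_gone?(e)

  remove_state.call
  :already_gone
end
```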
2026-05-24 | fix(photo-enhance): ensure .orient tempfiles are always cleaned up in enhance_one | Paul Buetow
Wrap enhance_one body in begin/ensure to unconditionally delete upload_path (and tmp_png) on every exit path, not just success. Prevents 50+ MB RAW->TIFF leaks when upload_image/submit_prompt/wait_for_output/save_with_corrections raises or when ComfyUI connection errors are caught.
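The cleanup pattern can be illustrated with a small Ruby sketch. `with_upload_cleanup` is a hypothetical stand-in for the body of enhance_one; only the idea of deleting upload_path in an ensure clause comes from the commit message.

```ruby
require "fileutils"

# Hypothetical stand-in for enhance_one: whatever the block does with the
# temporary upload file, the ensure clause removes it on every exit path.
def with_upload_cleanup(upload_path)
  yield upload_path
ensure
  # rm_f ignores a missing file, so cleanup is safe even if the block
  # already deleted it.
  FileUtils.rm_f(upload_path)
end
```

The method still returns the block's value on success and still propagates exceptions; the ensure clause only adds the file removal.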
2026-05-24 | fix(photo-enhance.rb): add open_timeout and read_timeout to all ComfyUI HTTP calls | Paul Buetow
2026-05-24 | fix(wg1-setup.sh): escape WG_HOSTNAME in sed /etc/hosts cleanup | Paul Buetow
Escape regex metacharacters in WG_HOSTNAME before embedding into the sed delete pattern so '.' (always present in hostnames like hyperstack1.wg1) is treated as a literal dot rather than a wildcard. Replace the literal-space anchor with [[:space:]] so tab-indented lines in /etc/hosts are also matched and removed correctly.
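The actual fix lives in a shell script and uses sed; the same two ideas (escape the hostname, match any whitespace) can be illustrated in Ruby. `hosts_entry_pattern` is a hypothetical helper and the sample lines are made up for the sketch.

```ruby
# Hypothetical helper: matches an /etc/hosts line that contains the
# hostname as a whole whitespace-delimited field.
def hosts_entry_pattern(hostname)
  /[[:space:]]#{Regexp.escape(hostname)}([[:space:]]|\z)/
end

pattern = hosts_entry_pattern("hyperstack1.wg1")

# Without escaping, '.' is a wildcard and also matches other characters.
naive = /[[:space:]]hyperstack1.wg1/

tab_line   = "10.0.0.1\thyperstack1.wg1"
other_line = "10.0.0.2 hyperstack1xwg1"

puts pattern.match?(tab_line)
puts pattern.match?(other_line)
puts naive.match?(other_line)
```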
2026-04-24 | add qwen | Paul Buetow
2026-04-24 | task 78: make Qwen3.6-27B the VM2 default | Paul Buetow
2026-04-11 | remove the path | Paul Buetow
2026-04-11 | pi: point task CLI docs and matching from do to ask | Paul Buetow
DO_CLI_REF and resolveDoExecutable use ~/go/bin/ask; matchDoInvocation still accepts legacy do prefixes. Update README and Nemotron hints.
Made-with: Cursor
2026-04-11 | update | Paul Buetow
2026-04-11 | Pi extensions: document and invoke task CLI as ~/go/bin/do | Paul Buetow
Use DO_CLI_REF and resolveDoExecutable in agent-plan-mode; accept both do and ~/go/bin/do in bash guards. Ask-mode shares matchDoInvocation. Nemotron/Qwen tool discipline points to ~/go/bin/do done.
Made-with: Cursor
2026-04-11 | Rename VM1 configs: default hyperstack-vm1.toml, Nemotron in -nemotron | Paul Buetow
Move the former hyperstack-vm1-coder.toml to hyperstack-vm1.toml as the standard VM1 profile (Qwen3-Coder-Next on single GPU). Preserve the dual-H100 Nemotron-3-Super stack as hyperstack-vm1-nemotron.toml. Point create-both at hyperstack-vm1.toml and refresh README for current defaults.
Made-with: Cursor
2026-04-08 | pi: use do CLI instead of ask for task management | Paul Buetow
Rename task-wrapper invocations and prompts from ask to do in agent-plan-mode (exec, bash guards, workflow strings), plan-mode README, ask-mode readonly-command detection, and nemotron-tool-repair discipline text. Internal helpers renamed for consistency (runDo, isSafeDoCommand).
Made-with: Cursor
2026-04-06 | provisioner: support docker_image and pre_start_cmd for Gemma 4 startup | Paul Buetow
Adds docker_image and pre_start_cmd config fields to config.rb and provisioning.rb so the Gemma 4 31B workarounds are baked in:
- docker_image = "vllm/vllm-openai:nightly" (stable lacks Gemma 4 support)
- pre_start_cmd = "pip install -q transformers==5.5.0" (stable pins <5)
- extra_docker_env = ["CUDA_VISIBLE_DEVICES=0"] (required with --entrypoint bash)
When pre_start_cmd is set, the provisioner switches to --entrypoint bash and chains the patch command before launching vLLM, so create-both works end-to-end without manual container replacement.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 | hyperstack: switch to Gemma 4 31B on VM2, Qwen3-Coder-Next on VM1 | Paul Buetow
VM1 (hyperstack-vm1-coder.toml, renamed from hyperstack-vm1-gptoss.toml):
- Default model switched from gpt-oss-120b to qwen3-coder-next
- Config file renamed to reflect actual default model
VM2 (hyperstack-vm2.toml):
- Default model switched from qwen3-coder-next to Gemma 4 31B AWQ
- Uses vLLM nightly image + transformers==5.5.0 workaround: Gemma 4 architecture is registered in transformers 5.x but vLLM stable pins <5
- max_model_len=131072 (128K context); KV cache fills ~95% of H100-80GB VRAM
- Added gemma4-31b preset
watcher.rb:
- Add loading_status field to VmSnapshot to show live model-load progress (last relevant log line during startup instead of generic "loading" message)
- fetch_vm_stats now captures both Engine 0 stats and loading-phase log lines in a single SSH call using a shell variable to avoid two docker log invocations
- clean_log_line() strips vLLM PID/timestamp prefix for readable display
cli.rb: update all hardcoded hyperstack-vm1-gptoss.toml references to hyperstack-vm1-coder.toml
hypr.fish: replace pi-hyperstack-nemotron with pi-hyperstack-coder (VM1), add pi-hyperstack-gemma4 (VM2)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 | agent: use ask ids in task extensions | Paul Buetow
2026-03-26 | eee97223-bfde-48d7-93f5-d1bb0ecddaba add /watch command | Paul Buetow