author Paul Buetow <paul@buetow.org> 2026-06-10 07:54:55 +0300
committer Paul Buetow <paul@buetow.org> 2026-06-10 07:54:55 +0300
commit 9dac4b33948f441ec645a8ec491878085483aeb6
tree 53eb3a496e9d96ab8fbae4167a39064ccac61891
parent c61fb1f71a72d66960914877e8f0a24638c85324
docs(follow-forks): add process-tree-following plan + filter.c reference
Document the planned opt-in "follow forks" mode that would let ior trace a
target PID and all its descendants (needed for the landlock_restrict_self
integration case, task ci0, and for tracing forking workloads as a tree).

The plan covers the BPF descendant-set map, sched_process_fork/exit hooks, the
FOLLOW_FORK gate in filter(), userland flag/seeding/assertion changes, and
explicitly requires syscall-count aggregation to roll up across the followed
tree. Add a reference comment above filter() pointing to the plan.

Plan only; not implemented.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
-rw-r--r-- docs/follow-forks-plan.md 200
-rw-r--r-- internal/c/filter.c 8
2 files changed, 208 insertions, 0 deletions
diff --git a/docs/follow-forks-plan.md b/docs/follow-forks-plan.md
new file mode 100644
index 0000000..1f611ce
--- /dev/null
+++ b/docs/follow-forks-plan.md
@@ -0,0 +1,200 @@
+# Process-Tree Following ("Follow Forks"): Implementation Plan
+
+Status: **planned, not implemented.** This document is a design/implementation
+plan only. No code in this plan has been written yet.
+
+## Motivation
+
+Today ior traces exactly one TGID. `internal/c/filter.c` `filter()` accepts a
+syscall only when the current thread group ID (TGID) equals the load-time
+`PID_FILTER` global (or `PID_FILTER` is `-1`, meaning trace-all). Children that a workload forks/execs
+have a different TGID and are dropped **in-kernel**, before any userspace comm
+filtering can see them.
+
+This blocks any workload that does meaningful work in child processes:
+
+- `landlock_restrict_self` integration coverage (task `ci0`): the syscall
+ irreversibly sandboxes its calling process, so it can only be exercised in a
+ short-lived child; that child's syscalls are currently filtered out.
+- More broadly: shells, `make`, CI pipelines, and forking servers cannot be
+ traced as a unit.
+
+Following forks makes ior trace a target PID **and all its descendants** as a
+tree, while leaving the default single-PID behavior unchanged.
+
+## Requirements
+
+1. Opt-in via a new `-follow-forks` flag. Default OFF → existing behavior and all
+ existing tests are byte-for-byte unaffected.
+2. When ON, trace the root PID and every descendant created after attach
+ (fork/clone), including across `exec` (which preserves the TGID).
+3. **Syscall-count aggregation must still cover the whole followed tree.** The
+ per-syscall aggregate counts/durations/histograms in `syscall_aggregate_map`
+ (maintained by `ior_update_syscall_aggregate` in `filter.c`) must roll up
+ syscalls issued by descendants, not just the root. Note the existing
+ invariant to preserve: aggregate counting is independent of per-event
+ sampling — `ior_on_syscall_exit` calls `ior_update_syscall_aggregate`
+ regardless of the sampling/`emit_event` decision, so even syscalls in
+ aggregate-only (sampling-rate 0) mode still count. Following forks must not
+ regress this: a descendant whose individual events are sampled down must
+ still contribute to the counts.
+4. Bounded resource use: the descendant set lives in a fixed-size BPF map and is
+ reclaimed on process exit; pathological fork storms degrade gracefully
+ (best-effort) and do not crash the tracer.
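Requirement 3 can be illustrated with a small userland model. This is only a sketch, not the BPF code in `filter.c`: `model_aggregate` and `model_on_syscall_exit` are hypothetical names, and the two booleans stand in for the `filter()` result and the sampling (`emit_event`) decision described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of the per-syscall aggregate. */
struct model_aggregate {
    uint64_t count;   /* syscalls counted, sampled or not */
    uint64_t emitted; /* syscalls that also produced an event */
};

/* Model of the invariant: every accepted syscall is counted, and the
 * sampling decision only controls whether an event is emitted. */
static void model_on_syscall_exit(struct model_aggregate *agg,
                                  bool accepted, bool emit_event) {
    if (!accepted)
        return;          /* filtered out: neither counted nor emitted */
    agg->count++;        /* counted regardless of sampling */
    if (emit_event)
        agg->emitted++;
}
```

Under this model, a descendant that passes the filter but is sampled down still increments `count`, which is the behavior the requirement asks follow-forks to preserve.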
+
+## Design overview
+
+A new BPF hash map holds the set of traced TGIDs. Two `sched` tracepoints keep it
+current (add child on fork, remove on process exit). `filter()` consults the set
+when a new `FOLLOW_FORK` global is on. The whole feature is gated off by default.
+
+```
+ sched_process_fork sched_process_exit
+ (parent ctx, pre-child) (leader exit only)
+ │ add child tgid │ delete tgid
+ ▼ ▼
+ ┌──────────────────────────────────────────────────────────────┐
+ │ traced_pid_map (hash) │
+ │ key: __u32 tgid value: __u8 │
+ └──────────────────────────────────────────────────────────────┘
+ ▲ seeded with root PID at startup
+ │ consulted (one lookup) per syscall when
+ │ FOLLOW_FORK == 1
+ filter() → ACCEPT / FILTER
+ │
+ ▼ ACCEPT covers BOTH event emission AND
+ ior_update_syscall_aggregate (count rollup)
+```
+
+Because `sched_process_fork` fires in the parent's context **before the child
+runs**, the child is enrolled in the map before it executes its first syscall.
+This is what makes following reliable. `exec` does not change the TGID, so a
+re-exec'd child keeps its membership — exactly what the `ci0` landlock case
+needs.
+
+Putting the descendant check inside `filter()` (the single gate in front of both
+event emission and `ior_update_syscall_aggregate`) is what satisfies requirement
+3 for free: any accepted descendant flows into the same aggregate-counting path
+as the root.
+
+## Changes by layer
+
+### 1. BPF data plane (`internal/c/`)
+
+- **`maps.h`** — add `traced_pid_map`:
+ `BPF_MAP_TYPE_HASH`, key `__u32` (tgid), value `__u8`, `max_entries` ~8192
+ (resizable like `event_map`). Consider `BPF_MAP_TYPE_LRU_HASH` as a safety
+ valve against fork-storm exhaustion (auto-evicts coldest entries instead of
+ failing inserts).
+
+- **New `follow_fork.c`** (or appended into `filter.c`):
+ - `const volatile __u32 FOLLOW_FORK;` global (set at load time, mirrors
+ `PID_FILTER`).
+ - `SEC("tracepoint/sched/sched_process_fork")`: read `child_pid` from the
+ tracepoint context (use the tracepoint format / CO-RE as the syscall
+ handlers already do); if the parent's TGID is in `traced_pid_map`, insert
+ the child TGID. Only act when `FOLLOW_FORK == 1`.
+ - `SEC("tracepoint/sched/sched_process_exit")`: delete the TGID, but **only on
+ thread-group-leader exit** (`pid == tgid`) so per-thread exits don't evict a
+ still-live process.
+
+- **`filter()`** — when `FOLLOW_FORK == 1`, additionally `ACCEPT` if
+ `bpf_map_lookup_elem(&traced_pid_map, &tgid)` hits. Keep the existing
+ `IOR_PID_FILTER` self-exclusion and the `PID_FILTER == -1` trace-all path.
+ This is one extra hot-path lookup, gated by the flag (negligible when off).
+
+### 2. BPF control plane (Go, `internal/`)
+
+- **`bpfsetup.go`** — set the `FOLLOW_FORK` global in `setBPFGlobals` (mirror
+ `PID_FILTER`); add `traced_pid_map` to `resizeBPFMaps` if made resizable.
+- **Seeding** — after `BPFLoadObject`, when follow-forks is on, insert the root
+ `cfg.PidFilter` into `traced_pid_map` via the libbpfgo `Map.Update`. New small
+ helper, called from `setupBPFModule`.
+- **Attach the sched hooks** — in `setupBPFModule`, after `mgr.AttachAll`,
+ directly attach the two programs via
+ `GetProgram(...).AttachTracepoint("sched", "sched_process_fork" / "sched_process_exit")`.
+ Keep them out of the syscall selector / TUI probe state (they are always-on
+ plumbing, not user-selectable). Retain their `Link`s for clean teardown.
+
+### 3. Userland filter (`internal/flags/`)
+
+- **`flags.go`** — add `FollowFork bool` (default false) and a `-follow-forks`
+ CLI flag.
+- **`tracefilter.go`** — when `FollowFork` is on, **do not** set the userland PID
+ `Eq` filter (currently `tracefilter.go:26`, `cfg.PidFilter > 0`). The kernel
+ already scopes to the tree; a userland `pid == root` equality filter would
+ wrongly drop legitimate descendant records.
+
+### 4. Test harness (`integrationtests/`)
+
+- **`harness.go`** — add an opt-in run mode that passes `-follow-forks` (via the
+ existing `extraIorArgs` path) and seeds the root with the workload PID.
+- **`expectations.go`** — add a tree/comm-aware assertion, e.g.
+ `AssertPidsWithinTree(result, rootPID, allowedComms...)` or a comm-scoped
+ `AssertOnlyComm(result, "ioworkload")`, since descendant PIDs are legitimately
+ `!= root`. The existing `AssertNoUnexpectedPID` (expectations.go:81) stays
+ as-is for normal (non-tree) tests.
+
+### 5. Feature validation + `ci0`
+
+- **New `follow_fork_test.go`** — a scenario that forks+execs a child issuing a
+ distinctive syscall. Assert:
+ - **with** follow-forks: the child's syscall **is** captured, and the child's
+ syscalls contribute to the syscall-count aggregate (requirement 3);
+ - **without** follow-forks: the child's syscall is **not** captured (proves the
+ default is unchanged).
+- **`ci0`** — scenario re-execs an ioworkload child subcommand that does
+ `landlock_create_ruleset → landlock_restrict_self(rf, 0) → exit`; the parent is
+ never sandboxed. The test runs with follow-forks and a comm-scoped assertion
+ for `enter_landlock_restrict_self`. ~30 min once the infra above exists.
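A forking workload of the kind these tests need might look like the sketch below. It is an illustration only: `run_forking_workload` is a hypothetical helper, `getppid` stands in for the "distinctive syscall", and the `exec` step and the landlock calls of the `ci0` scenario are deliberately left out.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that issues one syscall and exits with code 7.
 * Returns the child's exit code, or -1 on error. */
static int run_forking_workload(void) {
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* Child: runs under a different TGID than the parent, so today
         * this syscall would be dropped by the in-kernel filter. */
        long ppid = syscall(SYS_getppid);
        _exit(ppid > 0 ? 7 : 1);
    }
    int status = 0;
    if (waitpid(child, &status, 0) != child)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent only forks and waits, so any `getppid` record attributed to the child's PID would have to come from the followed descendant.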
+
+## Effort estimate: ~2.5–4 days
+
+| Piece | Est. |
+|---|---|
+| BPF map + 2 sched hooks + filter change | 0.5–1d |
+| Go: flag, global, map seeding, attach, userland filter bypass | 0.5d |
+| Harness mode + tree/comm assertions + feature integration test | 0.5–1d |
+| `ci0` scenario + test | 0.25d |
+| Verifier / edge-case buffer (thread-vs-process exit, fork tracepoint field offsets, map sizing) | 0.5–1d |
+
+## Risks & mitigations
+
+- **Behavior regression** → default-off `FOLLOW_FORK` global; zero change to
+ existing code paths when off. Gate sign-off on the full suite staying green.
+- **`sched_process_fork` field extraction** (`child_pid` offset) → use the
+ tracepoint format / CO-RE, consistent with the existing syscall handlers.
+- **Thread vs process exit eviction** → guard the delete on `pid == tgid`
+ (leader only) so a thread exit never drops a live process.
+- **Map exhaustion under heavy forking** → bounded hash with exit-hook reclaim;
+ optionally `LRU_HASH`. Best-effort is acceptable for a tracer.
+- **Verifier and hot-path cost** → a single extra map lookup, flag-gated; low on both counts.
+- **Count-aggregation correctness (requirement 3)** → keep the descendant check
+ inside `filter()` so accepted descendants share the root's
+ `ior_update_syscall_aggregate` path; add an explicit assertion in
+ `follow_fork_test.go` that descendant syscalls increment the aggregate counts.
+
+## Sequencing
+
+1. BPF map + `filter()` `FOLLOW_FORK` branch (off by default) + Go global →
+ confirm the suite is still green.
+2. Add the sched hooks + map seeding + attach.
+3. Feature integration test (on/off, including the count-rollup assertion) →
+ proves the mechanism.
+4. Userland filter bypass + harness tree mode + assertions.
+5. `ci0` scenario + test.
+
+Steps 1–3 deliver the reusable feature; steps 4–5 consume it. Each step is
+independently verifiable.
+
+## Source-of-truth references
+
+- `internal/c/filter.c` — `filter()`, `ior_on_syscall_exit`,
+ `ior_update_syscall_aggregate` (the count path that must cover the tree).
+- `internal/c/maps.h` — map declarations (where `traced_pid_map` is added).
+- `internal/bpfsetup.go` — `setBPFGlobals` (`PID_FILTER` etc.), `resizeBPFMaps`.
+- `internal/ior_bpfsetup.go` — `setupBPFModule` attach flow; `AttachTracepoint`.
+- `internal/flags/flags.go`, `internal/flags/tracefilter.go` — flag surface and
+ userland PID filtering.
+- `integrationtests/harness.go`, `integrationtests/expectations.go` —
+ `AssertNoUnexpectedPID` and the run modes a tree-aware test needs.
diff --git a/internal/c/filter.c b/internal/c/filter.c
index 5440bcc..66c6574 100644
--- a/internal/c/filter.c
+++ b/internal/c/filter.c
@@ -120,6 +120,14 @@ static __always_inline int ior_on_syscall_exit(__u32 tid, __u32 enter_trace_id,
return emit_event != 0;
}
+// filter() decides whether the current task's syscall is in scope. Today this is
+// a single-TGID gate (PID_FILTER, with -1 meaning trace-all) plus an optional
+// TID_FILTER. ior does NOT follow forks: a traced process's children run under a
+// different TGID and are excluded here, which also means their syscalls miss the
+// aggregate-count path downstream. A planned opt-in process-tree-following mode
+// would extend this gate to also accept descendant TGIDs from a BPF-maintained
+// set seeded with the root PID and updated via sched_process_fork/exit — see
+// docs/follow-forks-plan.md for the full design.
static __always_inline int filter(__u32 *pid, __u32 *tid) {
u64 pid_tgid = bpf_get_current_pid_tgid();
*pid = pid_tgid >> 32;