| Age | Commit message (Collapse) | Author |
|
Document the planned opt-in "follow forks" mode that would let ior trace a
target PID and all its descendants (needed for the landlock_restrict_self
integration case, task ci0, and for tracing forking workloads as a tree).
The plan covers the BPF descendant-set map, sched_process_fork/exit hooks,
the FOLLOW_FORK gate in filter(), userland flag/seeding/assertion changes,
and explicitly requires syscall-count aggregation to roll up across the
followed tree. Add a reference comment above filter() pointing to the plan.
Plan only — not implemented.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
Root cause: the generic field matcher classifyByField only maps an arg
literally named "fd" to KindFd. Several syscalls operate on an EXISTING
fd whose tracepoint arg0 is named something else, so they fell through
to KindNull -> null_event, capturing NO descriptor and dropping the fd
they act on:
- timerfd_gettime / timerfd_settime: arg0 is "int ufd" (the timerfd)
- splice: arg0 is "int fd_in" (source fd of an in-kernel transfer)
- tee: arg0 is "int fdin" (source fd of an in-kernel transfer)
Fix: add explicit KindFd overrides for these four sys_enter_* keys to
nameOnlyKindsTable so the enter handler captures arg0, mirroring the
established epoll_wait(epfd) / mq_*(mqdes) / sendfile64(out_fd) /
copy_file_range(fd_in) precedent. splice/tee were surfaced by a systemic
sweep of tracepoint formats for fd-typed arg0 named other than "fd" that
currently classify to null; they are TransferClassified siblings of
sendfile64/copy_file_range and clearly fd-operating. The *at() family
(dfd arg0) is intentionally untouched: it is path-classified, and
timerfd_create remains the KindEventfd fd CREATOR.
Regenerated artifacts (mage generate): the four enter handlers now emit
fd_event capturing ctx->args[0] instead of null_event; exit handlers stay
UNCLASSIFIED. Updated the generated kind maps, the golden result.txt, the
classify_test expectations, and docs/syscall-tracing-plan.md (moved the
four from kind "null" to kind "fd"; families IPC/Network unchanged).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
mq_timedsend(2)/mq_send(3) return 0 on success or -1 on error; the
payload size msg_len is an INPUT argument, never the return value. It was
wrongly listed in retClassifications as WriteClassified, which made
bytesFromRet attribute its 0 return as "bytes written" (the stats engine
WriteClassified path). Remove it so its return stays UNCLASSIFIED,
consistent with its POSIX mq sibling mq_timedreceive (which legitimately
stays ReadClassified because it returns the received byte count). This is
the exact same defect just fixed for SysV msgsnd (5057bd9) and mirrors
the msgrcv/msgsnd asymmetry.
Regenerated tracepoints/docs accordingly and updated the pre-existing
classify unit test and the TestPosixMqBasic integration assertion: the
mq_timedsend send no longer asserts a write byte count (now expects 0),
while mq_timedreceive keeps its received-byte-count assertion. Verified:
mage generate idempotent, mage build OK, internal/generate tests pass.
TestPosixMqBasic skips in this sandbox (mq_open: permission denied) but
compiles with the corrected assertions.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
msgsnd(2) returns 0 on success or -1 on error; the payload size msgsz is
an INPUT argument, never the return value. It was wrongly listed in
retClassifications as WriteClassified, which made the stats engine treat
its 0 return as "bytes written". Remove it so its return stays
UNCLASSIFIED, consistent with its SysV IPC siblings (msgrcv legitimately
stays ReadClassified because it returns a received byte count).
Regenerated tracepoints/docs accordingly. Verified: mage generate
idempotent, mage build OK, internal/generate tests pass, and the
TestSysVMsgBasic integration test (added in task 7i0) still passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
listxattrat(2) (Linux 6.13+) returns the size in bytes of the list of
extended attribute names, exactly like listxattr/llistxattr/flistxattr,
but its exit was classified UNCLASSIFIED, so its read bytes were dropped
from I/O totals. Classify it as ReadClassified and regenerate the BPF
handler (ret_type now READ_CLASSIFIED). This mirrors the getxattrat fix
(task ku, commit c3177bd) and completes xattr-family consistency:
get-family and list-family are READ_CLASSIFIED while set-family and
remove-family stay UNCLASSIFIED (they return 0/-1).
Update the docs ReadClassified list and the retclassify expectation, and
add an ioworkload scenario plus integration test: the workload sets a
user xattr then lists names via the raw listxattrat(2) syscall with
AT_FDCWD, and the test asserts enter_listxattrat captures the file path
and accounts the returned name-list size as read bytes.
Task: r20
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
getxattrat(2) (Linux 6.13+) returns the xattr value size in bytes,
exactly like getxattr/lgetxattr/fgetxattr, but its exit was classified
UNCLASSIFIED, so its read bytes were dropped from I/O totals. Classify
it as ReadClassified and regenerate the BPF handler (ret_type now
READ_CLASSIFIED). Path extraction (args[1], after the dirfd) and the
name-not-captured-as-path behaviour were already correct.
Update the docs ReadClassified list and the retclassify expectation,
and add the first xattr integration coverage: an ioworkload scenario
that sets then getxattrat-reads a user xattr on tmpfs, plus a test that
asserts enter_getxattrat captures the file path (not the xattr name)
and accounts the returned value size as read bytes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
rt_sigreturn(2) restores the pre-signal execution context off the signal
stack frame and resumes the interrupted instruction; it never returns to
the instruction after the syscall. man sigreturn(2) states plainly that
"sigreturn() never returns", and tracing against /sys/kernel/tracing
confirms it: sys_enter_rt_sigreturn fires once per signal-handler return
while sys_exit_rt_sigreturn never fires.
The generator previously emitted a dead handle_sys_exit_rt_sigreturn (it
can never run) and recorded a per-tid syscall_enter_state_map entry on the
enter path that nothing would ever delete (no exit fires), leaking entries
in the bounded map on every signal-handler return.
Add rt_sigreturn to noreturnSyscalls so codegen suppresses the dead exit
handler and routes the enter handler through ior_on_noreturn_syscall_enter
(sampling decision only, no map write), exactly like exit/exit_group. The
enter null_event is still emitted, and the FamilySignals/KindNull
classification is unchanged. Regenerated the C/Go artifacts and the result
baseline accordingly, and generalized the related comments.
Lock-in tests: TestRtSigreturnIsNoreturn asserts rt_sigreturn is noreturn;
TestRtSigSiblingsAreNotNoreturn guards that the returning rt_sig* siblings
are not; TestGenerateExitNoreturnHandlers now also covers rt_sigreturn.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
sendfile64(out_fd, in_fd, offset, count) transfers bytes between two file
descriptors in the kernel and returns the number of bytes written to out_fd.
Its tracepoint fields carry no field literally named "fd", so it fell through
to KindNull and captured no descriptor at all - inconsistent with its sibling
copy_file_range (KindFd) and the read/write/sendto/recvfrom families.
Add an explicit sys_enter_sendfile64 -> KindFd override that captures out_fd
(args[0], the destination the bytes are written to), matching the single-fd
KindFd convention. The return value stays TransferClassified, consistent with
copy_file_range/splice/tee/vmsplice. Family stays Network (sendfile is
historically socket-oriented; copy_file_range=FS is pure file-to-file).
Update docs/syscall-tracing-plan.md (move sendfile64 from null to fd kind),
regenerate C/Go artifacts, fix the phase-A classify assertion, and add
TestClassifySendfile64CapturesOutFd as a lock-in + negative test. The existing
TestRetbytesPhaseA integration test still passes with the runtime change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
clock_nanosleep with the TIMER_ABSTIME flag passes an ABSOLUTE wakeup
time in the request timespec, not a relative duration. The generated BPF
sleep handler computed requested_ns = tv_sec*1e9 + tv_nsec
unconditionally, so absolute sleeps exported a bogus multi-decade
"sleep duration" in CSV/parquet/stream.
generateExtraSleep now carries an optional flags-argument expression per
sleep syscall. For clock_nanosleep the generated handler checks
args[1] & TIMER_ABSTIME (value 1) and only computes the relative
duration when the flag is clear; absolute sleeps keep the existing -1
sentinel (same value used for null/unreadable timespec pointers).
nanosleep is always relative and stays unconditional (no flags arg).
- Regenerated internal/c/generated_tracepoints.c (mage generate idempotent).
- Added codegen tests asserting the TIMER_ABSTIME guard for clock_nanosleep
and its absence for nanosleep.
- Extended the ioworkload sleep scenario to issue an absolute clock_nanosleep
and the sleep parquet integration test to assert it is reported as -1.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
After p10 suppressed the sys_exit_exit/sys_exit_exit_group handlers, the
enter handlers for exit/exit_group still called ior_on_syscall_enter,
which writes a per-tid entry into syscall_enter_state_map. With the exit
handler gone, nothing ever bpf_map_delete_elem'd that entry, so stale
per-tid state accumulated in the bounded (32768) map on hosts churning
many distinct tids and could starve legitimate inserts.
Add ior_on_noreturn_syscall_enter in internal/c/filter.c: it only makes
the sampling decision (ior_should_emit_trace) and deliberately does NOT
record enter-state. The code generator now emits this hook for noreturn
enter handlers (detected via isNoreturnSyscall(syscallName(name))) so the
enter null_event is still emitted while the dead, unreclaimable map write
is skipped. Regenerated generated_tracepoints.c accordingly.
Extend TestGenerateExitNoreturnHandlers with a negative assertion (no
ior_on_syscall_enter for noreturn) and add
TestGenerateReturningSyscallEnterRecordsState as a positive contrast.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
exit and exit_group never return to userspace, so their sys_exit
tracepoints can never fire. The generator previously emitted matching
EXIT_RET_EVENT handlers anyway, producing dead code in the generated BPF
program. classifySyscall now skips exit-handler emission for noreturn
syscalls via isNoreturnSyscall, and the regenerated artifacts drop the
sys_exit_exit / sys_exit_exit_group handlers (enter handlers are kept).
Tests updated to match the new reality:
- TestGenerateExitNoreturnHandlers asserts no exit handler is emitted.
- TestClassifySyscallPairEmitsAllFamilies exempts noreturn syscalls from
the exit-handler-required assertion while staying strict for all others.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
|
close_range was captured as a single-fd fd_event carrying only first, so
the runtime evicted every tracked fd >= first, ignoring the last upper
bound and the flags. Bounded calls wrongly dropped still-open higher fds,
and CLOSE_RANGE_CLOEXEC (which keeps fds open) was treated as a full close.
Reclassify close_range to the two_fd_event kind, mapping fd_a/fd_b/extra to
first/last/flags. The runtime now closes only the inclusive [first, last]
range (a negative last from ~0U means unbounded) and skips eviction when
CLOSE_RANGE_CLOEXEC is set or the syscall fails.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
|
epoll_create(size) was recording size (args[0]) as flags — hardcode to
0 since the syscall has no flags argument. pidfd_open(pid, flags) was
recording pid (args[0]) as flags — use args[1] instead.
Add test fixtures and codegen tests that verify the correct argument
indexes and reject the old wrong ones. Regenerate generated_tracepoints.c.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
|
Generated exit handlers now pass the explicit enter trace ID
(SYS_ENTER_X) to ior_on_syscall_exit instead of relying on the
implicit enter_id == exit_id + 1 arithmetic invariant. filter.c
compares directly against the passed enter ID.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Introduces a Docker-based build path so ior can be compiled on any
Linux host without a native Rocky 9 toolchain setup:
- Dockerfile: Rocky 9 minimal image with Go (version from ARG, default
from go.mod), static libelf/libzstd built from source, libbpfgo at
v0.9.2-libbpf-1.5.1, and mage; CMD runs mage generate + mage all
against the repo root mounted as a volume.
- scripts/build-with-docker.sh: reads GO_VERSION from go.mod, passes it
as --build-arg to docker build, mounts tracefs and BTF into the
container, writes the binary to the repo root.
- Magefile.go: adds BuildDocker target that wraps the script.
- README.md: simplified to the two build paths (Docker + native) with
links to docs/; removed GOTOOLCHAIN=auto throughout.
- docs/build-rocky-linux-9.md: full manual Rocky 9 steps, libbpfgo
toolchain setup/rollback, compile-once-run-everywhere explanation,
and timing semantics.
- docs/tui-reference.md: complete TUI hotkey reference, recording mode
details, and the .ior.zst vs Parquet trade-off table.
- AGENTS.md: removed GOTOOLCHAIN=auto from all build commands.
- internal/c/generated_tracepoints.c: regenerated against the host kernel.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
|
The BPF handler generator emitted struct trace_event_raw_sys_enter/
trace_event_raw_sys_exit (the BTF-blessed aliases). RHEL 9 carries an
rt-tree backport that adds preempt_lazy_count to struct trace_entry,
which widens those aliases by 8 bytes and shifts args/ret. The actual
tracepoint context the kernel hands the program is still
syscall_trace_enter / syscall_trace_exit, where the offsets did not
move. Programs typed against the wider alias read past max_ctx_offset
and the verifier rejects the attach with EACCES.
Switching the generator to emit syscall_trace_enter/exit lines up with
the real context on RHEL 9 (and is identical on every other distro,
since the two structs only diverge there). Same fix bcc shipped in
iovisor/bcc#4920 and inspektor-gadget did in inspektor-gadget#2546.
Field accesses (ctx->args[N], ctx->ret) are unchanged.
Verified end-to-end on Rocky Linux 9.7 stock 5.14.0-611.5.1.el9_7
(no kernel-ml needed) and Fedora 6.19. README rewritten accordingly:
drops the elrepo kernel-ml step and the trailing 'permission denied'
troubleshooting paragraph; adds a historical note explaining why the
old workaround existed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
|
|
|
|