moving the plan

author: Paul Buetow <paul@buetow.org> 2026-03-19 08:46:12 +0200
committer: Paul Buetow <paul@buetow.org> 2026-03-19 08:46:12 +0200
commit: 928791f23b69d11441608fd48b317fe28a7ac89d (patch)
tree: d4d7ecb216f575cf406a9a83bfe0fd998597d9f2 /CLICKHOUSE.md
parent: cd554b0af706b5f62b4e1bfde04091052b4aac61 (diff)
1 files changed, 0 insertions, 1623 deletions
diff --git a/CLICKHOUSE.md b/CLICKHOUSE.md
deleted file mode 100644
index 785d8f4..0000000
--- a/CLICKHOUSE.md
+++ /dev/null
@@ -1,1623 +0,0 @@
-# ClickHouse Integration Plan for IOR
-
-This document outlines the implementation plan for integrating ClickHouse database output into IOR, supporting both real-time streaming and batch file export.
-
-## Overview
-
-| Mode | Format | Use Case | Expected Throughput |
-|------|--------|----------|---------------------|
-| **Streaming** | Native TCP protocol | Real-time ingestion to ClickHouse | 100K-1M events/sec |
-| **File Dump** | Parquet | Batch export for later import | Same (offline processing) |
-
-## Architecture
-
-### Current Data Flow
-```
-BPF → Ring Buffer → eventLoop → event.Pair → [flamegraph workers | console output]
-```
-
-### Proposed Data Flow
-```
-BPF → Ring Buffer → eventLoop → event.Pair → [clickhouse-stream | parquet-writer | flamegraph | console]
-```
-
----
-
-## Part 1: ClickHouse Streaming (Native TCP Protocol)
-
-### 1.1 Dependencies
-
-Add to `go.mod`:
-```go
-require (
-    github.com/ClickHouse/clickhouse-go/v2 v2.23.0
-)
-```
-
-### 1.2 New Package Structure
-
-```
-internal/
-├── clickhouse/
-│   ├── client.go        # Connection management, connection pooling
-│   ├── schema.go        # Table schema definitions and DDL
-│   ├── writer.go        # Batch writer with buffering
-│   ├── config.go        # Configuration (host, port, database, table)
-│   └── client_test.go   # Unit tests
-```
-
-### 1.3 ClickHouse Table Schema
-
-```sql
-CREATE TABLE ior_events (
-    timestamp_ns    UInt64,           -- Event timestamp (nanoseconds)
-    pid             UInt32,           -- Process ID (high cardinality, no LowCardinality)
-    tid             UInt32,           -- Thread ID (high cardinality, no LowCardinality)
-    comm            LowCardinality(String),
-    syscall_name    LowCardinality(String),
-    trace_id        UInt32,
-    event_type      UInt8,            -- ENTER_OPEN_EVENT, EXIT_OPEN_EVENT, etc.
-    
-    -- Result
-    ret_value       Int64,            -- Syscall return value
-    ret_type        UInt32,           -- Return type classification
-    
-    -- File information
-    fd              Int32,
-    filename        String,
-    pathname        String,
-    oldname         String,
-    newname         String,
-    
-    -- Flags and metadata
-    flags           Int32,
-    
-    -- Calculated fields (from event.Pair)
-    duration_ns     UInt64,           -- Duration of syscall
-    duration_to_prev_ns UInt64,       -- Time since previous syscall
-    
-    -- Additional context
-    hostname        LowCardinality(String),
-    collection_id   UUID,             -- Groups events from same collection run
-    ingested_at     DateTime64(3) DEFAULT now64(3),
-
-    -- Secondary indices for high-cardinality fields
-    INDEX idx_pid pid TYPE bloom_filter(0.01) GRANULARITY 4,
-    INDEX idx_tid tid TYPE bloom_filter(0.01) GRANULARITY 4
-)
-ENGINE = MergeTree()
-PARTITION BY toYYYYMMDD(fromUnixTimestamp64Nano(timestamp_ns))
-ORDER BY (timestamp_ns, pid, tid)
-SETTINGS index_granularity = 8192;
-```
-
-**Cardinality Considerations:**
-| Field | Cardinality | Encoding | Reason |
-|-------|-------------|----------|--------|
-| `pid` | High (thousands) | Plain UInt32 | PIDs can range into thousands per server; bloom filter index for point lookups |
-| `tid` | Very High (tens of thousands) | Plain UInt32 | TIDs are numerous in threaded workloads; bloom filter index for point lookups |
-| `comm` | Low (hundreds) | LowCardinality | Limited number of unique process names |
-| `syscall_name` | Very Low (~100) | LowCardinality | Fixed set of syscalls |
-| `hostname` | Very Low | LowCardinality | Usually single host per collection |
-| `filename` | Medium-High | Plain String | Depends on workload; could use token bloom filter |
-
-**Optimization Notes:**
-- **Bloom filter indices** on `pid` and `tid` enable efficient point lookups on these high-cardinality fields without bloating storage
-- `LowCardinality` only for truly low-cardinality fields (`comm`, `syscall_name`, `hostname`)
-- Partitioning by day for efficient time-based queries and TTL
-- Ordering by `(timestamp_ns, pid, tid)` for time-range queries and per-process/thread analysis
-- `collection_id` UUID to group events from the same tracing session
-
-### 1.4 Implementation Details
-
-#### 1.4.1 Configuration (`internal/clickhouse/config.go`)
-
-```go
-package clickhouse
-
-type Config struct {
-    Host         string        // ClickHouse host (default: localhost)
-    Port         int           // ClickHouse port (default: 9000)
-    Database     string        // Database name (default: ior)
-    Table        string        // Table name (default: ior_events)
-    User         string        // Username
-    Password     string        // Password
-    BatchSize    int           // Events per batch (default: 10000)
-    FlushTimeout time.Duration // Max time before flush (default: 1s)
-    MaxOpenConns int           // Connection pool size (default: 4)
-    Async        bool          // Enable async inserts (default: true for high throughput)
-}
-
-func DefaultConfig() Config {
-    return Config{
-        Host:         "localhost",
-        Port:         9000,
-        Database:     "ior",
-        Table:        "ior_events",
-        BatchSize:    10000,
-        FlushTimeout: time.Second,
-        MaxOpenConns: 4,
-        Async:        true,
-    }
-}
-
-func ConfigFromFlags() Config {
-    // Read from command-line flags
-}
-```
-
-#### 1.4.2 Client (`internal/clickhouse/client.go`)
-
-```go
-package clickhouse
-
-import (
-    "context"
-    "github.com/ClickHouse/clickhouse-go/v2"
-    "github.com/ClickHouse/clickhouse-go/v2/lib/driver"
-)
-
-type Client struct {
-    conn   driver.Conn
-    config Config
-}
-
-func NewClient(ctx context.Context, config Config) (*Client, error) {
-    conn, err := clickhouse.Open(&clickhouse.Options{
-        Addr: []string{fmt.Sprintf("%s:%d", config.Host, config.Port)},
-        Auth: clickhouse.Auth{
-            Database: config.Database,
-            Username: config.User,
-            Password: config.Password,
-        },
-        MaxOpenConns:   config.MaxOpenConns,
-        MaxIdleConns:   config.MaxOpenConns,
-        ConnMaxLifetime: time.Hour,
-        DialTimeout:    time.Second * 10,
-        Settings: clickhouse.Settings{
-            "max_execution_time": 60,
-        },
-        Compression: &clickhouse.Compression{
-            Method: clickhouse.CompressionLZ4,
-        },
-        BlockBufferSize: 10,
-    })
-    if err != nil {
-        return nil, err
-    }
-    return &Client{conn: conn, config: config}, nil
-}
-
-func (c *Client) Close() error {
-    return c.conn.Close()
-}
-```
-
-#### 1.4.3 Batch Writer (`internal/clickhouse/writer.go`)
-
-Key design for 100K-1M events/sec:
-- **Double buffering**: While one buffer is being sent, the other is being filled
-- **Sync.Pool for event rows**: Reduce GC pressure
-- **Async inserts**: ClickHouse async mode for lower latency
-- **Connection pooling**: Multiple connections for parallel inserts
-
-```go
-package clickhouse
-
-import (
-    "context"
-    "sync"
-    "time"
-    
-    "ior/internal/event"
-)
-
-type EventRow struct {
-    TimestampNs     uint64
-    Pid             uint32
-    Tid             uint32
-    Comm            string
-    SyscallName     string
-    TraceId         uint32
-    EventType       uint8
-    RetValue        int64
-    RetType         uint32
-    Fd              int32
-    Filename        string
-    Pathname        string
-    Oldname         string
-    Newname         string
-    Flags           int32
-    DurationNs      uint64
-    DurationToPrev  uint64
-    Hostname        string
-    CollectionId    string
-}
-
-type Writer struct {
-    client       *Client
-    config       Config
-    
-    bufferMu     sync.Mutex
-    buffer       []*EventRow
-    bufferSize   int
-    
-    flushTimer   *time.Timer
-    flushCh      chan struct{}
-    
-    ctx          context.Context
-    cancel       context.CancelFunc
-    wg           sync.WaitGroup
-    
-    pool         sync.Pool
-}
-
-func NewWriter(ctx context.Context, client *Client, config Config) *Writer {
-    ctx, cancel := context.WithCancel(ctx)
-    w := &Writer{
-        client:     client,
-        config:     config,
-        buffer:     make([]*EventRow, 0, config.BatchSize),
-        flushCh:    make(chan struct{}, 1),
-        ctx:        ctx,
-        cancel:     cancel,
-        pool: sync.Pool{
-            New: func() interface{} { return &EventRow{} },
-        },
-    }
-    w.flushTimer = time.AfterFunc(config.FlushTimeout, w.triggerFlush)
-    go w.flushLoop()
-    return w
-}
-
-func (w *Writer) Write(ep *event.Pair) error {
-    row := w.pool.Get().(*EventRow)
-    w.populateRow(row, ep)
-    
-    w.bufferMu.Lock()
-    w.buffer = append(w.buffer, row)
-    shouldFlush := len(w.buffer) >= w.config.BatchSize
-    w.bufferMu.Unlock()
-    
-    if shouldFlush {
-        w.triggerFlush()
-    }
-    return nil
-}
-
-func (w *Writer) triggerFlush() {
-    select {
-    case w.flushCh <- struct{}{}:
-    default:
-    }
-}
-
-func (w *Writer) flushLoop() {
-    for {
-        select {
-        case <-w.flushCh:
-            w.flush()
-        case <-w.ctx.Done():
-            w.flush()
-            return
-        }
-    }
-}
-
-func (w *Writer) flush() {
-    w.bufferMu.Lock()
-    if len(w.buffer) == 0 {
-        w.bufferMu.Unlock()
-        return
-    }
-    buffer := w.buffer
-    w.buffer = make([]*EventRow, 0, w.config.BatchSize)
-    w.bufferMu.Unlock()
-    
-    ctx := clickhouse.Context(w.ctx, clickhouse.WithSettings(clickhouse.Settings{
-        "async_insert":          1,
-        "wait_for_async_insert": 0,
-    }))
-    
-    batch, err := w.client.conn.PrepareBatch(ctx, 
-        "INSERT INTO ior_events (timestamp_ns, pid, tid, comm, syscall_name, ...)")
-    if err != nil {
-        return
-    }
-    
-    for _, row := range buffer {
-        batch.Append(
-            row.TimestampNs, row.Pid, row.Tid, row.Comm,
-            row.SyscallName, row.TraceId, row.EventType,
-            row.RetValue, row.RetType, row.Fd, row.Filename,
-            row.Pathname, row.Oldname, row.Newname, row.Flags,
-            row.DurationNs, row.DurationToPrev, row.Hostname,
-            row.CollectionId,
-        )
-        w.pool.Put(row)
-    }
-    
-    batch.Send()
-}
-
-func (w *Writer) Close() {
-    w.cancel()
-    w.wg.Wait()
-    w.flushTimer.Stop()
-}
-```
-
-#### 1.4.4 Schema Management (`internal/clickhouse/schema.go`)
-
-```go
-package clickhouse
-
-func (c *Client) CreateTableIfNotExists(ctx context.Context) error {
-    ddl := `CREATE TABLE IF NOT EXISTS ior_events (...)`
-    return c.conn.Exec(ctx, ddl)
-}
-```
-
-### 1.5 Integration with Event Loop
-
-Modify `internal/eventloop.go`:
-
-```go
-func (e *eventLoop) run(ctx context.Context, rawCh <-chan []byte) {
-    var chWriter *clickhouse.Writer
-    if flags.Get().ClickHouseEnable {
-        chClient, err := clickhouse.NewClient(ctx, clickhouse.ConfigFromFlags())
-        if err != nil {
-            panic(err)
-        }
-        defer chClient.Close()
-        chWriter = clickhouse.NewWriter(ctx, chClient, clickhouse.ConfigFromFlags())
-        defer chWriter.Close()
-    }
-    
-    for ep := range e.events(ctx, rawCh) {
-        switch {
-        case flags.Get().ClickHouseEnable:
-            chWriter.Write(ep)
-            ep.Recycle()
-        case flags.Get().FlamegraphEnable:
-            e.flamegraph.Ch <- ep
-        // ... rest
-        }
-    }
-}
-```
-
-### 1.6 New Command-Line Flags
-
-Add to `internal/flags/flags.go`:
-
-```go
-type Flags struct {
-    // ... existing fields ...
-    
-    // ClickHouse streaming
-    ClickHouseEnable   bool
-    ClickHouseHost     string
-    ClickHousePort     int
-    ClickHouseDatabase string
-    ClickHouseTable    string
-    ClickHouseUser     string
-    ClickHouseBatchSize int
-    ClickHouseAsync    bool
-}
-
-func parse() {
-    // ... existing flags ...
-    
-    flag.BoolVar(&singleton.ClickHouseEnable, "clickhouse", false, 
-        "Enable ClickHouse streaming output")
-    flag.StringVar(&singleton.ClickHouseHost, "ch-host", "localhost", 
-        "ClickHouse host")
-    flag.IntVar(&singleton.ClickHousePort, "ch-port", 9000, 
-        "ClickHouse native port")
-    flag.StringVar(&singleton.ClickHouseDatabase, "ch-db", "ior", 
-        "ClickHouse database")
-    flag.StringVar(&singleton.ClickHouseTable, "ch-table", "ior_events", 
-        "ClickHouse table")
-    flag.StringVar(&singleton.ClickHouseUser, "ch-user", "", 
-        "ClickHouse user")
-    flag.IntVar(&singleton.ClickHouseBatchSize, "ch-batch", 10000, 
-        "ClickHouse batch size")
-}
-```
-
----
-
-## Part 2: Parquet File Export
-
-### 2.1 Dependencies
-
-Add to `go.mod`:
-```go
-require (
-    github.com/parquet-go/parquet-go v0.23.0
-)
-```
-
-### 2.2 New Package Structure
-
-```
-internal/
-├── parquet/
-│   ├── writer.go        # Parquet file writer
-│   ├── schema.go        # Parquet schema definition
-│   └── writer_test.go   # Unit tests
-```
-
-### 2.3 Parquet Schema
-
-```go
-package parquet
-
-import "github.com/parquet-go/parquet-go"
-
-type EventRow struct {
-    TimestampNs     uint64 `parquet:"timestamp_ns"`
-    Pid             uint32 `parquet:"pid"`
-    Tid             uint32 `parquet:"tid"`
-    Comm            string `parquet:"comm,dict"`
-    SyscallName     string `parquet:"syscall_name,dict"`
-    TraceId         uint32 `parquet:"trace_id"`
-    EventType       uint8  `parquet:"event_type"`
-    RetValue        int64  `parquet:"ret_value"`
-    RetType         uint32 `parquet:"ret_type"`
-    Fd              int32  `parquet:"fd"`
-    Filename        string `parquet:"filename"`
-    Pathname        string `parquet:"pathname"`
-    Oldname         string `parquet:"oldname"`
-    Newname         string `parquet:"newname"`
-    Flags           int32  `parquet:"flags"`
-    DurationNs      uint64 `parquet:"duration_ns"`
-    DurationToPrev  uint64 `parquet:"duration_to_prev_ns"`
-    Hostname        string `parquet:"hostname,dict"`
-    CollectionId    string `parquet:"collection_id"`
-}
-```
-
-**Parquet Optimizations:**
-- `dict` encoding for low-cardinality strings (`comm`, `syscall_name`, `hostname`)
-- Zstd compression (same as current GOB files)
-- Row groups of ~128MB for efficient querying
-- Column pruning - ClickHouse only reads needed columns
-
-### 2.4 Implementation
-
-#### 2.4.1 Writer (`internal/parquet/writer.go`)
-
-```go
-package parquet
-
-import (
-    "context"
-    "fmt"
-    "os"
-    "sync"
-    "time"
-    
-    "github.com/parquet-go/parquet-go"
-    "github.com/parquet-go/parquet-go/compress/zstd"
-    "ior/internal/event"
-)
-
-type Writer struct {
-    file       *os.File
-    writer     *parquet.GenericWriter[EventRow]
-    
-    bufferMu   sync.Mutex
-    buffer     []EventRow
-    bufferSize int
-    
-    rowGroupSize int
-    
-    ctx    context.Context
-    cancel context.CancelFunc
-}
-
-type WriterConfig struct {
-    Filename      string
-    BatchSize     int
-    RowGroupSize  int
-    Compression   parquet.Compression
-}
-
-func DefaultWriterConfig() WriterConfig {
-    hostname, _ := os.Hostname()
-    return WriterConfig{
-        Filename:     fmt.Sprintf("%s-%s.parquet", hostname, time.Now().Format("2006-01-02_15:04:05")),
-        BatchSize:    10000,
-        RowGroupSize: 100000,
-        Compression:  parquet.Zstd,
-    }
-}
-
-func NewWriter(ctx context.Context, config WriterConfig) (*Writer, error) {
-    file, err := os.Create(config.Filename)
-    if err != nil {
-        return nil, err
-    }
-    
-    writer := parquet.NewGenericWriter[EventRow](file,
-        parquet.Compression(&zstd.Codec{Level: zstd.DefaultLevel}),
-        parquet.RowGroupSize(config.RowGroupSize),
-    )
-    
-    ctx, cancel := context.WithCancel(ctx)
-    return &Writer{
-        file:         file,
-        writer:       writer,
-        buffer:       make([]EventRow, 0, config.BatchSize),
-        bufferSize:   config.BatchSize,
-        rowGroupSize: config.RowGroupSize,
-        ctx:          ctx,
-        cancel:       cancel,
-    }, nil
-}
-
-func (w *Writer) Write(ep *event.Pair) error {
-    row := EventRow{
-        TimestampNs:    ep.EnterEv.GetTime(),
-        Pid:           ep.EnterEv.GetPid(),
-        Tid:           ep.EnterEv.GetTid(),
-        Comm:          ep.Comm,
-        SyscallName:   ep.EnterEv.GetTraceId().Name(),
-        DurationNs:     ep.Duration,
-        DurationToPrev: ep.DurationToPrev,
-    }
-    
-    w.bufferMu.Lock()
-    w.buffer = append(w.buffer, row)
-    shouldFlush := len(w.buffer) >= w.bufferSize
-    w.bufferMu.Unlock()
-    
-    if shouldFlush {
-        return w.flush()
-    }
-    return nil
-}
-
-func (w *Writer) flush() error {
-    w.bufferMu.Lock()
-    defer w.bufferMu.Unlock()
-    
-    if len(w.buffer) == 0 {
-        return nil
-    }
-    
-    _, err := w.writer.Write(w.buffer)
-    w.buffer = w.buffer[:0]
-    return err
-}
-
-func (w *Writer) Close() error {
-    w.cancel()
-    if err := w.flush(); err != nil {
-        return err
-    }
-    if err := w.writer.Close(); err != nil {
-        return err
-    }
-    return w.file.Close()
-}
-```
-
-### 2.5 ClickHouse Import Command
-
-After generating a Parquet file:
-
-```bash
-# Local file import
-clickhouse-client --query "
-  INSERT INTO ior.ior_events 
-  FROM INFILE 'hostname-2024-01-15_10:30:00.parquet' 
-  FORMAT Parquet"
-
-# Or via HTTP (remote server)
-curl -X POST 'http://clickhouse:8123/?query=INSERT+INTO+ior.ior_events+FORMAT+Parquet' \
-  --data-binary @hostname-2024-01-15_10:30:00.parquet
-```
-
-### 2.6 New Command-Line Flags
-
-Add to `internal/flags/flags.go`:
-
-```go
-// Parquet file output
-ParquetEnable    bool
-ParquetFilename  string
-ParquetBatchSize int
-ParquetRowGroupSize int
-```
-
----
-
-## Part 3: Shared Components
-
-### 3.1 Event-to-Row Converter
-
-Both ClickHouse and Parquet need to convert `event.Pair` to a row format:
-
-```go
-// internal/export/convert.go
-
-package export
-
-import (
-    "ior/internal/event"
-    "ior/internal/types"
-)
-
-type EventRow struct {
-    TimestampNs     uint64
-    Pid             uint32
-    Tid             uint32
-    Comm            string
-    SyscallName     string
-    TraceId         uint32
-    EventType       uint8
-    RetValue        int64
-    RetType         uint32
-    Fd              int32
-    Filename        string
-    Pathname        string
-    Oldname         string
-    Newname         string
-    Flags           int32
-    DurationNs      uint64
-    DurationToPrev  uint64
-    Hostname        string
-    CollectionId    string
-}
-
-func PairToRow(ep *event.Pair, hostname, collectionId string) EventRow {
-    row := EventRow{
-        TimestampNs:    ep.EnterEv.GetTime(),
-        Pid:           ep.EnterEv.GetPid(),
-        Tid:           ep.EnterEv.GetTid(),
-        Comm:          ep.Comm,
-        SyscallName:   ep.EnterEv.GetTraceId().Name(),
-        TraceId:       uint32(ep.EnterEv.GetTraceId()),
-        DurationNs:     ep.Duration,
-        DurationToPrev: ep.DurationToPrev,
-        Hostname:      hostname,
-        CollectionId:  collectionId,
-    }
-    
-    switch enter := ep.EnterEv.(type) {
-    case *types.OpenEvent:
-        row.EventType = types.ENTER_OPEN_EVENT
-        row.Filename = types.StringValue(enter.Filename[:])
-        row.Flags = enter.Flags
-    case *types.FdEvent:
-        row.EventType = types.ENTER_FD_EVENT
-        row.Fd = enter.Fd
-    }
-    
-    if ret, ok := ep.ExitEv.(*types.RetEvent); ok {
-        row.RetValue = ret.Ret
-        row.RetType = ret.RetType
-    }
-    
-    return row
-}
-```
-
-### 3.2 Output Mode Selector
-
-```go
-// internal/output/output.go
-
-package output
-
-type Output interface {
-    Write(ep *event.Pair) error
-    Close() error
-}
-
-func NewOutput(ctx context.Context, flags flags.Flags) (Output, error) {
-    switch {
-    case flags.ClickHouseEnable:
-        return clickhouse.NewWriter(ctx, ...)
-    case flags.ParquetEnable:
-        return parquet.NewWriter(ctx, ...)
-    case flags.FlamegraphEnable:
-        return flamegraph.NewCollector(ctx, ...)
-    default:
-        return &consoleOutput{}, nil
-    }
-}
-```
-
----
-
-## Part 4: Performance Considerations
-
-### 4.1 Throughput Targets: 100K-1M events/sec
-
-| Component | Strategy |
-|-----------|----------|
-| **Memory** | `sync.Pool` for EventRow reuse, avoid allocations in hot path |
-| **Buffering** | Double buffering: fill one buffer while sending another |
-| **Batching** | Batch inserts: 10K-100K rows per batch |
-| **Compression** | LZ4 for streaming (fast), Zstd for files (compact) |
-| **Concurrency** | Multiple writer goroutines with separate connections |
-| **Backpressure** | Drop events if buffer full (configurable), report stats |
-
-### 4.2 Memory Budget
-
-For 1M events/sec with 10KB per event row:
-- Raw: 10GB/sec (too high)
-- With batching and buffering: ~100MB buffer is sufficient
-
-Buffer sizing:
-- 100K events × ~200 bytes per row = ~20MB per buffer
-- Double buffering = ~40MB total
-
-### 4.3 ClickHouse Server-Side Settings
-
-```sql
-SET max_insert_block_size = 1048576;
-SET max_block_size = 65536;
-SET async_insert_max_data_size = 10000000;
-SET async_insert_busy_timeout_ms = 1000;
-```
-
----
-
-## Part 5: Testing Strategy
-
-### 5.1 Unit Tests
-
-```
-internal/clickhouse/
-├── client_test.go      # Mock server tests
-├── writer_test.go      # Buffer management, batch logic
-└── schema_test.go      # DDL generation
-
-internal/parquet/
-├── writer_test.go      # File writing, schema validation
-└── convert_test.go     # Event to row conversion
-```
-
-### 5.2 Integration Tests
-
-```bash
-docker run -d --name clickhouse -p 9000:9000 clickhouse/clickhouse-server
-make test-integration-clickhouse
-make test-integration-parquet
-```
-
-### 5.3 Benchmark Tests
-
-```go
-// internal/bench_test.go
-func BenchmarkClickHouseWriter(b *testing.B) {}
-
-func BenchmarkParquetWriter(b *testing.B) {}
-```
-
----
-
-## Part 6: Implementation Order
-
-### Phase 1: Foundation
-1. Add dependencies to `go.mod`
-2. Create `internal/export/convert.go` - shared row conversion
-3. Create `internal/output/output.go` - output interface
-
-### Phase 2: Parquet Export
-1. Create `internal/parquet/` package
-2. Add parquet flags to `internal/flags/`
-3. Integrate with `internal/eventloop.go`
-4. Add unit tests
-5. Test ClickHouse import
-
-### Phase 3: ClickHouse Streaming
-1. Create `internal/clickhouse/` package
-2. Add ClickHouse flags to `internal/flags/`
-3. Implement double-buffering writer
-4. Integrate with `internal/eventloop.go`
-5. Add unit tests and integration tests
-
-### Phase 4: Polish
-1. Performance benchmarks and optimization
-2. Documentation
-3. Error handling and recovery
-4. Metrics/monitoring integration
-
----
-
-## Part 7: Usage Examples
-
-### Parquet Export
-
-```bash
-# Capture to Parquet file
-ior -parquet -name my_trace -duration 60
-
-# Import to ClickHouse
-clickhouse-client --query "
-  INSERT INTO ior.ior_events 
-  FROM INFILE 'myhost-2024-01-15_10:30:00.parquet' 
-  FORMAT Parquet"
-```
-
-### Real-time Streaming
-
-```bash
-# Stream to ClickHouse
-ior -clickhouse -ch-host ch-server.example.com -ch-db ior -duration 300
-
-# With authentication
-ior -clickhouse -ch-host ch.example.com -ch-user app -ch-password secret
-```
-
-### Query Examples
-
-```sql
--- Top 10 slowest syscalls by average duration
-SELECT 
-    syscall_name,
-    count() as total,
-    avg(duration_ns) as avg_duration,
-    quantile(0.99)(duration_ns) as p99_duration
-FROM ior_events
-WHERE timestamp_ns > now() - INTERVAL 1 HOUR
-GROUP BY syscall_name
-ORDER BY avg_duration DESC
-LIMIT 10;
-
--- Events per process
-SELECT 
-    pid,
-    comm,
-    count() as total_events,
-    sum(duration_ns) / 1e9 as total_duration_sec
-FROM ior_events
-WHERE collection_id = 'uuid-here'
-GROUP BY pid, comm
-ORDER BY total_events DESC;
-
--- Per-thread I/O activity (leverages bloom filter on tid)
-SELECT 
-    pid,
-    tid,
-    comm,
-    count() as total_events,
-    sum(duration_ns) / 1e9 as total_duration_sec,
-    uniqExact(syscall_name) as unique_syscalls
-FROM ior_events
-WHERE collection_id = 'uuid-here'
-GROUP BY pid, tid, comm
-ORDER BY total_events DESC
-LIMIT 50;
-
--- Thread with most I/O latency (bloom filter helps for specific tid lookup)
-SELECT 
-    pid, tid, comm,
-    sum(duration_ns) / 1e6 as total_latency_ms,
-    avg(duration_ns) as avg_latency_ns,
-    count() as event_count
-FROM ior_events
-WHERE tid = 12345  -- bloom filter index used here
-  AND collection_id = 'uuid-here'
-GROUP BY pid, tid, comm;
-
--- Most accessed files
-SELECT 
-    filename,
-    count() as access_count,
-    sum(duration_ns) / 1e9 as total_duration_sec
-FROM ior_events
-WHERE filename != ''
-GROUP BY filename
-ORDER BY access_count DESC
-LIMIT 20;
-```
-
----
-
-## Appendix A: File Sizes Estimation
-
-For 1M events:
-| Format | Size (estimated) |
-|--------|------------------|
-| GOB + zstd (current) | ~50-80 MB |
-| Parquet + zstd | ~40-60 MB |
-| RowBinary | ~60-80 MB |
-| JSON (not recommended) | ~200-300 MB |
-
----
-
-## Appendix C: ClickHouse Optimization Strategies
-
-### C.1 High Cardinality Optimization
-
-**Problem**: Fields like `tid` (thread ID) can have tens of thousands of unique values, making standard indexing inefficient.
-
-| Technique | Description | Best For |
-|-----------|-------------|----------|
-| **Bloom Filter Index** | Probabilistic index for point lookups (~1% storage overhead) | `WHERE tid = 12345` |
-| **Minmax Index** | Stores min/max values per granule | Range queries on numeric fields |
-| **Set Index** | Stores unique values per granule (limited to ~N values) | Low-medium cardinality |
-| **Token Bloom Filter** | Bloom filter on tokens in string | `WHERE hasToken(filename, 'log')` |
-
-```sql
--- Bloom filter for point lookups on high-cardinality fields
-INDEX idx_tid tid TYPE bloom_filter(0.01) GRANULARITY 4,
-INDEX idx_pid pid TYPE bloom_filter(0.01) GRANULARITY 4,
-
--- Minmax for range queries on timestamps (already default, but explicit)
-INDEX idx_duration duration_ns TYPE minmax GRANULARITY 1,
-
--- Token bloom filter for filename substring searches
-INDEX idx_filename_tokens filename TYPE tokenbf_v1(512, 3, 0) GRANULARITY 4
-```
-
-**Bloom Filter Parameters**:
-- `0.01` = 1% false positive rate (lower = more accurate, larger index)
-- `GRANULARITY 4` = index covers 4 granules (8192 × 4 = 32768 rows)
-
-#### C.1.1 Are Bloom Filter Results Inexact?
-
-**No - query results are ALWAYS exact.** Bloom filters only affect performance, not correctness.
-
-```
-How Bloom Filter Indices Work:
-+---------------------------------------------------------------+
-| QUERY: SELECT * FROM ior_events WHERE tid = 12345             |
-+---------------------------------------------------------------+
-                              |
-                              v
-+---------------------------------------------------------------+
-| STEP 1: Check bloom filter index for each granule             |
-|                                                               |
-| Granule 1 (rows 1-8192):     Bloom says "MAYBE" -> READ IT    |
-| Granule 2 (rows 8193-16384): Bloom says "DEFINITELY NOT"      |
-|                              -> SKIP                          |
-| Granule 3 (rows 16385-24576): Bloom says "MAYBE" -> READ IT   |
-| ...                                                           |
-+---------------------------------------------------------------+
-                              |
-                              v
-+---------------------------------------------------------------+
-| STEP 2: Read the "MAYBE" granules and filter EXACTLY          |
-|                                                               |
-| Result: ALL rows where tid = 12345 (no false positives)       |
-+---------------------------------------------------------------+
-```
-
-**False Positive Impact**: The bloom filter might say "MAYBE contains tid=12345" for a granule that doesn't actually have it. This causes ClickHouse to read that granule unnecessarily - wasting I/O but NOT affecting result correctness.
-
-**False Negatives**: Impossible. Bloom filters never produce false negatives. If the value exists, the bloom filter will always say "MAYBE".
-
-**Trade-off**:
-| Bloom Filter Size | False Positive Rate | Storage Overhead | Unnecessary Reads |
-|-------------------|---------------------|------------------|-------------------|
-| Smaller | Higher (~5%) | ~0.5% | More wasted I/O |
-| Larger | Lower (~0.01%) | ~2% | Less wasted I/O |
-
-**Recommendation**: Use `bloom_filter(0.01)` for 1% false positive rate - good balance between index size and read efficiency.
-
-#### C.1.2 Projections vs Materialized Views
-
-Both optimize query performance, but work differently:
-
-| Aspect | Projection | Materialized View |
-|--------|------------|-------------------|
-| **Data storage** | Same table, different physical order | Separate table |
-| **Automatic sync** | Yes - part of the same table | Yes - triggered on INSERT |
-| **Query rewrite** | Automatic - ClickHouse picks best projection | Must query MV explicitly |
-| **Aggregation** | Can pre-aggregate | Often used for aggregation |
-| **Space overhead** | Copies ALL columns (unless aggregated) | Only stores aggregated result |
-| **Best for** | Different sort orders, same columns | Pre-computed aggregations |
-
----
-
-### C.1.3 Projections Explained
-
-**What is a Projection?**
-
-A projection is an alternative physical representation of the SAME data within the SAME table. Think of it as "same data, sorted differently for different queries."
-
-**Example Problem**: Your main table is ordered by `(timestamp_ns, pid, tid)` for time-range queries. But you also need fast queries by thread:
-
-```sql
--- This query is SLOW because tid is last in ORDER BY
-SELECT * FROM ior_events WHERE tid = 12345;
--- Must scan almost all data!
-```
-
-**Solution - Add a Projection**:
-
-```sql
-CREATE TABLE ior_events (
-    timestamp_ns UInt64,
-    pid UInt32,
-    tid UInt32,
-    comm String,
-    syscall_name String,
-    duration_ns UInt64,
-    -- ... other columns ...
-)
-ENGINE = MergeTree()
-ORDER BY (timestamp_ns, pid, tid)  -- Primary order: time-first
-PARTITION BY toYYYYMMDD(fromUnixTimestamp64Nano(timestamp_ns));
-
--- Add projection for thread-centric queries
-ALTER TABLE ior_events ADD PROJECTION proj_by_thread (
-    SELECT * ORDER BY (pid, tid, timestamp_ns)
-);
-
--- Materialize the projection (build it for existing data)
-ALTER TABLE ior_events MATERIALIZE PROJECTION proj_by_thread;
-```
-
-**How it works**:
-
-```
-INSERT INTO ior_events VALUES (1000, 1, 100, 'app', 'read', 50);
-INSERT INTO ior_events VALUES (2000, 1, 100, 'app', 'write', 30);
-INSERT INTO ior_events VALUES (1500, 2, 200, 'db', 'read', 40);
-
-Main storage (ORDER BY timestamp_ns, pid, tid):
-+--------------+-----+-----+------+-------------+-------------+
-| timestamp_ns | pid | tid | comm | syscall_name| duration_ns |
-+--------------+-----+-----+------+-------------+-------------+
-| 1000         | 1   | 100 | app  | read        | 50          |
-| 1500         | 2   | 200 | db   | read        | 40          |
-| 2000         | 1   | 100 | app  | write       | 30          |
-+--------------+-----+-----+------+-------------+-------------+
-
-Projection proj_by_thread (ORDER BY pid, tid, timestamp_ns):
-+--------------+-----+-----+------+-------------+-------------+
-| timestamp_ns | pid | tid | comm | syscall_name| duration_ns |
-+--------------+-----+-----+------+-------------+-------------+
-| 1000         | 1   | 100 | app  | read        | 50          |
-| 2000         | 1   | 100 | app  | write       | 30          |
-| 1500         | 2   | 200 | db   | read        | 40          |
-+--------------+-----+-----+------+-------------+-------------+
-```
-
-**Query optimization**:
-
-```sql
--- Query 1: Time range - uses main storage
-SELECT * FROM ior_events 
-WHERE timestamp_ns BETWEEN 1000 AND 1800;
--- Reads rows in order: (1000,1,100), (1500,2,200) - efficient!
-
--- Query 2: Thread lookup - uses projection AUTOMATICALLY
-SELECT * FROM ior_events WHERE tid = 100;
--- ClickHouse sees projection has better ORDER BY for this query
--- Uses proj_by_thread: rows (1,100,1000), (1,100,2000) are adjacent!
-```
-
-**Aggregating Projection** (smaller storage):
-
-```sql
--- Pre-aggregated projection - stores only aggregated data
-ALTER TABLE ior_events ADD PROJECTION proj_hourly_stats (
-    SELECT 
-        toStartOfHour(fromUnixTimestamp64Nano(timestamp_ns)) as hour,
-        syscall_name,
-        count() as event_count,
-        sum(duration_ns) as total_duration,
-        avg(duration_ns) as avg_duration
-    GROUP BY hour, syscall_name
-);
-
-ALTER TABLE ior_events MATERIALIZE PROJECTION proj_hourly_stats;
-```
-
-**Storage comparison**:
-```
-Main table:        1,000,000,000 rows × ~200 bytes = ~200 GB
-proj_by_thread:    1,000,000,000 rows × ~200 bytes = ~200 GB (full copy)
-proj_hourly_stats: ~100,000 rows × ~50 bytes = ~5 MB (aggregated!)
-```
-
-**When to use projections**:
-- Different access patterns on the same raw data
-- Query patterns known upfront
-- Can afford storage overhead (projections copy data)
-
----
-
-### C.1.4 Materialized Views Explained
-
-**What is a Materialized View?**
-
-A materialized view is a SEPARATE table that is automatically populated and maintained when data is inserted into the source table.
-
-**Example Problem**: You frequently query per-thread statistics:
-
-```sql
--- This is slow - scans billions of rows every time
-SELECT pid, tid, comm, count(), sum(duration_ns)
-FROM ior_events
-WHERE timestamp_ns > now() - INTERVAL 1 HOUR
-GROUP BY pid, tid, comm;
-```
-
-**Solution - Create a Materialized View**:
-
-```sql
--- Step 1: Create the target table (stores the aggregated data)
-CREATE TABLE ior_thread_stats (
-    day Date,
-    hour DateTime,
-    pid UInt32,
-    tid UInt32,
-    comm LowCardinality(String),
-    syscall_name LowCardinality(String),
-    event_count UInt64,
-    total_duration_ns UInt64
-)
-ENGINE = SummingMergeTree()  -- Automatically sums duplicates
-PARTITION BY day
-ORDER BY (day, hour, pid, tid, syscall_name);
-
--- Step 2: Create the materialized view (transforms INSERTs)
-CREATE MATERIALIZED VIEW ior_thread_stats_mv TO ior_thread_stats AS
-SELECT
-    toDate(fromUnixTimestamp64Nano(timestamp_ns)) as day,
-    toStartOfHour(fromUnixTimestamp64Nano(timestamp_ns)) as hour,
-    pid,
-    tid,
-    comm,
-    syscall_name,
-    count() as event_count,
-    sum(duration_ns) as total_duration_ns
-FROM ior_events
-GROUP BY day, hour, pid, tid, comm, syscall_name;
-```
-
-**How it works**:
-
-```
-INSERT INTO ior_events (timestamp_ns, pid, tid, comm, syscall_name, duration_ns, ...)
-VALUES (1704067200000000000, 1, 100, 'app', 'read', 50, ...);
-
-                              |
-                              v
-+---------------------------------------------------------------+
-| ior_events (main table)                                       |
-| Receives the full row                                         |
-+---------------------------------------------------------------+
-                              |
-                              | TRIGGER: materialized view
-                              v
-+---------------------------------------------------------------+
-| ior_thread_stats_mv transformation                            |
-| Groups and aggregates:                                        |
-|   day=2024-01-01, hour=2024-01-01 00:00:00                    |
-|   pid=1, tid=100, comm='app', syscall_name='read'             |
-|   event_count=1, total_duration_ns=50                         |
-+---------------------------------------------------------------+
-                              |
-                              v
-+---------------------------------------------------------------+
-| ior_thread_stats (target table)                               |
-| Stores only the aggregated row                                |
-+---------------------------------------------------------------+
-```
-
-**Query the materialized view**:
-
-```sql
--- Query the aggregated table (MUCH faster!)
-SELECT 
-    pid, tid, comm,
-    sum(event_count) as total_events,
-    sum(total_duration_ns) / 1e9 as total_seconds
-FROM ior_thread_stats
-WHERE day = today() AND hour > now() - INTERVAL 6 HOUR
-GROUP BY pid, tid, comm
-ORDER BY total_events DESC
-LIMIT 10;
-```
-
-**Real-time aggregation example**:
-
-```sql
--- Per-minute syscall latency tracking
-CREATE TABLE ior_latency_minute (
-    minute DateTime,
-    syscall_name LowCardinality(String),
-    p50_ns UInt64,
-    p90_ns UInt64,
-    p99_ns UInt64,
-    count UInt64
-)
-ENGINE = SummingMergeTree()
-ORDER BY (minute, syscall_name);
-
--- Materialized view with quantile aggregation
-CREATE MATERIALIZED VIEW ior_latency_minute_mv TO ior_latency_minute AS
-SELECT
-    toStartOfMinute(fromUnixTimestamp64Nano(timestamp_ns)) as minute,
-    syscall_name,
-    quantile(0.50)(duration_ns) as p50_ns,
-    quantile(0.90)(duration_ns) as p90_ns,
-    quantile(0.99)(duration_ns) as p99_ns,
-    count() as count
-FROM ior_events
-GROUP BY minute, syscall_name;
-
--- Query for real-time monitoring
-SELECT minute, syscall_name, p99_ns / 1e6 as p99_ms, count
-FROM ior_latency_minute
-WHERE minute > now() - INTERVAL 1 HOUR
-ORDER BY minute, p99_ns DESC;
-```
-
-**When to use materialized views**:
-- Pre-computed aggregations (counts, sums, quantiles)
-- Different granularity (hourly, daily summaries)
-- Dashboards and real-time monitoring
-- When you can afford slight delay (async update)
-
-**Important considerations**:
-```sql
--- MVs are populated only for NEW data, not historical
--- To backfill existing data:
-INSERT INTO ior_thread_stats
-SELECT ... FROM ior_events WHERE ... GROUP BY ...;
-
--- MVs can be chained (MV feeding another MV)
--- MVs can be suspended/resumed:
-SYSTEM STOP MERGES ior_thread_stats;
-SYSTEM START MERGES ior_thread_stats;
-```
-
----
-
-### C.1.5 Comparison Summary
-
-```
-PROJECTION: Same table, different order
-+--------------------------------------------------------------+
-| ior_events (main table)                                      |
-| ORDER BY (timestamp_ns, pid, tid)                            |
-| +-- Part 1: [rows sorted by time...]                         |
-| +-- Part 2: [rows sorted by time...]                         |
-+--------------------------------------------------------------+
-        |
-        +-- PROJECTION proj_by_thread
-            +--------------------------------------------------+
-            | Same data, ORDER BY (pid, tid, timestamp_ns)    |
-            | +-- Part 1: [rows sorted by thread...]           |
-            | +-- Part 2: [rows sorted by thread...]           |
-            +--------------------------------------------------+
-
-MATERIALIZED VIEW: Separate table, transformed data
-+--------------------------------------------------------------+
-| ior_events (source table)                                    |
-| 1,000,000,000 rows x 200 bytes = 200 GB                     |
-+--------------------------------------------------------------+
-        |
-        | INSERT triggers transformation
-        v
-+--------------------------------------------------------------+
-| ior_thread_stats (target table)                              |
-| 100,000 rows x 50 bytes = 5 MB                               |
-| Aggregated: count, sum, avg per thread per hour              |
-+--------------------------------------------------------------+
-```
-
-**Choose Projection when**:
-- Need different sort orders for same raw data
-- Query patterns vary (time-range vs thread-lookup)
-- Storage overhead is acceptable
-
-**Choose Materialized View when**:
-- Need pre-aggregated results
-- Query same aggregations repeatedly
-- Want to reduce data volume significantly
-- Building dashboards/monitoring
-
-**Avoid for High Cardinality**:
-- `LowCardinality()` - only for fields with <10k unique values
-- First position in `ORDER BY` - kills compression
-
-### C.2 Large Data Volume Optimization
-
-**Problem**: Billions of rows, terabytes of data need efficient storage and query.
-
-| Technique | Description | Impact |
-|-----------|-------------|--------|
-| **Partitioning** | Split data by time (day/month) | Faster deletes, pruning, TTL |
-| **Compression Codecs** | ZSTD, LZ4, Delta, Gorilla | 5-10x storage reduction |
-| **TTL** | Automatic data expiration | Storage management |
-| **Projections** | Pre-computed alternative ORDER BY | Multiple query patterns |
-| **Materialized Views** | Pre-aggregations | Faster analytics |
-
-```sql
-CREATE TABLE ior_events (
-    -- ... columns ...
-    
-    -- Compression codecs per column type
-    timestamp_ns    UInt64 CODEC(Delta, ZSTD(3)),
-    pid             UInt32 CODEC(ZSTD(3)),
-    tid             UInt32 CODEC(ZSTD(3)),
-    comm            LowCardinality(String) CODEC(ZSTD(3)),
-    syscall_name    LowCardinality(String) CODEC(ZSTD(3)),
-    duration_ns     UInt64 CODEC(Delta, ZSTD(3)),
-    filename        String CODEC(ZSTD(3)),
-    
-    -- Indices
-    INDEX idx_tid tid TYPE bloom_filter(0.01) GRANULARITY 4,
-    INDEX idx_pid pid TYPE bloom_filter(0.01) GRANULARITY 4
-)
-ENGINE = MergeTree()
-PARTITION BY toYYYYMMDD(fromUnixTimestamp64Nano(timestamp_ns))
-ORDER BY (timestamp_ns, pid, tid)
-TTL timestamp_ns + INTERVAL 30 DAY DELETE  -- Auto-delete after 30 days
-SETTINGS 
-    index_granularity = 8192,
-    min_bytes_for_wide_part = '10M';  -- Compact parts until 10MB
-```
-
-**Compression Codec Selection**:
-
-| Column Type | Recommended Codec | Reason |
-|-------------|-------------------|--------|
-| Timestamps | `Delta(8), ZSTD(3)` | Delta encoding + compression |
-| IDs (pid, tid) | `ZSTD(3)` | General compression |
-| Low-cardinality strings | `LowCardinality, ZSTD(3)` | Dictionary + compression |
-| Durations | `Delta, ZSTD(3)` | Sequential values compress well |
-| Filenames | `ZSTD(3)` or `ZSTD(1)` | High compression if CPU-bound |
-| Flags/enums | `ZSTD(3)` | Small value range |
-
-### C.3 Combined: High Cardinality + Large Data Volume
-
-**The Challenge**: Billions of rows with high-cardinality fields (tid, filename) require both efficient storage AND fast point lookups.
-
-**Strategy 1: Optimal ORDER BY**
-
-```sql
--- Rule: Low cardinality first, high cardinality last
--- BAD: ORDER BY (tid, timestamp_ns)  -- Kills compression
--- GOOD: ORDER BY (timestamp_ns, pid, tid)  -- Time-based locality
-
-ORDER BY (timestamp_ns, pid, tid)
-```
-
-This enables:
-- Time-range queries: `WHERE timestamp_ns BETWEEN x AND y` (primary key efficiency)
-- Per-process queries: `WHERE timestamp_ns BETWEEN x AND y AND pid = 123`
-- Per-thread queries: `WHERE timestamp_ns BETWEEN x AND y AND pid = 123 AND tid = 456`
-
-**Strategy 2: Projections for Different Access Patterns**
-
-```sql
-CREATE TABLE ior_events (
-    -- ... columns ...
-)
-ENGINE = MergeTree()
-PARTITION BY toYYYYMMDD(fromUnixTimestamp64Nano(timestamp_ns))
-ORDER BY (timestamp_ns, pid, tid)
-
--- Projection 1: Thread-centric queries (fast tid lookup)
-PROJECTION proj_by_thread
-(
-    SELECT *
-    ORDER BY (pid, tid, timestamp_ns)
-)
-
--- Projection 2: Syscall analytics (aggregation-heavy)
-PROJECTION proj_by_syscall
-(
-    SELECT 
-        syscall_name,
-        toStartOfHour(fromUnixTimestamp64Nano(timestamp_ns)) as hour,
-        pid,
-        count() as event_count,
-        sum(duration_ns) as total_duration,
-        avg(duration_ns) as avg_duration,
-        quantile(0.99)(duration_ns) as p99_duration
-    GROUP BY syscall_name, hour, pid
-)
-;
-```
-
-**Strategy 3: Materialized Views for Pre-Aggregation**
-
-```sql
--- Real-time per-thread stats (much smaller table)
-CREATE MATERIALIZED VIEW ior_thread_stats_mv
-ENGINE = SummingMergeTree()
-PARTITION BY toYYYYMM(day)
-ORDER BY (day, pid, tid, syscall_name)
-AS SELECT
-    toDate(fromUnixTimestamp64Nano(timestamp_ns)) as day,
-    pid,
-    tid,
-    comm,
-    syscall_name,
-    count() as event_count,
-    sum(duration_ns) as total_duration_ns
-FROM ior_events
-GROUP BY day, pid, tid, comm, syscall_name;
-
--- Query the materialized view instead (100-1000x faster)
-SELECT pid, tid, sum(event_count), sum(total_duration_ns)/1e9 as sec
-FROM ior_thread_stats_mv
-WHERE day = today() AND tid = 12345
-GROUP BY pid, tid;
-```
-
-**Strategy 4: Sampling for Exploratory Queries**
-
-```sql
--- Sample 1% of data for quick exploration
-SELECT syscall_name, count() as cnt, avg(duration_ns) as avg_dur
-FROM ior_events
-SAMPLE 0.01  -- Only scan 1% of rows
-WHERE timestamp_ns > now() - INTERVAL 1 DAY
-GROUP BY syscall_name
-ORDER BY cnt DESC;
-```
-
-### C.4 Complete Optimized Schema
-
-```sql
-CREATE TABLE ior_events (
-    -- Core fields
-    timestamp_ns    UInt64 CODEC(Delta(8), ZSTD(3)),
-    pid             UInt32 CODEC(ZSTD(3)),
-    tid             UInt32 CODEC(ZSTD(3)),
-    comm            LowCardinality(String) CODEC(ZSTD(3)),
-    syscall_name    LowCardinality(String) CODEC(ZSTD(3)),
-    trace_id        UInt16 CODEC(ZSTD(3)),
-    event_type      UInt8 CODEC(T64, ZSTD(3)),
-    
-    -- Result
-    ret_value       Int64 CODEC(ZSTD(3)),
-    ret_type        UInt8 CODEC(T64, ZSTD(3)),
-    
-    -- File information
-    fd              Int32 CODEC(ZSTD(3)),
-    filename        String CODEC(ZSTD(3)),
-    pathname        String CODEC(ZSTD(3)),
-    oldname         String CODEC(ZSTD(3)),
-    newname         String CODEC(ZSTD(3)),
-    
-    -- Flags
-    flags           Int32 CODEC(ZSTD(3)),
-    
-    -- Calculated
-    duration_ns     UInt64 CODEC(Delta, ZSTD(3)),
-    duration_to_prev_ns UInt64 CODEC(Delta, ZSTD(3)),
-    
-    -- Context
-    hostname        LowCardinality(String) CODEC(ZSTD(3)),
-    collection_id   UUID CODEC(ZSTD(3)),
-    ingested_at     DateTime64(3) DEFAULT now64(3) CODEC(Delta, ZSTD(3)),
-
-    -- Data skipping indices
-    INDEX idx_pid pid TYPE bloom_filter(0.01) GRANULARITY 4,
-    INDEX idx_tid tid TYPE bloom_filter(0.01) GRANULARITY 4,
-    INDEX idx_duration duration_ns TYPE minmax GRANULARITY 2,
-    INDEX idx_syscall syscall_name TYPE set(100) GRANULARITY 4,
-    INDEX idx_filename_token filename TYPE tokenbf_v1(256, 2, 0) GRANULARITY 4
-)
-ENGINE = MergeTree()
-PARTITION BY toYYYYMMDD(fromUnixTimestamp64Nano(timestamp_ns))
-ORDER BY (timestamp_ns, pid, tid)
-TTL timestamp_ns + INTERVAL 90 DAY DELETE
-SETTINGS 
-    index_granularity = 8192,
-    min_bytes_for_wide_part = '10M',
-    min_rows_for_wide_part = 100000;
-
--- Projections for different access patterns
-ALTER TABLE ior_events ADD PROJECTION proj_by_pid_tid (
-    SELECT * ORDER BY (pid, tid, timestamp_ns)
-);
-
-ALTER TABLE ior_events ADD PROJECTION proj_by_syscall_hour (
-    SELECT 
-        syscall_name,
-        toStartOfHour(fromUnixTimestamp64Nano(timestamp_ns)) as hour,
-        pid,
-        count() as cnt,
-        sum(duration_ns) as total_dur,
-        avg(duration_ns) as avg_dur
-    GROUP BY syscall_name, hour, pid
-);
-
-ALTER TABLE ior_events MATERIALIZE PROJECTION proj_by_pid_tid;
-ALTER TABLE ior_events MATERIALIZE PROJECTION proj_by_syscall_hour;
-```
-
-### C.5 Query Optimization Tips
-
-```sql
--- 1. Always include time range (partition pruning)
--- GOOD
-SELECT * FROM ior_events 
-WHERE timestamp_ns > now() - INTERVAL 1 HOUR AND tid = 12345;
-
--- BAD (scans all partitions)
-SELECT * FROM ior_events WHERE tid = 12345;
-
--- 2. Use PREWHERE for filter pushdown on large columns
-SELECT count() 
-FROM ior_events
-PREWHERE syscall_name = 'open'  -- Filter before reading other columns
-WHERE timestamp_ns > now() - INTERVAL 1 HOUR;
-
--- 3. Leverage projections explicitly
-SET allow_experimental_projection_optimization = 1;
-SELECT syscall_name, sum(cnt) 
-FROM ior_events 
-WHERE timestamp_ns > now() - INTERVAL 1 DAY
-GROUP BY syscall_name;  -- Will use proj_by_syscall_hour
-
--- 4. Use materialized views for frequent aggregations
-SELECT * FROM ior_thread_stats_mv WHERE day = today();
-
--- 5. Parallel replica reads for large scans
-SET max_parallel_replicas = 3;
-SELECT count() FROM ior_events WHERE timestamp_ns > now() - INTERVAL 7 DAY;
-```
-
-### C.6 Storage Estimation
-
-For 1 billion events (1 day at ~11.5K events/sec):
-
-| Metric | Estimate |
-|--------|----------|
-| Raw row size | ~200 bytes |
-| Uncompressed | ~200 GB |
-| With ZSTD compression | ~30-50 GB |
-| With projections | +20-30% additional |
-| Bloom filter indices | +1-2% |
-| Total storage per day | ~40-65 GB |
-
-With 90-day TTL: ~3.6-6 TB total storage.
-
-### C.7 Server Configuration
-
-```xml
-<!-- /etc/clickhouse-server/config.d/ior.xml -->
-<clickhouse>
-    <!-- For high-cardinality + large data -->
-    <merge_tree>
-        <index_granularity>8192</index_granularity>
-        <min_bytes_for_wide_part>10485760</min_bytes_for_wide_part>
-        <min_rows_for_wide_part>100000</min_rows_for_wide_part>
-        <max_suspicious_broken_parts>5</max_suspicious_broken_parts>
-    </merge_tree>
-    
-    <!-- Memory for queries -->
-    <max_memory_usage>16000000000</max_memory_usage>  <!-- 16GB -->
-    <max_memory_usage_for_all_queries>20000000000</max_memory_usage_for_all_queries>
-    
-    <!-- Insert performance -->
-    <max_insert_block_size>1048576</max_insert_block_size>
-    <min_insert_block_size_rows>100000</min_insert_block_size_rows>
-    
-    <!-- Background merges -->
-    <background_pool_size>16</background_pool_size>
-    <background_merges_mutations_concurrency_ratio>4</background_merges_mutations_concurrency_ratio>
-</clickhouse>
-```
-
----
-
-## Appendix B: Alternative Approaches Considered
-
-| Approach | Pros | Cons | Verdict |
-|----------|------|------|---------|
-| Arrow IPC | Standard, fast | Larger files | Good for streaming, not file storage |
-| Protobuf | Compact, schema evolution | Requires CH parsing | Overkill |
-| CSV | Simple | No compression, no types | Not suitable |
-| Native format | Most efficient for CH | Not portable | Consider for streaming |
-\ No newline at end of file
author	Paul Buetow <paul@buetow.org>	2026-03-19 08:46:12 +0200
committer	Paul Buetow <paul@buetow.org>	2026-03-19 08:46:12 +0200
commit	928791f23b69d11441608fd48b317fe28a7ac89d (patch)
tree	d4d7ecb216f575cf406a9a83bfe0fd998597d9f2 /CLICKHOUSE.md
parent	cd554b0af706b5f62b4e1bfde04091052b4aac61 (diff)