# DTail Benchmarks

This directory contains comprehensive benchmarks for the DTail toolset (dcat, dgrep, dmap).

## Overview

The benchmarking framework tests performance across:
- Different file sizes (10MB, 100MB, 1GB)
- Various compression formats (none, gzip, zstd)
- Different query patterns and complexities
- Server mode vs serverless operation

## Prerequisites

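Before running benchmarks, build all DTail binaries from the repository root (the parent of this directory). The commands in the rest of this README use paths relative to the repository root: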

```bash
cd ..
make build
```

## Running Benchmarks

### Quick Benchmarks (Small Files Only)
```bash
go test -bench=BenchmarkQuick ./benchmarks
```

### All Benchmarks
```bash
go test -bench=. ./benchmarks
```

### Creating Baseline Snapshots
```bash
# Create a baseline before making changes (will prompt for name)
make benchmark-baseline

# Create a quick baseline (small files only, will prompt for name)
make benchmark-baseline-quick

# Create a baseline with a descriptive tag (no prompt)
./dtail-tools benchmark -mode baseline -tag "before-optimization"

# Create a comprehensive baseline (3x iterations)
./dtail-tools benchmark -mode baseline -iterations 3x -tag "v1.0-release"
```

### Comparing Performance
```bash
# Compare with a specific baseline using make
make benchmark-compare BASELINE=benchmarks/baselines/baseline_20240125_143022.txt

# Use the benchmark script for more options
./dtail-tools benchmark -mode compare -baseline benchmarks/baselines/baseline_20240125_143022.txt

# List available baselines
./dtail-tools benchmark -mode list
```

### Specific Tool Benchmarks
```bash
# DCat benchmarks only
go test -bench=BenchmarkDCat ./benchmarks

# DGrep benchmarks only
go test -bench=BenchmarkDGrep ./benchmarks

# DMap benchmarks only
go test -bench=BenchmarkDMap ./benchmarks
```

### With Memory Allocation Statistics
```bash
go test -bench=. -benchmem ./benchmarks
```

### Custom Configuration
```bash
# Run with specific file sizes
DTAIL_BENCH_SIZES=small,medium go test -bench=. ./benchmarks

# Keep temporary files for inspection
DTAIL_BENCH_KEEP_FILES=true go test -bench=. ./benchmarks

# Set custom timeout
DTAIL_BENCH_TIMEOUT=30m go test -bench=. ./benchmarks
```

## Benchmark Categories

### DCat Benchmarks
- **Simple**: Sequential file reading
- **Multiple Files**: Reading 10-100 files concurrently
- **Compressed**: Performance with gzip/zstd compression
- **Server Mode**: Client-server performance comparison

### DGrep Benchmarks
- **Simple Pattern**: Basic string matching with varying hit rates
- **Regex Pattern**: Complex regex performance
- **Context Lines**: Impact of --before/--after flags
- **Inverted**: Performance of inverted matching with --invert
- **Compressed**: Grep on compressed files

### DMap Benchmarks
- **Simple Aggregation**: Basic count, sum, avg operations
- **Group By Cardinality**: Performance with different group sizes
- **Complex Queries**: WHERE clauses and multiple conditions
- **Time Intervals**: Time-based grouping performance
- **Custom Functions**: Performance of maskdigits, md5sum, etc.

## Output

Benchmark results are saved in multiple formats:
- `benchmark_results/results_TIMESTAMP.json` - Machine-readable JSON
- `benchmark_results/results_TIMESTAMP.csv` - Spreadsheet-compatible CSV
- `benchmark_results/results_TIMESTAMP.md` - Human-readable Markdown report
- `benchmark_results/latest.json` - Most recent results for easy access

## Interpreting Results

Key metrics:
- **MB/sec**: Throughput in megabytes per second
- **lines/sec**: Lines processed per second
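- **compression_ratio**: Compression ratio, reported for compressed-file benchmarks
- **matched_lines**: Number of matched lines, reported for grep benchmarks
- **approx_groups**: Approximate number of groups, reported for MapReduce group-by operations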
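
As a worked illustration of how the two rate metrics relate to raw measurements, the sketch below derives them from a byte count, a line count, and an elapsed time. The `throughput` function is hypothetical and not part of DTail, and it assumes 1 MB = 1,000,000 bytes; the benchmark suite may define the unit differently.

```bash
# throughput: hypothetical helper, not part of DTail.
# Arguments: bytes processed, lines processed, elapsed seconds.
# Assumption: 1 MB = 1,000,000 bytes; DTail's own reporting may differ.
throughput() {
  awk -v bytes="$1" -v lines="$2" -v secs="$3" 'BEGIN {
    printf "%.2f MB/sec, %.0f lines/sec\n", bytes / 1e6 / secs, lines / secs
  }'
}

# Usage: 100 MB and one million lines processed in 2 seconds.
#   throughput 100000000 1000000 2
```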

## Performance Tuning

For accurate benchmarks:
1. Run on isolated hardware
2. Disable CPU frequency scaling
3. Close unnecessary applications
4. Run multiple times and average results
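
For step 4, `go test` can repeat each benchmark with its `-count` flag. The sketch below averages the `ns/op` column per benchmark name from standard Go benchmark output. The `avg_ns_per_op` function is a hypothetical helper, not part of DTail; a tool such as `benchstat` gives a more rigorous statistical summary.

```bash
# avg_ns_per_op: hypothetical helper, not part of DTail.
# Reads standard `go test -bench` output on stdin and prints the mean ns/op
# for each benchmark name across repeated runs.
avg_ns_per_op() {
  awk '
    # Result lines look like: BenchmarkName-8  100  12345 ns/op ...
    /^Benchmark/ {
      for (i = 3; i <= NF; i++) {
        if ($i == "ns/op") {
          sum[$1] += $(i - 1)
          runs[$1]++
        }
      }
    }
    END {
      for (name in sum) {
        printf "%s %.1f ns/op (%d runs)\n", name, sum[name] / runs[name], runs[name]
      }
    }
  ' | sort
}

# Usage (from the repository root):
#   go test -bench=BenchmarkQuick -count=5 ./benchmarks | avg_ns_per_op
```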

## Continuous Integration

The benchmarks can be integrated into CI/CD pipelines:

```yaml
# Example GitHub Actions workflow
- name: Run Benchmarks
  run: |
    make build
    go test -bench=BenchmarkQuick ./benchmarks
```

## Troubleshooting

### "Command not found" errors
Ensure DTail binaries are built: `make build`

### Disk space issues
Benchmarks create large temporary files. Ensure more than 2 GB of free disk space.
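
A pre-flight check along these lines can catch the problem before a long run starts. This is a minimal sketch: `free_kb` and `has_bench_space` are hypothetical names, and the directory to check is an assumption, since it depends on where temporary files are written on your system.

```bash
# free_kb: prints the free space, in KiB, of the filesystem holding a path.
# Uses POSIX `df -Pk`, whose fourth column is the available 1024-byte blocks.
free_kb() {
  df -Pk "${1:-.}" | awk 'NR == 2 { print $4 }'
}

# has_bench_space: hypothetical pre-flight check, not part of DTail.
# Succeeds when at least min_kb (default 2 GiB) is free at the given path.
has_bench_space() {
  local path="${1:-.}"
  local min_kb="${2:-2097152}"   # 2 * 1024 * 1024 KiB = 2 GiB
  [ "$(free_kb "$path")" -ge "$min_kb" ]
}

# Usage:
#   has_bench_space /tmp || echo "less than 2 GiB free in /tmp" >&2
```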

### Timeout errors
Increase timeout: `DTAIL_BENCH_TIMEOUT=60m go test -bench=. ./benchmarks`

## Baseline Management

The benchmarking framework includes tools for creating and comparing performance baselines:

### Creating Baselines
Baselines capture the complete benchmark output including:
- Git commit hash
- Timestamp
- All benchmark results with timing and memory allocation data
- Descriptive names for easy identification

The system will prompt for a meaningful baseline name to ensure proper documentation:

```bash
# Simple baseline (prompts for name)
make benchmark-baseline
# Prompt: Enter a descriptive name for this baseline: before-cache-optimization

# Quick baseline for rapid testing (prompts for name)
make benchmark-baseline-quick
# Prompt: Enter a descriptive name for this baseline: initial-performance-check

# Tagged baseline with description (no prompt)
./dtail-tools benchmark -mode baseline -tag "before-refactoring"

# Full baseline with multiple iterations
./dtail-tools benchmark -mode baseline -iterations 3x -memory -tag "release-v2.0"
```

Baseline files are named with the pattern:
`baseline_YYYYMMDD_HHMMSS_descriptive-name.txt`
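
To make the pattern concrete, the sketch below builds a name of that shape from the current time and a tag. `baseline_filename` is a hypothetical helper, not the code `dtail-tools` uses, and replacing unusual characters in the tag with `-` is an assumption made here for illustration.

```bash
# baseline_filename: hypothetical helper that mirrors the documented pattern
# baseline_YYYYMMDD_HHMMSS_descriptive-name.txt; not the dtail-tools code.
baseline_filename() {
  local tag="$1"
  local safe_tag
  # Assumption: characters outside A-Z, a-z, 0-9, ".", "_" and "-" become "-"
  # so the resulting name stays shell-friendly.
  safe_tag=$(printf '%s' "$tag" | tr -c 'A-Za-z0-9._-' '-')
  printf 'baseline_%s_%s.txt\n' "$(date +%Y%m%d_%H%M%S)" "$safe_tag"
}

# Usage:
#   baseline_filename "before cache optimization"
```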

### Comparing Performance
Compare current performance against a baseline to detect regressions or improvements:

```bash
# Using make
make benchmark-compare BASELINE=benchmarks/baselines/baseline_20240125_143022.txt

# Using benchmark script (provides benchstat analysis if available)
./dtail-tools benchmark -mode compare -baseline benchmarks/baselines/baseline_20240125_143022.txt
```
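
When `benchstat` is not installed, a rough comparison can still be made by hand. The sketch below prints the percentage change in `ns/op` for every benchmark found in both files; a positive value means the current run is slower. `compare_ns_per_op` is a hypothetical helper, not what `dtail-tools` does internally, and it assumes both files contain standard Go benchmark output.

```bash
# compare_ns_per_op: hypothetical sketch, not the dtail-tools comparison.
# Arguments: baseline file, current results file (two different paths).
# Only the last result per benchmark name in each file is used.
compare_ns_per_op() {
  awk '
    /^Benchmark/ {
      for (i = 3; i <= NF; i++) {
        if ($i == "ns/op") {
          # The first file argument is the baseline, the second the current run.
          if (FILENAME == ARGV[1]) base[$1] = $(i - 1)
          else cur[$1] = $(i - 1)
        }
      }
    }
    END {
      for (name in base) {
        if (name in cur && base[name] > 0) {
          printf "%s %+.1f%%\n", name, (cur[name] - base[name]) / base[name] * 100
        }
      }
    }
  ' "$1" "$2" | sort
}

# Usage (current.txt is a saved `go test -bench` run):
#   compare_ns_per_op benchmarks/baselines/baseline_20240125_143022.txt current.txt
```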

### Managing Baselines
```bash
# List all baselines
./dtail-tools benchmark -mode list

# View a specific baseline
cat benchmarks/baselines/baseline_20240125_143022.txt

# Clean old baselines (keeps last 10)
./dtail-tools benchmark -mode clean
```
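
The "keep the last 10" behavior can be pictured with the following sketch. It is an illustration of the idea only: `prune_baselines` is a hypothetical function, and the actual `-mode clean` implementation may select files differently. It relies on the timestamp in each file name, so a plain lexical sort is also a chronological sort.

```bash
# prune_baselines: hypothetical sketch of "keep the newest N" logic; the real
# `-mode clean` implementation may differ.
# Arguments: baseline directory, number of files to keep (default 10).
prune_baselines() {
  local dir="$1"
  local keep="${2:-10}"
  local total
  total=$(find "$dir" -maxdepth 1 -type f -name 'baseline_*.txt' | wc -l)
  # Nothing to do when there are no more than $keep baselines.
  [ "$total" -gt "$keep" ] || return 0
  # Oldest names sort first, so delete the first (total - keep) entries.
  find "$dir" -maxdepth 1 -type f -name 'baseline_*.txt' | sort |
    head -n "$((total - keep))" |
    while IFS= read -r file; do
      rm -- "$file"
    done
}

# Usage:
#   prune_baselines benchmarks/baselines 10
```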

### Best Practices for Baselines
1. Create a baseline before starting optimization work
2. Tag baselines with descriptive names (e.g., "before-cache-impl", "v1.0-release")
3. Use full baselines for release comparisons
4. Commit important baseline files to version control for team reference
5. Run benchmarks on consistent hardware for accurate comparisons

## Contributing

When adding new benchmarks:
1. Follow existing naming conventions
2. Include warmup runs
3. Report relevant metrics
4. Clean up temporary files
5. Document in this README