diff options
| author | Paul Buetow <paul@buetow.org> | 2025-08-31 15:15:50 +0300 |
|---|---|---|
| committer | Paul Buetow <paul@buetow.org> | 2025-08-31 15:15:50 +0300 |
| commit | 114379d591fbbf6afeb649521be0a1aab2bbd3aa (patch) | |
| tree | 6a9d206b375b518c7cf6de70f0733630eafcfbcd | |
| parent | c508f9b24be667d5d078034b4d4e7bb07b28da69 (diff) | |
Implement REVIEW.md suggestions:\n- Simplify gemfeed detection in Aggregator\n- Make log globs configurable for Logreader\n- Remove debug-only imports; add comments\n- Add HTML::Entities usage; document deps\n- Add tests under t/ and make script require-safe\n- Add .perltidyrc and Justfile; add dev tasks\n- Update README and REVIEW with changes
| -rw-r--r-- | .perltidyrc | 18 | ||||
| -rw-r--r-- | Justfile | 26 | ||||
| -rw-r--r-- | README.md | 28 | ||||
| -rw-r--r-- | REVIEW.md | 59 | ||||
| -rw-r--r-- | foostats.pl | 157 | ||||
| -rw-r--r-- | t/00-load.t | 9 | ||||
| -rw-r--r-- | t/01-aggregator.t | 35 | ||||
| -rw-r--r-- | t/02-filter.t | 24 | ||||
| -rw-r--r-- | t/03-logreader-web.t | 28 | ||||
| -rw-r--r-- | t/data/access.log | 1 | ||||
| -rw-r--r-- | t/tmp_filter_log | 25 | ||||
| -rw-r--r-- | t/tmp_odds.txt | 1 |
12 files changed, 332 insertions, 79 deletions
diff --git a/.perltidyrc b/.perltidyrc new file mode 100644 index 0000000..9dbc325 --- /dev/null +++ b/.perltidyrc @@ -0,0 +1,18 @@ +# Perltidy configuration for foostats + +# Line length and indentation +-l=100 +-i=4 +-ci=4 + +# Vertical and parenthesis tightness +-vt=2 +-pt=2 + +# Backup original and write in place +-b +-bext=.bak + +# Keep errors summary +-se + diff --git a/Justfile b/Justfile new file mode 100644 index 0000000..64c2242 --- /dev/null +++ b/Justfile @@ -0,0 +1,26 @@ +set shell := ["bash", "-cu"] + +# Format Perl code using perltidy (uses .perltidyrc) +format: + perltidy foostats.pl + +# Run tests via prove if available, otherwise fallback to perl runner +test: + if command -v prove >/dev/null 2>&1; then \ + prove -lr t; \ + else \ + for f in t/*.t; do echo "== $$f =="; PERL5LIB=. perl "$$f" || exit 1; done; \ + fi + +# Syntax check +check: + perl -c foostats.pl + +clean: + find . -type f \( -name "*.bak" -o -name "*~" -o -name ".#*" \) -print0 | xargs -0r rm -f + +# Generate Gemtext and HTML reports into local directories from local stats +reports: + mkdir -p out_gmi out_html + perl foostats.pl --report --stats-dir stats --output-dir out_gmi --html-output-dir out_html + echo "Generated Gemtext in out_gmi and HTML in out_html" @@ -17,9 +17,19 @@ A privacy-respecting web analytics tool for OpenBSD that processes HTTP/HTTPS an On OpenBSD, install dependencies: ```sh -doas pkg_add p5-Digest-SHA3 p5-PerlIO-gzip p5-JSON p5-String-Util p5-LWP-Protocol-https +doas pkg_add p5-Digest-SHA3 p5-PerlIO-gzip p5-JSON p5-String-Util p5-LWP-Protocol-https p5-HTML-Parser perltidy ``` +## Dependencies + +- Perl 5.38+ +- Core: `Time::Piece`, `Getopt::Long`, `Sys::Hostname`, `File::Basename` +- CPAN/Packages: `Digest::SHA3`, `PerlIO::gzip`, `JSON`, `String::Util`, `LWP::UserAgent` (and HTTPS support), `HTML::Entities` + +Notes: +- On OpenBSD the packages are: `p5-Digest-SHA3`, `p5-PerlIO-gzip`, `p5-JSON`, `p5-String-Util`, `p5-LWP-Protocol-https`, `p5-HTML-Parser` (provides `HTML::Entities`). +- The script expects Perl 5.38 features; adjust accordingly if running older Perl. + ## Usage ### Basic Operations @@ -129,3 +139,19 @@ Reports include: ## License BSD 3-Clause License (see LICENSE file) +## Testing + +Basic test suite using Test::More is included under `t/`. + +- Run all tests: `prove -lr t` +- The script now avoids running its CLI when loaded via `require`, enabling unit testing of internal packages. + +## Development + +- Format: `just format` (uses `.perltidyrc`, requires `perltidy`) +- Test: `just test` (uses `prove` if available, otherwise runs tests via `perl`) +- Lint/syntax: `just check` +- Cleanup: `just clean` +- Generate reports (from repo's `stats/`): `just reports` + +Tip: install `just` (a command runner) for convenience. diff --git a/REVIEW.md b/REVIEW.md new file mode 100644 index 0000000..1f4b52e --- /dev/null +++ b/REVIEW.md @@ -0,0 +1,59 @@ +# Review of foostats.pl + +This review provides suggestions for improving the `foostats.pl` Perl script. The script is well-structured and uses modern Perl features, but there are several areas where it could be enhanced. + +## 1. Code Duplication + +There are some instances of code duplication that could be refactored for better maintainability: + +* In `Foostats::Aggregator::add_feed_ips`, the code for handling `GEMFEED_URI` and `GEMFEED_URI_2` is identical. This could be simplified by using a loop or a regex. +* The logic for parsing web and gemini logs in `Foostats::Logreader` shares some similarities and could potentially be generalized. + +Implemented: +- Simplified `add_feed_ips` with a single regex that matches `/gemfeed/` and `/gemfeed/index.gmi` (including optional query strings), removing duplicated branches. +- Made log path globs configurable at runtime via `FOOSTATS_WEB_LOGS_GLOB` and `FOOSTATS_GEMINI_LOGS_GLOB` and refactored constant globs into functions. This reduces coupling and enables shared `read_lines` usage for both parsers and testing. + +## 2. Dependencies + +* **List dependencies in `README.md`:** At a minimum, list the required modules in the `README.md` file with instructions on how to install them. + +Implemented: +- Added a dedicated Dependencies section in `README.md` with core and CPAN modules. +- Included OpenBSD package names, adding `p5-HTML-Parser` (provides `HTML::Entities`) which is used by the HTML report generator. + +## 3. Testing + +The script lacks an automated test suite. Adding tests would significantly improve its reliability and make future development easier and safer. + +* **Use `Test::More`:** Perl's core `Test::More` module is the standard for writing tests. +* **Test individual components:** Write tests for each package, especially for the log parsing and data aggregation logic. +* **Create mock data:** Create sample log files to test the parsing logic under different scenarios. + +Implemented: +- Converted the script to avoid executing the CLI when `require`-d by wrapping the main flow into `foostats_main()` and calling it `unless caller`. This allows unit testing of internal packages. +- Added basic tests under `t/`: + - `t/00-load.t`: loads the script. + - `t/01-aggregator.t`: verifies feed IP aggregation for Atom and Gemfeed and page IP tracking. + - `t/02-filter.t`: verifies excessive request filtering behavior. + - `t/03-logreader-web.t`: parses a minimal mocked web access log via `FOOSTATS_WEB_LOGS_GLOB`. +- Introduced env-configurable log globs to enable tests to point at mock logs without touching system paths. +- Added `.perltidyrc` and a `Justfile` with `format`, `test`, `check`, and `clean` tasks to standardize formatting and developer workflow (replaced `Makefile`). + - Added a `reports` task to the `Justfile` to generate Gemtext and HTML reports locally from the repo's `stats/` directory into `out_gmi/` and `out_html/`. + +## 4. Readability and Style + +The script is generally well-written, but a few things could improve its readability: + +* **Remove debugging code:** The script contains `use diagnostics;` and `use Data::Dumper;` with "TODO: UNDO" comments. This debugging code should be removed from the production version of the script. +* **Add comments:** Some of the more complex parts of the script, like the data structures used for statistics, could benefit from comments explaining their structure and purpose. +* **Consistent formatting:** While the formatting is mostly consistent, a tool like `perltidy` could be used to enforce a standard style across the entire script. + +Implemented: +- Removed `use diagnostics;` and `use Data::Dumper;` from production code. +- Added comments describing the stats data model in `Foostats::Aggregator`. +- Added explicit `use HTML::Entities qw(encode_entities);` in `Foostats::Reporter` and documented dependency. + +Open items / Next steps: +- Consider further generalization between web and gemini parsers if beneficial (e.g., a unified parse event builder). +- Optionally run `perltidy` and adopt a formatter config for consistent style across the codebase. + - Consider adding more parser tests (Gemini vger/relayd sample lines, edge cases like query strings and unusual paths). diff --git a/foostats.pl b/foostats.pl index 235da02..4e32a9f 100644 --- a/foostats.pl +++ b/foostats.pl @@ -12,8 +12,8 @@ use experimental qw(builtin); use feature qw(refaliasing); no warnings qw(experimental::refaliasing); -# TODO: UNDO -use diagnostics; +# Debugging aids like diagnostics are noisy in production. +# Removed per review: enable locally when debugging only. use constant VERSION => 'v0.1.0'; @@ -111,10 +111,9 @@ package Foostats::Logreader { use Time::Piece; use String::Util qw(contains startswith endswith); - use constant { - GEMINI_LOGS_GLOB => '/var/log/daemon*', - WEB_LOGS_GLOB => '/var/www/logs/access.log*', - }; + # Make log locations configurable (env overrides) to enable testing. + sub gemini_logs_glob { $ENV{FOOSTATS_GEMINI_LOGS_GLOB} // '/var/log/daemon*' } + sub web_logs_glob { $ENV{FOOSTATS_WEB_LOGS_GLOB} // '/var/www/logs/access.log*' } sub anonymize_ip ($ip) { my $ip_proto = @@ -197,7 +196,7 @@ package Foostats::Logreader { }; } - read_lines WEB_LOGS_GLOB, sub ( $year, @line ) { + read_lines web_logs_glob(), sub ( $year, @line ) { $cb->( parse_web_line @line ); }; } @@ -244,7 +243,7 @@ package Foostats::Logreader { # Expect one vger and one relayd log line per event! So collect # both events (one from one log line each) and then merge the result hash! my ( $vger, $relayd ); - read_lines GEMINI_LOGS_GLOB, sub ( $year, @line ) { + read_lines gemini_logs_glob(), sub ( $year, @line ) { if ( $line[4] eq 'vger:' ) { $vger = parse_vger_line $year, @line; } @@ -321,6 +320,7 @@ package Foostats::Filter { \my $uri_path = \$event->{uri_path}; for ( $self->{odds}->@* ) { + next if !defined $_ || $_ eq '' || /^\s*#/; next unless contains( $uri_path, $_ ); @@ -395,17 +395,21 @@ package Foostats::Aggregator { my $date = $event->{date}; my $date_key = $event->{proto} . "_$date"; + # Stats data model per protocol+day (key: "proto_YYYYMMDD"): + # - count: per-proto request count, per IP version, and filtered count + # - feed_ips: unique IPs per feed type (atom_feed, gemfeed) + # - page_ips: unique IPs per host and per URL $self->{stats}{$date_key} //= { count => { - filtered => 0 + filtered => 0, }, feed_ips => { atom_feed => {}, - gemfeed => {} + gemfeed => {}, }, page_ips => { hosts => {}, - urls => {} + urls => {}, }, }; @@ -434,18 +438,19 @@ package Foostats::Aggregator { \my $f = \$stats->{feed_ips}; \my $e = \$event; - if ( endswith( $e->{uri_path}, ATOM_FEED_URI ) ) { + # Atom feed (exact path match, allow optional query string) + if ( $e->{uri_path} =~ m{^/gemfeed/atom\.xml(?:[?#].*)?$} ) { ( $f->{atom_feed}->{ $e->{ip_hash} } //= 0 )++; + return 1; } - elsif ( contains( $e->{uri_path}, GEMFEED_URI ) ) { - ( $f->{gemfeed}->{ $e->{ip_hash} } //= 0 )++; - } - elsif ( endswith( $e->{uri_path}, GEMFEED_URI_2 ) ) { + + # Gemfeed index: '/gemfeed/' or '/gemfeed/index.gmi' (optionally with query) + if ( $e->{uri_path} =~ m{^/gemfeed/(?:index\.gmi)?(?:[?#].*)?$} ) { ( $f->{gemfeed}->{ $e->{ip_hash} } //= 0 )++; + return 1; } - else { - 0; - } + + return 0; } sub add_page_ips ( $self, $stats, $event ) { @@ -557,7 +562,7 @@ package Foostats::Replicator { } package Foostats::Merger { - use Data::Dumper; # TODO: UNDO + # Removed Data::Dumper (debug-only) per review. sub merge ($stats_dir) { my %merge; @@ -706,6 +711,7 @@ package Foostats::Merger { package Foostats::Reporter { use Time::Piece; + use HTML::Entities qw(encode_entities); sub truncate_url { my ( $url, $max_length ) = @_; @@ -929,16 +935,7 @@ package Foostats::Reporter { return $str; } - # Encode HTML entities to prevent XSS - sub encode_entities { - my ($text) = @_; - $text =~ s/&/&/g; - $text =~ s/</</g; - $text =~ s/>/>/g; - $text =~ s/"/"/g; - $text =~ s/'/'/g; - return $text; - } + # Use HTML::Entities::encode_entities imported above # Generate HTML wrapper sub generate_html_page { @@ -1481,7 +1478,7 @@ sub build_top3_urls_last_n_days_per_day { } } -package main { +package main; use Getopt::Long; use Sys::Hostname; @@ -1524,54 +1521,58 @@ package main { $out->write; } - my ( $parse_logs, $replicate, $report, $all, $help, $version ); - - # With default values - my $stats_dir = '/var/www/htdocs/buetow.org/self/foostats'; - my $odds_file = $stats_dir . '/fooodds.txt'; - my $odds_log = '/var/log/fooodds'; - my $output_dir; # Will default to $stats_dir/gemtext if not specified - my $html_output_dir; # Will default to /var/www/htdocs/gemtexter/stats.foo.zone if not specified - my $partner_node = - hostname eq 'fishfinger.buetow.org' - ? 'blowfish.buetow.org' - : 'fishfinger.buetow.org'; - - GetOptions - 'parse-logs!' => \$parse_logs, - 'filter-log=s' => \$odds_log, - 'odds-file=s' => \$odds_file, - 'replicate!' => \$replicate, - 'report!' => \$report, - 'all!' => \$all, - 'stats-dir=s' => \$stats_dir, - 'output-dir=s' => \$output_dir, - 'html-output-dir=s' => \$html_output_dir, - 'partner-node=s' => \$partner_node, - 'version' => \$version, - 'help|?' => \$help; - - if ($version) { - print "foostats " . VERSION . "\n"; - exit 0; - } + sub foostats_main { + my ( $parse_logs, $replicate, $report, $all, $help, $version ); + + # With default values + my $stats_dir = '/var/www/htdocs/buetow.org/self/foostats'; + my $odds_file = $stats_dir . '/fooodds.txt'; + my $odds_log = '/var/log/fooodds'; + my $output_dir; # Will default to $stats_dir/gemtext if not specified + my $html_output_dir; # Will default to /var/www/htdocs/gemtexter/stats.foo.zone if not specified + my $partner_node = + hostname eq 'fishfinger.buetow.org' + ? 'blowfish.buetow.org' + : 'fishfinger.buetow.org'; + + GetOptions + 'parse-logs!' => \$parse_logs, + 'filter-log=s' => \$odds_log, + 'odds-file=s' => \$odds_file, + 'replicate!' => \$replicate, + 'report!' => \$report, + 'all!' => \$all, + 'stats-dir=s' => \$stats_dir, + 'output-dir=s' => \$output_dir, + 'html-output-dir=s' => \$html_output_dir, + 'partner-node=s' => \$partner_node, + 'version' => \$version, + 'help|?' => \$help; + + if ($version) { + print "foostats " . VERSION . "\n"; + exit 0; + } - usage() if $help; + usage() if $help; - parse_logs( $stats_dir, $odds_file, $odds_log ) - if $parse_logs - or $all; + parse_logs( $stats_dir, $odds_file, $odds_log ) + if $parse_logs + or $all; - Foostats::Replicator::replicate( $stats_dir, $partner_node ) - if $replicate - or $all; + Foostats::Replicator::replicate( $stats_dir, $partner_node ) + if $replicate + or $all; - # Set default output directories if not specified - $output_dir //= '/var/gemini/stats.foo.zone'; - $html_output_dir //= '/var/www/htdocs/gemtexter/stats.foo.zone'; - - Foostats::Reporter::report( $stats_dir, $output_dir, $html_output_dir, - Foostats::Merger::merge($stats_dir) ) - if $report - or $all; -} + # Set default output directories if not specified + $output_dir //= '/var/gemini/stats.foo.zone'; + $html_output_dir //= '/var/www/htdocs/gemtexter/stats.foo.zone'; + + Foostats::Reporter::report( $stats_dir, $output_dir, $html_output_dir, + Foostats::Merger::merge($stats_dir) ) + if $report + or $all; + } + + # Only run main flow when executed as a script, not when required (e.g., tests) + foostats_main() unless caller; diff --git a/t/00-load.t b/t/00-load.t new file mode 100644 index 0000000..af89267 --- /dev/null +++ b/t/00-load.t @@ -0,0 +1,9 @@ +use strict; +use warnings; +use Test::More; + +# Load script without running main +ok( do './foostats.pl', 'loaded foostats.pl' ); + +done_testing; + diff --git a/t/01-aggregator.t b/t/01-aggregator.t new file mode 100644 index 0000000..1b01e02 --- /dev/null +++ b/t/01-aggregator.t @@ -0,0 +1,35 @@ +use strict; +use warnings; +use Test::More; + +ok( do './foostats.pl', 'loaded foostats.pl' ); + +# Ensure odds file exists before creating filter/aggregator +open my $odd, '>', 't/tmp_odds.txt' or die $!; +print $odd "\n"; close $odd; + +my $agg = Foostats::Aggregator->new('t/tmp_odds.txt', 't/tmp_filter_log'); + +my $date = 20250101; + +my $events = [ + { proto => 'web', host => 'example.org', ip_hash => 'ip1', ip_proto => 'IPv4', date => $date, time => '120000', uri_path => '/gemfeed/atom.xml', status => 200 }, + { proto => 'gemini', host => 'example.org', ip_hash => 'ip2', ip_proto => 'IPv6', date => $date, time => '120100', uri_path => '/gemfeed/', status => 20 }, + { proto => 'web', host => 'example.org', ip_hash => 'ip3', ip_proto => 'IPv4', date => $date, time => '120200', uri_path => '/gemfeed/index.gmi', status => 200 }, + { proto => 'web', host => 'example.org', ip_hash => 'ip4', ip_proto => 'IPv4', date => $date, time => '120300', uri_path => '/index.html', status => 200 }, +]; + +$agg->add($_) for @$events; + +my $stats = $agg->{stats}{"web_" . $date}; +ok($stats, 'have web stats for date'); +use Test::More; diag("web stats: ", join(',', sort keys %{$stats->{feed_ips}{atom_feed}})); +is( scalar(keys %{$stats->{feed_ips}{atom_feed}}), 1, 'one atom feed IP'); +is( scalar(keys %{$stats->{feed_ips}{gemfeed}}), 1, 'one gemfeed IP (from web)'); + +my $gstats = $agg->{stats}{"gemini_" . $date}; +ok($gstats, 'have gemini stats for date'); +diag("gemini feed keys: ", join(',', sort keys %{$gstats->{feed_ips}{gemfeed}})); +is( scalar(keys %{$gstats->{feed_ips}{gemfeed}}), 1, 'one gemfeed IP (from gemini)'); + +done_testing; diff --git a/t/02-filter.t b/t/02-filter.t new file mode 100644 index 0000000..902ba81 --- /dev/null +++ b/t/02-filter.t @@ -0,0 +1,24 @@ +use strict; +use warnings; +use Test::More; + +ok( do './foostats.pl', 'loaded foostats.pl' ); + +# Ensure odds file exists +open my $odd, '>', 't/tmp_odds.txt' or die $!; +print $odd "\n"; close $odd; + +my $agg = Foostats::Aggregator->new('t/tmp_odds.txt', 't/tmp_filter_log'); + +my $date = 20250102; + +my $e1 = { proto => 'web', host => 'example.org', ip_hash => 'same', ip_proto => 'IPv4', date => $date, time => '121212', uri_path => '/index.html', status => 200 }; +my $e2 = { %$e1 }; # same ip and same second triggers excessive filter + +$agg->add($e1); +$agg->add($e2); + +my $stats = $agg->{stats}{"web_" . $date}; +is( $stats->{count}{filtered} // 0, 1, 'one filtered request due to excessive rate'); + +done_testing; diff --git a/t/03-logreader-web.t b/t/03-logreader-web.t new file mode 100644 index 0000000..ca5ba39 --- /dev/null +++ b/t/03-logreader-web.t @@ -0,0 +1,28 @@ +use strict; +use warnings; +use Test::More; + +ok( do './foostats.pl', 'loaded foostats.pl' ); + +# Create a minimal access log line approximating combined/forwarded format +my $dir = 't/data'; +mkdir $dir unless -d $dir; +my $log = "$dir/access.log"; +open my $fh, '>', $log or die $!; +print $fh qq{example.org 192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /gemfeed/index.gmi HTTP/1.1" 200 123 "-" "ua" -\n}; +close $fh; + +$ENV{FOOSTATS_WEB_LOGS_GLOB} = $log; + +my @events; +Foostats::Logreader::parse_web_logs(0, sub { my ($ev) = @_; push @events, $ev if $ev }); + +ok(@events == 1, 'parsed one web log event'); +is($events[0]{proto}, 'web', 'proto parsed'); +is($events[0]{host}, 'example.org', 'host parsed'); +like($events[0]{uri_path}, qr{^/gemfeed/index\.gmi$}, 'URI parsed'); +like($events[0]{date}, qr/^\d{8}$/ , 'date normalized'); +like($events[0]{time}, qr/^\d{6}$/, 'time normalized'); +ok($events[0]{ip_hash}, 'ip hashed'); + +done_testing; diff --git a/t/data/access.log b/t/data/access.log new file mode 100644 index 0000000..6541fee --- /dev/null +++ b/t/data/access.log @@ -0,0 +1 @@ +example.org 192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /gemfeed/index.gmi HTTP/1.1" 200 123 "-" "ua" - diff --git a/t/tmp_filter_log b/t/tmp_filter_log new file mode 100644 index 0000000..3b07060 --- /dev/null +++ b/t/tmp_filter_log @@ -0,0 +1,25 @@ +WARN: /gemfeed/atom.xml contains and is odd and will therefore be blocked! +WARN: /gemfeed/ contains and is odd and will therefore be blocked! +WARN: /gemfeed/index.gmi contains and is odd and will therefore be blocked! +WARN: /index.html contains and is odd and will therefore be blocked! +WARN: /index.html contains and is odd and will therefore be blocked! +WARN: /gemfeed/atom.xml contains and is odd and will therefore be blocked! +WARN: /gemfeed/ contains and is odd and will therefore be blocked! +WARN: /gemfeed/index.gmi contains and is odd and will therefore be blocked! +WARN: /index.html contains and is odd and will therefore be blocked! +WARN: /gemfeed/atom.xml contains and is odd and will therefore be blocked! +WARN: /gemfeed/ contains and is odd and will therefore be blocked! +WARN: /gemfeed/index.gmi contains and is odd and will therefore be blocked! +WARN: /index.html contains and is odd and will therefore be blocked! +OK: /gemfeed/atom.xml appears fine... +OK: /gemfeed/ appears fine... +OK: /gemfeed/index.gmi appears fine... +OK: /index.html appears fine... +OK: /index.html appears fine... +WARN: same blocked due to excessive requesting... +OK: /gemfeed/atom.xml appears fine... +OK: /gemfeed/ appears fine... +OK: /gemfeed/index.gmi appears fine... +OK: /index.html appears fine... +OK: /index.html appears fine... +WARN: same blocked due to excessive requesting... diff --git a/t/tmp_odds.txt b/t/tmp_odds.txt new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/t/tmp_odds.txt @@ -0,0 +1 @@ + |
