NEXMark Benchmarks
NEXMark is the standard workload used to compare stream processing systems. It simulates an online auction — people, bids, auctions, a steady event rate — and defines a handful of queries (projection, windowed aggregation, stream-to-stream join) that together exercise the throughput, latency, and stateful-operator paths that matter in real pipelines.
Volley ships five of those queries as runnable examples so you can measure what your hardware does with your data shape, and compare against Flink or any other system that also publishes NEXMark numbers.
Queries
| Query | Description | What it tests |
|---|---|---|
| Q1 | Currency Conversion | Raw throughput — stateless map/projection |
| Q4 | Average Selling Price per Category | Windowed join between bids and auctions; post-join filtering; min-watermark across dual sources |
| Q5 | Hot Items | Sliding window aggregation under sustained load |
| Q7 | Highest Bid per Session | Session window merging; high-cardinality keyed state; RocksDB under sustained load |
| Q8 | Monitor New Users | Windowed stream-to-stream join |
Running
# Q8 — Windowed join (multi-source pipeline)
cargo run --example nexmark_q8 -p volley-benchmark --release
# Q5 — Sliding window aggregation (recommended starting point)
cargo run --example nexmark_q5 -p volley-benchmark --release
# Q4 — Windowed join (average selling price per category)
cargo run --example nexmark_q4 -p volley-benchmark --release
# Q7 — Session windows (highest bid per session per bidder)
cargo run --example nexmark_q7 -p volley-benchmark --release
# Q1 — Stateless projection baseline
cargo run --example nexmark_q1 -p volley-benchmark --release
Configuration
| Variable | Default | Description |
|---|---|---|
NEXMARK_DURATION | 10 | Benchmark duration in seconds |
NEXMARK_BATCH_SIZE | 4000 | Records per Arrow RecordBatch |
NEXMARK_PARALLELISM | 4 | Parallel operator instances (Q5) |
NEXMARK_AUCTIONS | 100 | Number of active auctions (Q5) |
NEXMARK_WINDOW_SIZE | 10 | Window size in seconds (Q5) |
NEXMARK_WINDOW_SLIDE | 2 | Window slide in seconds (Q5) |
NEXMARK_INTER_EVENT_MS | 1 | Inter-event time in milliseconds |
Metrics
- Sustained throughput (records/sec) — measured over the full run, not peak
- p50/p99/p999 latency — per-batch processing time from source to sink
- Memory footprint — RSS delta under load
- Recovery time — restore from checkpoint after failure (planned)
Comparison Baseline
When comparing against Apache Flink:
- Use tuned Flink — RocksDB state backend, off-heap memory, equivalent parallelism
- Match batch semantics — set Flink’s
table.exec.mini-batch.sizeto match Volley’s batch size - Same hardware, same event rate, same total events
- Discard first 10% of measurements for JIT warmup (Flink) and cache warming (both)
See volley-benchmark/README.md for full methodology.
Stress testing
In addition to the short benchmark runs above, Volley ships a stress testing suite for long-running pass/fail validation. Stress scenarios run for hours against Kafka-backed pipelines, auto-calibrate the throughput ceiling, and assert hard SLO thresholds.
Scenarios
| Scenario | What it tests |
|---|---|
stress_sustained | Sliding window aggregation at 80% of calibrated ceiling for 2 h — verifies no throughput drift, latency regression, or memory growth |
stress_backpressure | Three-phase overload test: baseline → 120% ceiling → recovery — asserts catch-up within 60 s |
stress_checkpoint_recovery | Crash mid-run and restart from checkpoint — asserts state restoration within 30 s |
stress_nexmark_q4 | Kafka-backed Q4 (windowed join) under sustained load |
stress_nexmark_q7 | Kafka-backed Q7 (session windows, high cardinality) under sustained load |
Running
# All scenarios — default 2 h each (requires Docker for Kafka)
./scripts/stress-test.sh
# CI mode — 10 min per scenario
./scripts/stress-test.sh --ci
# Single scenario with custom settings
STRESS_DURATION=30m STRESS_TARGET_RATE=5000 ./scripts/stress-test.sh sustained
Pass/fail thresholds
| Threshold | Default limit |
|---|---|
| Throughput drift | ≤ 5% drop from calibrated operating point |
| p99 latency | ≤ 2× warmup baseline |
| p999 latency | ≤ 5× warmup baseline |
| Memory growth | ≤ 20% RSS increase from warmup to end |
| Checkpoint duration | ≤ 2× warmup baseline |
Reports are written as JSON to target/stress-reports/ and uploaded as a CI artifact when run via the stress.yml GitHub Actions workflow.