Introduction
Volley is a Rust library for building streaming pipelines — read from Kafka or a bucket, filter and aggregate with SQL-style expressions, write to Kafka, Delta, Iceberg, or a database, and have every record counted exactly once even when things crash. Pipelines are single Rust binaries: no JVM, no separate job server, and every operator is checked by the compiler before it ever runs.
If you’ve used Flink or Kafka Streams, the shape is familiar — a fluent DataStream builder, event-time windows, checkpoints. If you’ve used DataFusion, Polars, or DuckDB, the operators are familiar — expressions over Arrow RecordBatches. Volley is the place those two worlds meet.
The API is type-safe at compile time, the docs are structured for both humans and AI coding agents, and every example below is runnable out of the box.
Key Features
- High Performance — Native Arrow batching with write-behind state caching. Validated via NEXMark benchmarks
- Exactly-Once Semantics — Checkpoint-based fault tolerance with aligned barrier snapshotting
- Flink-like API — Fluent DataStream API with typestate-enforced compile-time safety
- Expression Engine — Vectorized
filter_expr(),select_expr(),aggregate_expr()powered by DataFusion. You write the same kinds of predicates and projections you’d write in SQL (col("price").gt(lit(100)),sum(col("amount"))) and they run column-at-a-time on native Arrow types - Event-Time Windowing — Tumbling, sliding, and session windows with watermark tracking
- State Management — RocksDB backend with write-behind cache and hardlink checkpoints
- Intra-Node Parallelism — Hash-partitioned parallel operator instances
- Kafka Integration — Source and sink with exactly-once transactional semantics
- Cloud Blob Sources — S3+SQS and Azure Blob+Queue with Parquet/JSON/CSV/Avro
- Cloud Blob Sinks — S3, Azure Blob, and GCS with Parquet/JSON Lines/CSV
- Table Format Sinks — Delta Lake and Apache Iceberg with exactly-once commits
- Kubernetes Operator — VolleyApplication CRD with autoscaling, health checks, and Prometheus
Built On
| Component | Purpose |
|---|---|
| Apache Arrow | In-memory columnar format |
| DataFusion | Query engine and expression evaluation |
| RocksDB | State management and checkpointing |
| Tokio | Async runtime for task-per-stage execution |
Architecture
+-----------------------------------------------------+
| Stream Processing Layer (Rust) |
| +- Fluent Builder API (DataStream, KeyedStream) |
| +- Multi-Row RecordBatch Operators |
| +- Watermarks & Event Time |
| +- Windowing (Tumbling, Sliding, Session) |
| +- Intra-Node Parallelism (hash-partitioned) |
| +- Checkpoint Coordinator |
+-----------------------+-----------------------------+
|
+-----------------------v-----------------------------+
| Async Runtime (Tokio) |
| +- Task-per-Stage Execution |
| +- Bounded Channels (Backpressure) |
| +- Graceful Shutdown |
+-----------------------+-----------------------------+
|
+-----------------------v-----------------------------+
| State Management (RocksDB) |
| +- Write-Behind Accumulator Cache |
| +- BatchStateBackend (multi_get / WriteBatch) |
| +- Per-Key State Isolation |
| +- Hardlink Checkpoints |
+-----------------------+-----------------------------+
|
+-----------------------v-----------------------------+
| Connectors |
| +- Kafka (exactly-once, feature-gated) |
| +- Cloud Blob Stores (S3, Azure, GCS) |
| +- Delta Lake, Iceberg (exactly-once sinks) |
| +- Memory (batched source, counting sink) |
+-----------------------------------------------------+
Performance
Volley uses the NEXMark benchmark suite for performance evaluation — the industry standard for stream processing systems. See volley-benchmark/ for methodology, results, and Flink comparison baseline.
Connectors
Sources
| Connector | Formats | Status |
|---|---|---|
| Kafka | JSON, Protobuf, Avro (Schema Registry) | Available (feature-gated) |
| AWS S3 + SQS | Parquet, JSON, CSV, Avro | Available |
| Azure Blob + Queue | Parquet, JSON, CSV, Avro | Available |
| Memory/Iterator | Arrow RecordBatch | Available |
Sinks
| Connector | Formats | Status |
|---|---|---|
| Kafka | JSON, Protobuf, Avro (Schema Registry) | Available (feature-gated, transactional) |
| AWS S3 · Azure Blob · GCS | Parquet, JSON Lines, CSV | Available (BufferedBlobSink) |
| Delta Lake | Parquet | Available (exactly-once) |
| Apache Iceberg | Parquet | Available (REST catalog) |
| Memory (collect) | Arrow RecordBatch | Available |
Next Steps
- Installation — set up your environment
- Quick Start — build your first pipeline in 5 minutes
- Architecture Overview — understand how Volley works
- API Reference — auto-generated rustdoc for all crates