Volley is a high-performance stream processing framework built in Rust on Apache Arrow. It provides a Flink-inspired DataStream API with exactly-once semantics, achieving 9M+ records/sec sustained on a single node.
Volley is designed for both human developers and AI coding agents. The API is type-safe at compile time, the documentation is structured for machine readability, and the examples are runnable out of the box.
High Performance — 9M+ rec/sec sustained with multi-row Arrow batching and write-behind state caching
Exactly-Once Semantics — Checkpoint-based fault tolerance with aligned barrier snapshotting
Flink-like API — Fluent DataStream API with typestate-enforced compile-time safety
Expression Engine — Vectorized filter_expr(), select_expr(), aggregate_expr() powered by DataFusion (without SQL). Multi-type aggregation on native Arrow types (i32, i64, f64, etc.)
Event-Time Windowing — Tumbling, sliding, and session windows with watermark tracking
State Management — RocksDB backend with write-behind cache and hardlink checkpoints
Intra-Node Parallelism — Hash-partitioned parallel operator instances
Kafka Integration — Source and sink with exactly-once transactional semantics
Cloud Blob Sources — S3+SQS, Azure Blob+Queue, GCS+Pub/Sub with Parquet/JSON/CSV/Avro
Table Format Sinks — Delta Lake and Apache Iceberg with exactly-once commits
Kubernetes Operator — VolleyApplication CRD with autoscaling, health checks, and Prometheus
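
To make the event-time windowing feature concrete, here is a minimal sketch of how a timestamp can be assigned to tumbling and sliding windows, and how a watermark decides when a window is complete. It uses only the Rust standard library and does not use Volley's API. The names `Window`, `tumbling_window`, `sliding_windows`, and `is_complete` are hypothetical, and the watermark convention (a lower bound on all future event timestamps) is an assumption.

```rust
/// A half-open event-time window [start, end), in milliseconds.
/// Hypothetical type for illustration; not part of Volley.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Window {
    pub start: i64,
    pub end: i64,
}

/// Tumbling windows: every timestamp falls into exactly one window of `size_ms`.
pub fn tumbling_window(ts_ms: i64, size_ms: i64) -> Window {
    assert!(size_ms > 0, "window size must be positive");
    // rem_euclid keeps the alignment correct for negative timestamps too.
    let start = ts_ms - ts_ms.rem_euclid(size_ms);
    Window { start, end: start + size_ms }
}

/// Sliding windows: a timestamp belongs to every window of length `size_ms`
/// whose start is a multiple of `slide_ms` and whose range covers it.
/// Windows are returned from the latest start to the earliest.
pub fn sliding_windows(ts_ms: i64, size_ms: i64, slide_ms: i64) -> Vec<Window> {
    assert!(size_ms > 0 && slide_ms > 0, "size and slide must be positive");
    let mut windows = Vec::new();
    let mut start = ts_ms - ts_ms.rem_euclid(slide_ms);
    // Walk backwards while the window [start, start + size_ms) still contains ts_ms.
    while start > ts_ms - size_ms {
        windows.push(Window { start, end: start + size_ms });
        start -= slide_ms;
    }
    windows
}

/// Assumed convention: the watermark is a lower bound on all future event
/// timestamps, so a window is complete once the watermark reaches its end.
pub fn is_complete(window: Window, watermark_ms: i64) -> bool {
    watermark_ms >= window.end
}
```

For example, `tumbling_window(12_500, 5_000)` yields the window `[10_000, 15_000)`, while `sliding_windows(12_500, 10_000, 5_000)` yields both `[10_000, 20_000)` and `[5_000, 15_000)`. Session windows are omitted here because they depend on gaps between events rather than on a fixed alignment.
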
| Component | Purpose |
|-----------|---------|
| Apache Arrow | In-memory columnar format |
| DataFusion | Query engine and expression evaluation |
| RocksDB | State management and checkpointing |
| Tokio | Async runtime for task-per-stage execution |
+-----------------------------------------------------+
| Stream Processing Layer (Rust)                      |
| +- Fluent Builder API (DataStream, KeyedStream)     |
| +- Multi-Row RecordBatch Operators                  |
| +- Watermarks & Event Time                          |
| +- Windowing (Tumbling, Sliding, Session)           |
| +- Intra-Node Parallelism (hash-partitioned)        |
| +- Checkpoint Coordinator                           |
+-----------------------+-----------------------------+
                        |
+-----------------------v-----------------------------+
| Async Runtime (Tokio)                               |
| +- Task-per-Stage Execution                         |
| +- Bounded Channels (Backpressure)                  |
| +- Graceful Shutdown                                |
+-----------------------+-----------------------------+
                        |
+-----------------------v-----------------------------+
| State Management (RocksDB)                          |
| +- Write-Behind Accumulator Cache                   |
| +- BatchStateBackend (multi_get / WriteBatch)       |
| +- Per-Key State Isolation                          |
| +- Hardlink Checkpoints                             |
+-----------------------+-----------------------------+
                        |
+-----------------------v-----------------------------+
| Connectors                                          |
| +- Kafka (exactly-once, feature-gated)              |
| +- Cloud Blob Stores (S3, Azure, GCS)               |
| +- Delta Lake, Iceberg (exactly-once sinks)         |
| +- Memory (batched source, counting sink)           |
+-----------------------------------------------------+
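
Two ideas from the layers above, hash-partitioned parallelism and bounded channels for backpressure, can be sketched together with the standard library alone. This is an illustration under stated assumptions, not Volley's implementation: OS threads and `std::sync::mpsc::sync_channel` stand in for Tokio tasks and async channels, and `partition_for` and `run_partitioned` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::mpsc::sync_channel;
use std::thread;

/// Map a key to one of `parallelism` operator instances.
/// The same key always lands on the same instance, which keeps per-key state isolated.
pub fn partition_for<K: Hash>(key: &K, parallelism: usize) -> usize {
    assert!(parallelism > 0, "parallelism must be positive");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % parallelism as u64) as usize
}

/// Route (key, value) records to `parallelism` workers over bounded channels.
/// Each worker keeps a running sum per key; the per-worker maps are returned
/// in partition order.
pub fn run_partitioned(
    records: Vec<(String, i64)>,
    parallelism: usize,
    capacity: usize,
) -> Vec<HashMap<String, i64>> {
    let mut senders = Vec::with_capacity(parallelism);
    let mut workers = Vec::with_capacity(parallelism);

    for _ in 0..parallelism {
        // A bounded channel holds at most `capacity` in-flight records.
        let (tx, rx) = sync_channel::<(String, i64)>(capacity);
        senders.push(tx);
        workers.push(thread::spawn(move || {
            let mut sums: HashMap<String, i64> = HashMap::new();
            // The loop ends once every sender for this channel is dropped.
            for (key, value) in rx {
                *sums.entry(key).or_insert(0) += value;
            }
            sums
        }));
    }

    for (key, value) in records {
        let partition = partition_for(&key, parallelism);
        // `send` blocks while the target channel is full, so a slow worker
        // slows the producer down instead of letting buffers grow unbounded.
        senders[partition]
            .send((key, value))
            .expect("worker exited early");
    }

    // Close all channels so the workers drain their queues and finish.
    drop(senders);
    workers
        .into_iter()
        .map(|worker| worker.join().expect("worker panicked"))
        .collect()
}
```

Because routing depends only on the key's hash, each key's aggregate lives in exactly one worker's map, and no locking is needed between workers. The channel capacity is the tuning knob: a smaller value applies backpressure sooner, a larger one absorbs longer bursts at the cost of memory.
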
Benchmarked on a single node (12 CPUs, Apple Silicon):
| Configuration | Throughput |
|---------------|------------|
| Single-row, parallelism=1 | 850K rec/s |
| Batch=1000, parallelism=4 | 8.7M rec/s |
| Sustained 10s (batch=1000, parallelism=4) | 9.4M rec/s |
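
The figures above can be related with some simple arithmetic. The sketch below is only a calculation over the published numbers, not a benchmark harness; the helper names `speedup` and `per_instance` are hypothetical.

```rust
/// Ratio of an improved throughput to a baseline throughput.
pub fn speedup(baseline_rps: f64, improved_rps: f64) -> f64 {
    improved_rps / baseline_rps
}

/// Average throughput handled by each parallel operator instance.
pub fn per_instance(total_rps: f64, parallelism: u32) -> f64 {
    total_rps / f64::from(parallelism)
}

/// Total records processed at a steady rate over a number of seconds.
pub fn records_processed(rps: f64, seconds: f64) -> f64 {
    rps * seconds
}
```

With the reported numbers, `speedup(850_000.0, 8_700_000.0)` is roughly 10.2, `per_instance(8_700_000.0, 4)` is about 2.2M rec/s per instance, and `records_processed(9_400_000.0, 10.0)` is 94M records for the sustained run. The batched configuration changes both the batch size and the parallelism relative to the baseline, so the 10.2x figure cannot be attributed to either factor alone.
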
| Source connector | Formats | Status |
|------------------|---------|--------|
| Kafka | JSON | Available (feature-gated) |
| AWS S3 + SQS | Parquet, JSON, CSV, Avro | Available |
| Azure Blob + Queue | Parquet, JSON, CSV, Avro | Available |
| GCS + Pub/Sub | Parquet, JSON, CSV, Avro | Available |
| Memory/Iterator | Arrow RecordBatch | Available |
| Sink connector | Formats | Status |
|----------------|---------|--------|
| Kafka | JSON | Available (feature-gated, transactional) |
| Delta Lake | Parquet | Available (exactly-once) |
| Apache Iceberg | Parquet | Available (REST catalog) |
| Memory (collect) | Arrow RecordBatch | Available |