Introduction

Volley is a high-performance stream processing framework built in Rust on Apache Arrow. It provides a Flink-inspired DataStream API with exactly-once semantics, achieving 9M+ records/sec sustained on a single node.

Volley is designed for both human developers and AI coding agents. The API is type-safe at compile time, the documentation is structured for machine readability, and the examples are runnable out of the box.

Key Features

  • High Performance — 9M+ rec/sec sustained with multi-row Arrow batching and write-behind state caching
  • Exactly-Once Semantics — Checkpoint-based fault tolerance with aligned barrier snapshotting
  • Flink-like API — Fluent DataStream API with typestate-enforced compile-time safety
  • Expression Engine — Vectorized filter_expr(), select_expr(), aggregate_expr() powered by DataFusion (without SQL). Multi-type aggregation on native Arrow types (i32, i64, f64, etc.)
  • Event-Time Windowing — Tumbling, sliding, and session windows with watermark tracking
  • State Management — RocksDB backend with write-behind cache and hardlink checkpoints
  • Intra-Node Parallelism — Hash-partitioned parallel operator instances
  • Kafka Integration — Source and sink with exactly-once transactional semantics
  • Cloud Blob Sources — S3+SQS, Azure Blob+Queue, GCS+Pub/Sub with Parquet/JSON/CSV/Avro
  • Table Format Sinks — Delta Lake and Apache Iceberg with exactly-once commits
  • Kubernetes Operator — VolleyApplication CRD with autoscaling, health checks, and Prometheus
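To make the event-time windowing feature more concrete, here is a minimal, standard-library-only sketch of how a timestamp can be assigned to tumbling and sliding windows, and how a watermark can mark a window as complete. The function names, the millisecond units, and the "complete when the watermark reaches the window end" convention are assumptions for illustration; they are not Volley's API, and session windows are not covered.

```rust
/// Start of the tumbling window `[start, start + size_ms)` that contains `ts_ms`.
/// Hypothetical helper for illustration; not part of Volley's API.
fn tumbling_window_start(ts_ms: i64, size_ms: i64) -> i64 {
    // rem_euclid keeps the result correct for negative timestamps too.
    ts_ms - ts_ms.rem_euclid(size_ms)
}

/// Starts of every sliding window `[start, start + size_ms)` that contains `ts_ms`,
/// in ascending order. Windows begin at multiples of `slide_ms`.
fn sliding_window_starts(ts_ms: i64, size_ms: i64, slide_ms: i64) -> Vec<i64> {
    let mut starts = Vec::new();
    // Latest window start that is <= ts_ms.
    let mut start = ts_ms - ts_ms.rem_euclid(slide_ms);
    // Walk backwards while the window still covers ts_ms.
    while start > ts_ms - size_ms {
        starts.push(start);
        start -= slide_ms;
    }
    starts.reverse();
    starts
}

/// Assumed convention: a window may emit its result once the watermark
/// has reached the (exclusive) end of the window.
fn window_is_complete(window_start_ms: i64, size_ms: i64, watermark_ms: i64) -> bool {
    watermark_ms >= window_start_ms + size_ms
}
```

With a 5-second tumbling window, an event at 12,500 ms falls into the window starting at 10,000 ms. With a 10-second window sliding every 5 seconds, the same event belongs to two windows, starting at 5,000 ms and 10,000 ms.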

Built On

| Component | Purpose |
|---|---|
| Apache Arrow | In-memory columnar format |
| DataFusion | Query engine and expression evaluation |
| RocksDB | State management and checkpointing |
| Tokio | Async runtime for task-per-stage execution |
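The write-behind state caching mentioned under Key Features can be pictured with the sketch below: updates accumulate in an in-memory map and reach the durable store only in batches. Everything here is an assumption for illustration. The `StateBackend` trait, the `InMemoryBackend` stand-in for RocksDB, and the size-based flush threshold are hypothetical and do not describe Volley's actual types or flush policy.

```rust
use std::collections::HashMap;

/// Hypothetical interface standing in for a durable store such as RocksDB.
trait StateBackend {
    fn get(&self, key: &str) -> Option<i64>;
    fn write_batch(&mut self, entries: Vec<(String, i64)>);
}

/// In-memory stand-in so the sketch runs without external crates.
#[derive(Default)]
struct InMemoryBackend {
    data: HashMap<String, i64>,
    batches_written: usize,
}

impl StateBackend for InMemoryBackend {
    fn get(&self, key: &str) -> Option<i64> {
        self.data.get(key).copied()
    }

    fn write_batch(&mut self, entries: Vec<(String, i64)>) {
        self.batches_written += 1;
        self.data.extend(entries);
    }
}

/// Accumulates per-key updates in memory and writes them behind in batches.
struct WriteBehindCache<B: StateBackend> {
    backend: B,
    dirty: HashMap<String, i64>,
    flush_threshold: usize,
}

impl<B: StateBackend> WriteBehindCache<B> {
    fn new(backend: B, flush_threshold: usize) -> Self {
        Self { backend, dirty: HashMap::new(), flush_threshold }
    }

    /// Read the cached value first, falling back to the backend.
    fn get(&self, key: &str) -> Option<i64> {
        self.dirty.get(key).copied().or_else(|| self.backend.get(key))
    }

    /// Add `delta` to the key's accumulator without touching the backend,
    /// unless the number of dirty keys reaches the flush threshold.
    fn add(&mut self, key: &str, delta: i64) {
        let current = self.get(key).unwrap_or(0);
        self.dirty.insert(key.to_string(), current + delta);
        if self.dirty.len() >= self.flush_threshold {
            self.flush();
        }
    }

    /// Write all dirty entries to the backend as a single batch.
    fn flush(&mut self) {
        if self.dirty.is_empty() {
            return;
        }
        let entries: Vec<(String, i64)> = self.dirty.drain().collect();
        self.backend.write_batch(entries);
    }
}
```

Repeated updates to a hot key cost one map insert each and a single backend write per flush. In a checkpointed system the cache would presumably also be flushed before a snapshot is taken, which this sketch leaves to the caller via `flush()`.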

Architecture

+-----------------------------------------------------+
|  Stream Processing Layer (Rust)                     |
|  +- Fluent Builder API (DataStream, KeyedStream)    |
|  +- Multi-Row RecordBatch Operators                 |
|  +- Watermarks & Event Time                         |
|  +- Windowing (Tumbling, Sliding, Session)          |
|  +- Intra-Node Parallelism (hash-partitioned)       |
|  +- Checkpoint Coordinator                          |
+-----------------------+-----------------------------+
                        |
+-----------------------v-----------------------------+
|  Async Runtime (Tokio)                              |
|  +- Task-per-Stage Execution                        |
|  +- Bounded Channels (Backpressure)                 |
|  +- Graceful Shutdown                               |
+-----------------------+-----------------------------+
                        |
+-----------------------v-----------------------------+
|  State Management (RocksDB)                         |
|  +- Write-Behind Accumulator Cache                  |
|  +- BatchStateBackend (multi_get / WriteBatch)      |
|  +- Per-Key State Isolation                         |
|  +- Hardlink Checkpoints                            |
+-----------------------+-----------------------------+
                        |
+-----------------------v-----------------------------+
|  Connectors                                         |
|  +- Kafka (exactly-once, feature-gated)             |
|  +- Cloud Blob Stores (S3, Azure, GCS)              |
|  +- Delta Lake, Iceberg (exactly-once sinks)        |
|  +- Memory (batched source, counting sink)          |
+-----------------------------------------------------+
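Two ideas from the diagram, hash-partitioned parallelism and backpressure through bounded channels, can be sketched with the standard library alone. This is an illustration under stated assumptions, not Volley's implementation: it uses OS threads and `std::sync::mpsc::sync_channel` in place of Tokio tasks and channels, and `partition_for` and `run_keyed_sum` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::mpsc::sync_channel;
use std::thread;

/// Map a key to one of `parallelism` operator instances.
/// Hypothetical helper; the hash function Volley uses is not stated in the source.
fn partition_for<K: Hash>(key: &K, parallelism: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % parallelism as u64) as usize
}

/// Route `(key, value)` records to `parallelism` workers over bounded channels
/// and return the per-key sums computed by each worker.
fn run_keyed_sum(
    records: Vec<(String, i64)>,
    parallelism: usize,
    channel_capacity: usize,
) -> Vec<HashMap<String, i64>> {
    assert!(parallelism > 0, "parallelism must be at least 1");
    let mut senders = Vec::with_capacity(parallelism);
    let mut workers = Vec::with_capacity(parallelism);

    for _ in 0..parallelism {
        // A bounded channel: at most `channel_capacity` records can be queued.
        let (tx, rx) = sync_channel::<(String, i64)>(channel_capacity);
        senders.push(tx);
        workers.push(thread::spawn(move || {
            let mut sums: HashMap<String, i64> = HashMap::new();
            // The loop ends when every sender has been dropped.
            for (key, value) in rx {
                *sums.entry(key).or_insert(0) += value;
            }
            sums
        }));
    }

    for (key, value) in records {
        let idx = partition_for(&key, parallelism);
        // `send` blocks while the worker's queue is full, so a slow
        // worker slows the producer down: this is the backpressure.
        senders[idx].send((key, value)).expect("worker hung up");
    }
    drop(senders); // Signal end of input.

    workers
        .into_iter()
        .map(|w| w.join().expect("worker panicked"))
        .collect()
}
```

Because every record with the same key hashes to the same worker, each worker owns its keys exclusively and no locking is needed on the per-key state.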

Performance

Benchmarked on a single node (12 CPUs, Apple Silicon):

| Configuration | Throughput |
|---|---|
| Single-row, parallelism=1 | 850K rec/s |
| Batch=1000, parallelism=4 | 8.7M rec/s |
| Sustained 10s (batch=1000, parallelism=4) | 9.4M rec/s |
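As a rough way to read these figures, the helpers below compute the ratio between two configurations, the average throughput per parallel instance, and the average time per record. This reads the table as 850K rec/s at parallelism=1 and 8.7M rec/s at parallelism=4. The helper names are hypothetical, and the per-instance figure is only an average: the source does not say that throughput scales evenly across instances.

```rust
/// Ratio of an improved throughput to a baseline (both in records per second).
fn speedup(baseline_rps: f64, improved_rps: f64) -> f64 {
    improved_rps / baseline_rps
}

/// Average throughput per parallel operator instance.
/// Assumes load is spread evenly, which the benchmark does not confirm.
fn per_instance_rps(total_rps: f64, parallelism: u32) -> f64 {
    total_rps / f64::from(parallelism)
}

/// Average time spent per record, in nanoseconds, at a given throughput.
fn nanos_per_record(rps: f64) -> f64 {
    1e9 / rps
}
```

Under that reading, `speedup(850_000.0, 8_700_000.0)` is about 10.2, and `per_instance_rps(8_700_000.0, 4)` is 2.175M rec/s, roughly 2.6 times the single-row figure. Since the two configurations differ in both batch size and parallelism, the 10x gap cannot be attributed to batching alone.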

Connectors

Sources

| Connector | Formats | Status |
|---|---|---|
| Kafka | JSON | Available (feature-gated) |
| AWS S3 + SQS | Parquet, JSON, CSV, Avro | Available |
| Azure Blob + Queue | Parquet, JSON, CSV, Avro | Available |
| GCS + Pub/Sub | Parquet, JSON, CSV, Avro | Available |
| Memory/Iterator | Arrow RecordBatch | Available |

Sinks

| Connector | Formats | Status |
|---|---|---|
| Kafka | JSON | Available (feature-gated, transactional) |
| Delta Lake | Parquet | Available (exactly-once) |
| Apache Iceberg | Parquet | Available (REST catalog) |
| Memory (collect) | Arrow RecordBatch | Available |

Next Steps