Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Introduction

Volley is a Rust library for building streaming pipelines — read from Kafka or a bucket, filter and aggregate with SQL-style expressions, write to Kafka, Delta, Iceberg, or a database, and have every record counted exactly once even when things crash. Pipelines are single Rust binaries: no JVM, no separate job server, and every operator is checked by the compiler before it ever runs.

If you’ve used Flink or Kafka Streams, the shape is familiar — a fluent DataStream builder, event-time windows, checkpoints. If you’ve used DataFusion, Polars, or DuckDB, the operators are familiar — expressions over Arrow RecordBatches. Volley is the place those two worlds meet.

The API is type-safe at compile time, the docs are structured for both humans and AI coding agents, and every example below is runnable out of the box.

Key Features

  • High Performance — Native Arrow batching with write-behind state caching. Validated via NEXMark benchmarks
  • Exactly-Once Semantics — Checkpoint-based fault tolerance with aligned barrier snapshotting
  • Flink-like API — Fluent DataStream API with typestate-enforced compile-time safety
  • Expression Engine — Vectorized filter_expr(), select_expr(), aggregate_expr() powered by DataFusion. You write the same kinds of predicates and projections you’d write in SQL (col("price").gt(lit(100)), sum(col("amount"))) and they run column-at-a-time on native Arrow types
  • Event-Time Windowing — Tumbling, sliding, and session windows with watermark tracking
  • State Management — RocksDB backend with write-behind cache and hardlink checkpoints
  • Intra-Node Parallelism — Hash-partitioned parallel operator instances
  • Kafka Integration — Source and sink with exactly-once transactional semantics
  • Cloud Blob Sources — S3+SQS and Azure Blob+Queue with Parquet/JSON/CSV/Avro
  • Cloud Blob Sinks — S3, Azure Blob, and GCS with Parquet/JSON Lines/CSV
  • Table Format Sinks — Delta Lake and Apache Iceberg with exactly-once commits
  • Kubernetes Operator — VolleyApplication CRD with autoscaling, health checks, and Prometheus

Built On

ComponentPurpose
Apache ArrowIn-memory columnar format
DataFusionQuery engine and expression evaluation
RocksDBState management and checkpointing
TokioAsync runtime for task-per-stage execution

Architecture

+-----------------------------------------------------+
|  Stream Processing Layer (Rust)                     |
|  +- Fluent Builder API (DataStream, KeyedStream)    |
|  +- Multi-Row RecordBatch Operators                 |
|  +- Watermarks & Event Time                         |
|  +- Windowing (Tumbling, Sliding, Session)          |
|  +- Intra-Node Parallelism (hash-partitioned)       |
|  +- Checkpoint Coordinator                          |
+-----------------------+-----------------------------+
                        |
+-----------------------v-----------------------------+
|  Async Runtime (Tokio)                              |
|  +- Task-per-Stage Execution                        |
|  +- Bounded Channels (Backpressure)                 |
|  +- Graceful Shutdown                               |
+-----------------------+-----------------------------+
                        |
+-----------------------v-----------------------------+
|  State Management (RocksDB)                         |
|  +- Write-Behind Accumulator Cache                  |
|  +- BatchStateBackend (multi_get / WriteBatch)      |
|  +- Per-Key State Isolation                         |
|  +- Hardlink Checkpoints                            |
+-----------------------+-----------------------------+
                        |
+-----------------------v-----------------------------+
|  Connectors                                         |
|  +- Kafka (exactly-once, feature-gated)             |
|  +- Cloud Blob Stores (S3, Azure, GCS)              |
|  +- Delta Lake, Iceberg (exactly-once sinks)        |
|  +- Memory (batched source, counting sink)          |
+-----------------------------------------------------+

Performance

Volley uses the NEXMark benchmark suite for performance evaluation — the industry standard for stream processing systems. See volley-benchmark/ for methodology, results, and Flink comparison baseline.

Connectors

Sources

ConnectorFormatsStatus
KafkaJSON, Protobuf, Avro (Schema Registry)Available (feature-gated)
AWS S3 + SQSParquet, JSON, CSV, AvroAvailable
Azure Blob + QueueParquet, JSON, CSV, AvroAvailable
Memory/IteratorArrow RecordBatchAvailable

Sinks

ConnectorFormatsStatus
KafkaJSON, Protobuf, Avro (Schema Registry)Available (feature-gated, transactional)
AWS S3 · Azure Blob · GCSParquet, JSON Lines, CSVAvailable (BufferedBlobSink)
Delta LakeParquetAvailable (exactly-once)
Apache IcebergParquetAvailable (REST catalog)
Memory (collect)Arrow RecordBatchAvailable

Next Steps