Data Model

Volley moves your data in batches, not one row at a time. A batch is an Arrow RecordBatch: a chunk of rows stored in columnar form, passed between stages without copying. Operators see whole batches so they can work on many rows at once — the same technique a query engine uses to keep the CPU busy.

If you’ve used DataFusion, Polars, or DuckDB, this is the same RecordBatch you already know. If not: think “a page of a columnar table” — each column is a contiguous array, and operators like filter and sum work over the whole column in one pass.
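To make the "page of a columnar table" picture concrete, here is a minimal sketch in plain Rust. It does not use the Arrow crate: `Batch`, `take`, `filter_amount_at_least`, and `sum_amount` are hypothetical names made up for this illustration, and plain `Vec`s stand in for Arrow arrays. It only shows the shape of the work: each column is one contiguous array, and filter and sum each walk a column in a single pass.

```rust
/// Hypothetical stand-in for a two-column batch. A real Arrow RecordBatch
/// also carries a schema, validity (null) bitmaps, and typed arrays.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    user_id: Vec<i64>, // column 0: one contiguous array
    amount: Vec<f64>,  // column 1: same length as column 0
}

/// Keep the values of `col` whose position is marked `true` in `mask`.
fn take<T: Copy>(col: &[T], mask: &[bool]) -> Vec<T> {
    col.iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| *value)
        .collect()
}

impl Batch {
    fn num_rows(&self) -> usize {
        self.user_id.len()
    }

    /// Filter: compute a keep-mask from one column in a single pass,
    /// then apply the same mask to every column.
    fn filter_amount_at_least(&self, min: f64) -> Batch {
        let mask: Vec<bool> = self.amount.iter().map(|a| *a >= min).collect();
        Batch {
            user_id: take(&self.user_id, &mask),
            amount: take(&self.amount, &mask),
        }
    }

    /// Sum: one pass over a single contiguous column.
    fn sum_amount(&self) -> f64 {
        self.amount.iter().sum()
    }
}
```

Neither operation ever visits a row as a unit. The filter reads only the `amount` column to decide what to keep, and the sum never touches `user_id` at all.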

StreamRecord

The StreamRecord wrapper adds metadata to each batch:

struct StreamRecord {
    batch: RecordBatch,                 // columnar Arrow data (one or more rows)
    event_time_ms: Option<i64>,         // event timestamp in milliseconds (for windowing)
    trace: Option<RecordTraceContext>,  // per-record trace context (for distributed tracing)
}

batch — the actual payload, a columnar Arrow RecordBatch
event_time_ms — event timestamp used by window assignment and watermark tracking
trace — optional trace context for per-record distributed tracing. Set by the runtime’s sampler or extracted from source message headers (e.g., Kafka traceparent). When present, the runtime creates OpenTelemetry child spans at each pipeline stage.
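As an illustration of how `event_time_ms` can drive window assignment, the sketch below maps a record to the start of a fixed-size (tumbling) window. Treat it as a sketch under assumptions, not Volley's implementation: `RecordBatch` and `RecordTraceContext` are simplified stand-ins so the example compiles without any crates, their fields are invented, and `tumbling_window_start` is a hypothetical helper. This page does not say which window types Volley provides.

```rust
/// Hypothetical stand-in for arrow's RecordBatch (only a row count here).
#[derive(Debug, Clone, PartialEq)]
struct RecordBatch {
    num_rows: usize,
}

/// Hypothetical stand-in; the real trace context type is not shown on this page.
#[derive(Debug, Clone, PartialEq)]
struct RecordTraceContext {
    traceparent: String,
}

/// Same three fields as the StreamRecord described above.
#[derive(Debug, Clone, PartialEq)]
struct StreamRecord {
    batch: RecordBatch,
    event_time_ms: Option<i64>,
    trace: Option<RecordTraceContext>,
}

/// Start (in ms) of the tumbling window that contains the record's event time.
/// Returns None when the record has no event time or the window size is not positive.
fn tumbling_window_start(record: &StreamRecord, size_ms: i64) -> Option<i64> {
    if size_ms <= 0 {
        return None;
    }
    // rem_euclid keeps the remainder non-negative, so timestamps before the
    // epoch still round down to the start of their window.
    record.event_time_ms.map(|ts| ts - ts.rem_euclid(size_ms))
}
```

With a 60,000 ms window, a record stamped 125,000 ms lands in the window that starts at 120,000 ms. A record with `event_time_ms: None` gets no window, which is why the field is an `Option`.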

StreamElement

Records share channels with control signals. The StreamElement enum wraps all possible messages:

enum StreamElement {
    Data(StreamRecord),
    Barrier(CheckpointBarrier),
    Watermark(Watermark),
}

Data — normal data records flowing through the pipeline
Barrier — checkpoint coordination signals injected by the Checkpoint Coordinator
Watermark — event-time progress markers for window triggering

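Barriers and watermarks flow in-band alongside data records through the same bounded channels. Because a barrier arrives in order with the data around it, an operator can snapshot its state when the barrier reaches it, and that snapshot reflects exactly the records sent before the barrier. This is what enables aligned checkpoints without pausing the pipeline.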
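The in-band idea can be sketched with the standard library alone. In the example below, one bounded channel carries data, a barrier, and a watermark, and a single operator handles each element in arrival order. Everything beyond the three variant names is an assumption made for illustration: the payload fields (`rows`, `checkpoint_id`, `timestamp_ms`), the `RowCounter` operator, and `run_demo` are hypothetical, and `std::sync::mpsc::sync_channel` stands in for whatever channel type Volley uses. It covers a single input only, so it does not show how barriers are aligned across several inputs.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Simplified, hypothetical payloads; the real types carry more than this.
#[derive(Debug)]
struct StreamRecord { rows: usize }
#[derive(Debug)]
struct CheckpointBarrier { checkpoint_id: u64 }
#[derive(Debug)]
struct Watermark { timestamp_ms: i64 }

#[derive(Debug)]
enum StreamElement {
    Data(StreamRecord),
    Barrier(CheckpointBarrier),
    Watermark(Watermark),
}

/// Hypothetical operator: counts rows and snapshots the count at each barrier.
#[derive(Debug, Default)]
struct RowCounter {
    rows_seen: usize,
    current_watermark_ms: Option<i64>,
    snapshots: Vec<(u64, usize)>, // (checkpoint id, rows seen before the barrier)
}

impl RowCounter {
    fn handle(&mut self, element: StreamElement) {
        match element {
            StreamElement::Data(record) => self.rows_seen += record.rows,
            // The barrier arrives in order, so the state at this point covers
            // exactly the records that were sent before it.
            StreamElement::Barrier(b) => self.snapshots.push((b.checkpoint_id, self.rows_seen)),
            // Watermarks only move event time forward.
            StreamElement::Watermark(w) => {
                let latest = self
                    .current_watermark_ms
                    .map_or(w.timestamp_ms, |cur| cur.max(w.timestamp_ms));
                self.current_watermark_ms = Some(latest);
            }
        }
    }
}

/// Send a small stream through one bounded channel and return the operator state.
fn run_demo() -> RowCounter {
    let (tx, rx) = sync_channel::<StreamElement>(2); // bounded: capacity of 2 elements
    let producer = thread::spawn(move || {
        let stream = vec![
            StreamElement::Data(StreamRecord { rows: 3 }),
            StreamElement::Barrier(CheckpointBarrier { checkpoint_id: 1 }),
            StreamElement::Data(StreamRecord { rows: 4 }),
            StreamElement::Watermark(Watermark { timestamp_ms: 1_000 }),
        ];
        for element in stream {
            tx.send(element).unwrap(); // blocks while the channel is full
        }
    });
    let mut op = RowCounter::default();
    for element in rx {
        op.handle(element); // data and control signals, in arrival order
    }
    producer.join().unwrap();
    op
}
```

After `run_demo()`, the snapshot for checkpoint 1 records 3 rows even though the operator went on to count 7. The operator never stopped reading from the channel to take it.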

Why Arrow?

Columnar format — efficient for analytical operations (aggregations, filters)
Zero-copy — data passes between pipeline stages without serialization
Multi-row batching — operators process many rows at once, amortizing per-record overhead
Ecosystem interop — Arrow is the standard for DataFusion, Parquet, Delta Lake, Iceberg, and most analytics tooling
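The zero-copy point can be illustrated without Arrow itself. Arrow's Rust arrays are reference-counted, so handing a batch to the next stage shares the underlying buffers rather than duplicating them. The sketch below imitates that with `Arc<[i64]>`; `Batch`, `hand_off`, and `stage_sum` are hypothetical names, and this is an analogy for the mechanism, not Volley's code.

```rust
use std::sync::Arc;

/// Hypothetical one-column batch whose data lives behind a reference count.
#[derive(Clone)]
struct Batch {
    values: Arc<[i64]>,
}

/// Passing a batch downstream clones the handle: the reference count goes up,
/// but the column's elements are neither copied nor serialized.
fn hand_off(batch: &Batch) -> Batch {
    batch.clone()
}

/// A downstream stage reads the shared column directly.
fn stage_sum(batch: &Batch) -> i64 {
    batch.values.iter().sum()
}
```

After `let b = hand_off(&a);`, `Arc::ptr_eq(&a.values, &b.values)` is true: both stages point at the same memory.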
