Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Data Model

Volley moves your data in batches, not one row at a time. A batch is an Arrow RecordBatch: a chunk of rows stored in columnar form, passed between stages without copying. Operators see whole batches so they can work on many rows at once — the same technique a query engine uses to keep the CPU busy.

If you’ve used DataFusion, Polars, or DuckDB, this is the same RecordBatch you already know. If not: think “a page of a columnar table” — each column is a contiguous array, and operators like filter and sum work over the whole column in one pass.

StreamRecord

The StreamRecord wrapper adds metadata to each batch:

StreamRecord {
    batch: RecordBatch,                       // columnar Arrow data (1 or more rows)
    event_time_ms: Option<i64>,               // event timestamp (for windowing)
    trace: Option<RecordTraceContext>,         // per-record trace context (for distributed tracing)
}
  • batch — the actual payload, a columnar Arrow RecordBatch
  • event_time_ms — event timestamp used by window assignment and watermark tracking
  • trace — optional trace context for per-record distributed tracing. Set by the runtime’s sampler or extracted from source message headers (e.g., Kafka traceparent). When present, the runtime creates OpenTelemetry child spans at each pipeline stage.

StreamElement

Records share channels with control signals. The StreamElement enum wraps all possible messages:

StreamElement = Data(StreamRecord)
              | Barrier(CheckpointBarrier)
              | Watermark(Watermark)
  • Data — normal data records flowing through the pipeline
  • Barrier — checkpoint coordination signals injected by the Checkpoint Coordinator
  • Watermark — event-time progress markers for window triggering

Barriers and watermarks flow in-band alongside data records through the same bounded channels. This is what enables aligned checkpoints without pausing the pipeline.

Why Arrow?

  • Columnar format — efficient for analytical operations (aggregations, filters)
  • Zero-copy — data passes between pipeline stages without serialization
  • Multi-row batching — operators process many rows at once, amortizing per-record overhead
  • Ecosystem interop — Arrow is the standard for DataFusion, Parquet, Delta Lake, Iceberg, and most analytics tooling