Data Model

All data in Volley flows as Apache Arrow RecordBatches — a columnar, zero-copy, language-interoperable format. Operators therefore process batches of rows rather than individual records, enabling vectorized computation.

StreamRecord

The StreamRecord wrapper adds metadata to each batch:

StreamRecord {
    batch: RecordBatch,                 // columnar Arrow data (1 or more rows)
    event_time_ms: Option<i64>,         // event timestamp (for windowing)
    trace: Option<RecordTraceContext>,  // per-record trace context (for distributed tracing)
}
  • batch — the actual payload, a columnar Arrow RecordBatch
  • event_time_ms — event timestamp used by window assignment and watermark tracking
  • trace — optional trace context for per-record distributed tracing. Set by the runtime’s sampler or extracted from source message headers (e.g., Kafka traceparent). When present, the runtime creates OpenTelemetry child spans at each pipeline stage.
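To make the shape concrete, here is a minimal, self-contained sketch of the wrapper. The `RecordBatch` and `RecordTraceContext` bodies are stand-in stubs for illustration (the real project uses Arrow's `RecordBatch` and its own trace type); only the three-field `StreamRecord` layout comes from the definition above.

```rust
// Stub standing in for arrow::record_batch::RecordBatch (hypothetical).
#[derive(Debug, Clone)]
struct RecordBatch {
    num_rows: usize,
}

// Stub trace context, e.g. carrying a W3C traceparent extracted from
// a Kafka message header (hypothetical field).
#[derive(Debug, Clone)]
struct RecordTraceContext {
    traceparent: String,
}

// The wrapper itself, matching the layout described above.
#[derive(Debug, Clone)]
struct StreamRecord {
    batch: RecordBatch,
    event_time_ms: Option<i64>,
    trace: Option<RecordTraceContext>,
}

fn main() {
    // A batch of 1024 rows with an event timestamp but no sampled trace.
    let rec = StreamRecord {
        batch: RecordBatch { num_rows: 1024 },
        event_time_ms: Some(1_700_000_000_000),
        trace: None,
    };
    println!("{} rows at t={:?}", rec.batch.num_rows, rec.event_time_ms);
}
```

Note that both `event_time_ms` and `trace` are `Option`s: a record from a source with no timestamp extractor, or one not selected by the trace sampler, simply carries `None`.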

StreamElement

Records share channels with control signals. The StreamElement enum wraps all possible messages:

StreamElement = Data(StreamRecord)
              | Barrier(CheckpointBarrier)
              | Watermark(Watermark)
  • Data — normal data records flowing through the pipeline
  • Barrier — checkpoint coordination signals injected by the Checkpoint Coordinator
  • Watermark — event-time progress markers for window triggering

Barriers and watermarks flow in-band alongside data records through the same bounded channels. This is what enables aligned checkpoints without pausing the pipeline.
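The in-band design can be sketched as a single `match` in an operator's receive loop. The payload types below are illustrative stubs (the real `CheckpointBarrier` and `Watermark` carry more fields); the point is that all three variants arrive on the same channel, so an operator handles a barrier or watermark at exactly the position it occupies in the stream.

```rust
// Stub payload types (hypothetical fields, for illustration only).
struct StreamRecord { rows: usize }
struct CheckpointBarrier { checkpoint_id: u64 }
struct Watermark { timestamp_ms: i64 }

// The in-band element enum, mirroring the definition above.
enum StreamElement {
    Data(StreamRecord),
    Barrier(CheckpointBarrier),
    Watermark(Watermark),
}

// One operator step: every element, data or control, flows through
// the same handler, preserving ordering between records and signals.
fn handle(element: StreamElement) -> String {
    match element {
        StreamElement::Data(r) => format!("process {} rows", r.rows),
        StreamElement::Barrier(b) => {
            // On a barrier, snapshot state before consuming later data.
            format!("snapshot state for checkpoint {}", b.checkpoint_id)
        }
        StreamElement::Watermark(w) => {
            // On a watermark, advance event time and fire ready windows.
            format!("advance event time to {}", w.timestamp_ms)
        }
    }
}

fn main() {
    println!("{}", handle(StreamElement::Data(StreamRecord { rows: 512 })));
    println!("{}", handle(StreamElement::Barrier(CheckpointBarrier { checkpoint_id: 7 })));
    println!("{}", handle(StreamElement::Watermark(Watermark { timestamp_ms: 1_700_000_000_000 })));
}
```

Because a barrier cannot overtake the data that preceded it in the channel, every operator snapshots a state that reflects exactly the records before the barrier — the essence of an aligned checkpoint.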

Why Arrow?

  • Columnar format — efficient for analytical operations (aggregations, filters)
  • Zero-copy — data passes between pipeline stages without serialization
  • Multi-row batching — operators process many rows at once, amortizing per-record overhead
  • Ecosystem interop — Arrow is the standard for DataFusion, Parquet, Delta Lake, Iceberg, and most analytics tooling
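The batching benefit can be illustrated without the Arrow crate: a plain-Rust sketch where one filter-and-sum pass runs over a contiguous column slice, paying the dispatch cost once per batch rather than once per record (the function name here is hypothetical, not part of Volley's API).

```rust
// Columnar-style pass: one tight loop over a contiguous column,
// the kind of code the compiler can auto-vectorize.
fn filtered_sum(values: &[i64], min: i64) -> i64 {
    values.iter().filter(|&&v| v >= min).sum()
}

fn main() {
    // One "column" from a batch of five rows.
    let column = vec![3, 10, 25, 7, 42];
    println!("{}", filtered_sum(&column, 10)); // 10 + 25 + 42 = 77
}
```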