Data Model
All data in Volley flows as Apache Arrow RecordBatches — a columnar, zero-copy, language-interoperable format. Operators therefore process batches of rows rather than individual records, enabling vectorized computation.
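To make the batch-at-a-time model concrete, here is a toy sketch (plain `Vec`s standing in for Arrow arrays; the `Batch` struct and field names are illustrative, not Volley's real types) of the vectorized shape of work an operator does per batch:

```rust
// Illustrative only: a toy columnar batch. Each field is a contiguous
// column, so an operator touches whole columns in tight loops instead
// of dispatching once per row.
struct Batch {
    user_id: Vec<u64>,
    amount: Vec<f64>,
}

/// Sum the `amount` column in one pass over contiguous memory.
fn sum_amounts(batch: &Batch) -> f64 {
    batch.amount.iter().sum()
}

fn main() {
    let batch = Batch {
        user_id: vec![1, 2, 3],
        amount: vec![10.0, 20.0, 12.5],
    };
    assert_eq!(sum_amounts(&batch), 42.5);
}
```

The per-batch function call is amortized across every row in the batch, which is the payoff of the columnar layout described below.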
StreamRecord
The StreamRecord wrapper adds metadata to each batch:
```rust
StreamRecord {
    batch: RecordBatch,                // columnar Arrow data (1 or more rows)
    event_time_ms: Option<i64>,        // event timestamp (for windowing)
    trace: Option<RecordTraceContext>, // per-record trace context (for distributed tracing)
}
```
- `batch` — the actual payload, a columnar Arrow RecordBatch
- `event_time_ms` — event timestamp used by window assignment and watermark tracking
- `trace` — optional trace context for per-record distributed tracing. Set by the runtime's sampler or extracted from source message headers (e.g., Kafka `traceparent`). When present, the runtime creates OpenTelemetry child spans at each pipeline stage.
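As a sketch of how `event_time_ms` might feed window assignment, the snippet below maps a record's event time to the start of a tumbling window. The `window_start` helper and the placeholder `RecordBatch` are assumptions for illustration, not Volley's actual windowing API:

```rust
// Hypothetical sketch: tumbling-window assignment from event_time_ms.
struct RecordBatch; // placeholder for the Arrow type

struct StreamRecord {
    batch: RecordBatch,
    event_time_ms: Option<i64>,
}

/// Start of the tumbling window the record falls into, if it carries
/// an event time. Records without one cannot be window-assigned.
fn window_start(record: &StreamRecord, window_size_ms: i64) -> Option<i64> {
    record.event_time_ms.map(|t| t - t.rem_euclid(window_size_ms))
}

fn main() {
    let rec = StreamRecord { batch: RecordBatch, event_time_ms: Some(12_345) };
    // 10-second tumbling windows: 12,345 ms falls in [10_000, 20_000).
    assert_eq!(window_start(&rec, 10_000), Some(10_000));
}
```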
StreamElement
Records share channels with control signals. The StreamElement enum wraps all possible messages:
```rust
enum StreamElement {
    Data(StreamRecord),
    Barrier(CheckpointBarrier),
    Watermark(Watermark),
}
```
- Data — normal data records flowing through the pipeline
- Barrier — checkpoint coordination signals injected by the Checkpoint Coordinator
- Watermark — event-time progress markers for window triggering
Barriers and watermarks flow in-band alongside data records through the same bounded channels. This is what enables aligned checkpoints without pausing the pipeline.
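A minimal sketch of this in-band flow, using simplified stand-in types and a std `sync_channel` in place of Volley's real bounded channels (field names like `checkpoint_id` and `timestamp_ms` are assumptions for illustration):

```rust
use std::sync::mpsc;

// Simplified stand-ins for the types in this section; the real
// definitions carry Arrow data and richer checkpoint metadata.
struct RecordBatch { rows: usize }
struct StreamRecord { batch: RecordBatch }
struct CheckpointBarrier { checkpoint_id: u64 }
struct Watermark { timestamp_ms: i64 }

enum StreamElement {
    Data(StreamRecord),
    Barrier(CheckpointBarrier),
    Watermark(Watermark),
}

/// One operator step: dispatch on whichever element kind arrived in-band.
fn handle(element: &StreamElement) -> String {
    match element {
        StreamElement::Data(r) => format!("process {} rows", r.batch.rows),
        StreamElement::Watermark(w) => format!("advance event time to {}", w.timestamp_ms),
        StreamElement::Barrier(b) => format!("snapshot state for checkpoint {}", b.checkpoint_id),
    }
}

fn main() {
    // One bounded channel carries all three element kinds, interleaved,
    // so barriers and watermarks are ordered relative to the data.
    let (tx, rx) = mpsc::sync_channel::<StreamElement>(16);
    tx.send(StreamElement::Data(StreamRecord { batch: RecordBatch { rows: 3 } })).unwrap();
    tx.send(StreamElement::Watermark(Watermark { timestamp_ms: 1_000 })).unwrap();
    tx.send(StreamElement::Barrier(CheckpointBarrier { checkpoint_id: 1 })).unwrap();
    drop(tx);

    for element in rx {
        println!("{}", handle(&element));
    }
}
```

Because a barrier is just another element in the queue, an operator snapshots its state exactly at the barrier's position in the stream, which is what makes the checkpoint aligned.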
Why Arrow?
- Columnar format — efficient for analytical operations (aggregations, filters)
- Zero-copy — data passes between pipeline stages without serialization
- Multi-row batching — operators process many rows at once, amortizing per-record overhead
- Ecosystem interop — Arrow is the standard for DataFusion, Parquet, Delta Lake, Iceberg, and most analytics tooling