
Volley SRE Gap Analysis

Date: 2026-03-30
Reference framework: msitarzewski/agency-agents engineering-sre.md


Important Context: Framework vs. Service

Volley is a stream processing framework (library + Kubernetes operator), not a deployed service. This analysis evaluates:

  1. Framework enablement — How well does Volley equip pipeline authors to follow SRE practices?
  2. Framework development practices — How well does Volley’s own CI/testing/release process follow SRE principles?

Org-level concerns (on-call rotations, blameless culture documentation, MTTR tracking) are out of scope — those belong to teams operating Volley-based pipelines, not the framework itself.


Executive Summary

| Pillar | Score | Verdict |
|---|---|---|
| SLOs & Error Budgets | C+ (55%) | SLI-ready metrics + Prometheus recording/alerting rule templates + SLO definition template; no runtime error budget gauge yet |
| Observability | A- (88%) | Strong metrics + tracing + structured JSON logging + alerting rule templates + trace-log correlation + per-connector error counters; missing dashboards |
| Toil Reduction | B- (65%) | CI + K8s operator + Dockerfile + Helm chart; no CD pipeline or benchmark regression tracking |
| Chaos Engineering | C- (40%) | Checkpoint recovery unit-tested + fault injection integration tests for checkpoint/RocksDB failures; no Kafka fault injection |
| Capacity Planning | B- (65%) | KEDA autoscaling scaffolded, benchmarks exist, capacity planning guide with sizing tables and PromQL queries |
| Progressive Rollouts | F (5%) | Operator uses default StatefulSet RollingUpdate; no canary or automated rollback |
| Incident Response Primitives | C+ (55%) | PipelineHealth enum + K8s probes + volley.pipeline.health gauge + alerting rules; missing severity-to-SLO mapping docs |

Overall: Volley has a strong observability foundation, solid Kubernetes integration, SLO recording rules, alerting templates, structured logging with trace-log correlation, a Dockerfile, Helm chart, SLO definition template, fault injection tests, and capacity planning guide. Remaining gaps are dashboard templates, CD pipeline, progressive rollout support, and Kafka fault injection.


Scoring Methodology

| Grade | Range | Meaning |
|---|---|---|
| A | 90-100% | Production-ready, matches SRE framework recommendations |
| B | 70-89% | Strong foundation, minor gaps |
| C | 50-69% | Partial coverage, significant gaps |
| D | 25-49% | Minimal coverage, major gaps |
| F | 0-24% | Absent or negligible |

Criteria per pillar: API completeness, built-in tooling, documentation for pipeline authors, integration with ecosystem tooling (Prometheus, Grafana, K8s).


Pillar 1: SLOs & Error Budgets — C+ (55%)

Current State

Volley exposes metrics that serve as natural SLI candidates (volley-observability/src/metrics.rs):

| Metric | Type | SLI Use |
|---|---|---|
| volley.records.processed | Counter | Availability — total processed records |
| volley.errors | Counter | Availability — error count for SLI ratio |
| volley.stage.latency_ms | Histogram | Latency — processing duration distribution |
| volley.pipeline.uptime_seconds | Gauge | Availability — pipeline uptime |
| volley.checkpoint.failures | Counter | Reliability — checkpoint failure rate |

The PipelineHealth enum (volley-observability/src/health.rs) provides binary health states (Starting, Running, Degraded, Failed, ShuttingDown) but these are probe-level signals, not SLO-aligned measurements.

What Exists

Prometheus recording rules are provided at deploy/prometheus/recording-rules.yaml computing:

  • volley:availability:ratio_rate5m and volley:availability:ratio_rate30d — availability SLI ratios
  • volley:latency_sli:ratio_rate5m — latency SLI (fraction under 100ms)
  • volley:error_budget:remaining — remaining error budget against 99.9% target
  • volley:throughput:rate5m and volley:saturation:avg — capacity signals
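The expressions behind these rules follow from the metric inventory; an illustrative sketch (the shipped deploy/prometheus/recording-rules.yaml may differ in exact label handling):

```yaml
groups:
  - name: volley-slo
    rules:
      - record: volley:availability:ratio_rate5m
        expr: |
          1 - (
            sum(rate(volley_errors_total[5m])) by (job_name)
            /
            sum(rate(volley_records_processed_total[5m])) by (job_name)
          )
      # Remaining budget against the 99.9% target over the 30-day window.
      - record: volley:error_budget:remaining
        expr: |
          1 - (1 - volley:availability:ratio_rate30d) / (1 - 0.999)
```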

Multi-window burn rate alerting rules are provided at deploy/prometheus/alerting-rules.yaml:

  • VolleyHighErrorBurnRate (critical) — 14.4x burn over 1h confirmed by 5m window; at 14.4x, a 30-day error budget is exhausted in about 50 hours, and catching it within an hour costs roughly 2% of the budget
  • VolleySlowErrorBurnRate (warning) — 1x burn over 3d confirmed by 6h window; a steady 1x burn consumes exactly the full budget over the 30-day window

An SLO definition template (deploy/prometheus/slo-template.yaml) lets pipeline authors declare per-pipeline availability and latency targets consumed by these rules.

Remaining Gaps

  • No runtime error budget gauge metric in volley-observability (the recording rule computes it in Prometheus, but no native Rust gauge)
  • No documentation connecting error budget exhaustion to operational decisions (e.g., deployment freeze policy)

Recommendations

(P1) Add an error_budget_remaining gauge to volley-observability that pipeline authors can enable to expose real-time budget consumption natively from the runtime, complementing the Prometheus recording rule.

(P2) Document an error budget policy template: when budget < 25%, freeze non-critical deployments; when budget < 10%, enter incident response mode.
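The gauge's core computation is simple ratio arithmetic; a minimal sketch, assuming the gauge is fed from the existing volley.records.processed and volley.errors counts (the function name and signature are illustrative, not Volley's API):

```rust
/// Fraction of the availability error budget remaining, in [0, 1].
/// budget = 1 - slo_target (e.g. 0.001 for a 99.9% target).
fn error_budget_remaining(processed: u64, errors: u64, slo_target: f64) -> f64 {
    if processed == 0 {
        return 1.0; // no traffic yet: budget untouched
    }
    let error_ratio = errors as f64 / processed as f64;
    let budget = 1.0 - slo_target;
    // Clamp at zero once the budget is fully burned.
    ((budget - error_ratio) / budget).max(0.0)
}

fn main() {
    // 99.9% target with a 0.05% observed error rate: half the budget burned.
    let remaining = error_budget_remaining(1_000_000, 500, 0.999);
    assert!((remaining - 0.5).abs() < 1e-9);
}
```

In the runtime this value would be published through the existing set_gauge helper on each metrics tick.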
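A per-pipeline SLO declaration might look like this (hypothetical schema for illustration; the actual template format may differ):

```yaml
slo:
  pipeline: orders-enrichment   # illustrative pipeline name
  window: 30d
  availability:
    target: 0.999               # at most 0.1% failed records
  latency:
    threshold_ms: 100
    target: 0.99                # 99% of records processed under 100ms
```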


Pillar 2: Observability — A- (88%)

Current State

Metrics (volley-observability/src/metrics.rs): 15+ named metric constants covering sources, operators, checkpoints, channels, watermarks, sinks, pipeline health, Kafka, and blob storage. All use node_id/job_name labels. Three helper functions (increment_counter, set_gauge, record_histogram) enforce consistent labeling.

Tracing (volley-observability/src/tracing_init.rs): OpenTelemetry integration via OTLP gRPC export with tracing-opentelemetry bridge. Resource attributes include service.name and node.id. Graceful fallback if OTLP endpoint unavailable.

Health (volley-observability/src/health.rs): HealthReporter with atomic state tracking, liveness/readiness semantics. K8s probes configured in operator’s StatefulSet builder.

ServiceMonitor (volley-operator/src/crd.rs): Auto-created by operator with configurable scrape interval and labels.

What Exists

  • Structured JSON logging: ObservabilityConfig supports a log_format field with Text (default) and Json modes. When set to Json, tracing_subscriber::fmt::json() is used for structured output compatible with log aggregation systems (Loki, Elasticsearch, CloudWatch Logs). Pipeline authors enable it via .with_json_logging().
  • Alerting rule templates: Prometheus alerting rules are provided at deploy/prometheus/alerting-rules.yaml covering error burn rate (fast + slow), pipeline health state, checkpoint failures, channel saturation, Kafka consumer lag, and p99 latency.

What Was Added

  • Trace-log correlation: TraceContextFormat in volley-core/src/observability/trace_context.rs safely injects trace_id and span_id fields into JSON log output via serde_json round-trip when an OpenTelemetry span is active, enabling cross-referencing between logs and distributed traces.
  • Per-connector error counters: volley.source.errors and volley.sink.errors metrics defined in volley-observability/src/metrics.rs with source_id/sink_id labels, and incremented in volley-core/src/builder/runtime.rs on poll/write/flush failures. connector_id() method added to Source, Sink, DynSource, and DynSink traits, implemented across Kafka, blob store, and Iceberg connectors.
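With JSON logging and trace-log correlation both enabled, a log line would carry the trace context inline; an illustrative example (field names beyond trace_id/span_id are assumptions):

```json
{
  "timestamp": "2026-03-30T12:00:00Z",
  "level": "WARN",
  "message": "checkpoint retry scheduled",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "job_name": "orders-enrichment"
}
```

The trace_id can then be pasted into the tracing backend to pull up the corresponding distributed trace.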

Remaining Gaps

  • No dashboard templates: No Grafana dashboard JSON mapping the four golden signals to Volley metrics.

Recommendations

(P0) Provide a golden signals Grafana dashboard template using Volley’s actual metrics:

| Golden Signal | Volley Metric | PromQL |
|---|---|---|
| Latency | volley.stage.latency_ms | `histogram_quantile(0.99, rate(volley_stage_latency_ms_bucket[5m]))` |
| Traffic | volley.source.records_polled | `sum(rate(volley_source_records_polled_total[5m])) by (job_name)` |
| Errors | volley.errors | `sum(rate(volley_errors_total[5m])) by (job_name, error_type)` |
| Saturation | volley.channel.utilization | `avg(volley_channel_utilization) by (job_name, stage)` |

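A minimal skeleton for such a dashboard (heavily trimmed; a real Grafana export also carries datasource, layout, and schema metadata):

```json
{
  "title": "Volley Golden Signals",
  "panels": [
    {
      "title": "p99 stage latency (ms)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(volley_stage_latency_ms_bucket[5m])) by (le, job_name))"
        }
      ]
    },
    {
      "title": "Error rate by type",
      "type": "timeseries",
      "targets": [
        { "expr": "sum(rate(volley_errors_total[5m])) by (job_name, error_type)" }
      ]
    }
  ]
}
```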


Pillar 3: Toil Reduction — B- (65%)

Current State

CI pipeline (.github/workflows/ci.yml): Format checking (cargo fmt), linting (cargo clippy with default + kafka features), testing (default + kafka variants), benchmark compilation check. Uses Rust cache for faster builds.

K8s operator (volley-operator/): Automates StatefulSet, headless Service, metrics Service, ServiceMonitor, PodDisruptionBudget, and KEDA ScaledObject creation from a single VolleyApplication CRD. Server-side apply with idempotent reconciliation.

Checkpoint cleanup: RecoveryCoordinator handles automated old checkpoint cleanup.

What Exists

  • Dockerfile: A multi-stage Dockerfile is provided at deploy/docker/Dockerfile with a Rust builder stage and minimal distroless runtime stage. Pipeline authors can extend or customize it for their specific binaries.
  • Helm chart: A Helm chart exists at deploy/helm/volley/ that templates the VolleyApplication CRD with values.yaml covering all CRD fields (resources, storage, observability, health, autoscaling, Kafka, isolation).
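The described builder/distroless structure implies roughly the following shape (the binary name my-pipeline is illustrative; the shipped deploy/docker/Dockerfile may differ in base images and build flags):

```dockerfile
# Builder stage: compile the pipeline binary in release mode.
FROM rust:1.77 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release --bin my-pipeline

# Runtime stage: minimal distroless image with only the binary.
FROM gcr.io/distroless/cc-debian12
COPY --from=builder /app/target/release/my-pipeline /usr/local/bin/my-pipeline
ENTRYPOINT ["/usr/local/bin/my-pipeline"]
```

Pipeline authors would swap in their own binary name and any extra runtime assets.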

Remaining Gaps

  • Dockerfile not referenced in documentation: The README and operator docs do not mention the Dockerfile or describe how pipeline authors should customize it for their applications.
  • Helm chart has no production-hardening overlays: No example values files for common deployment topologies (small/medium/large, multi-AZ).
  • No CD pipeline: CI exists but no automated release workflow (crate publishing, image building).
  • No benchmark regression tracking: Benchmarks compile-checked but results not tracked over time.

Recommendations

(P1) Add a CD workflow (.github/workflows/release.yml) triggered on git tags that builds container images and publishes to a registry.
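One possible shape for such a workflow, using the standard Docker build actions (the registry and tag scheme are placeholders, not project decisions):

```yaml
name: release
on:
  push:
    tags: ["v*"]
jobs:
  image:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          file: deploy/docker/Dockerfile
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.ref_name }}
```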

(P1) Add production-hardened example values files to the Helm chart (e.g., values-production.yaml with resource limits, anti-affinity, KEDA autoscaling).

(P2) Add benchmark regression detection in CI using criterion with stored baselines.


Pillar 4: Chaos Engineering — C- (40%)

Current State

Checkpoint recovery is covered by unit tests: the HealthReporter transitions through Starting -> Running -> Degraded -> Failed states are verified. Fault injection integration tests (volley-core/tests/fault_injection_test.rs) exercise checkpoint and RocksDB failure paths. The operator handles reconciliation errors with retry backoff (15s requeue on failure).

Remaining Gaps

  • No Kafka fault injection tests (e.g., broker disconnect, network degradation).
  • No integration tests exercising the Degraded -> Running recovery path under realistic failure.
  • No Chaos Mesh or LitmusChaos experiment definitions for K8s-deployed pipelines.

Recommendations

(P1) Extend the fault injection suite to Kafka failures, such as a broker disconnect during an exactly-once transaction commit.
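A fault injection test of this kind is typically driven by a store that fails after N writes; a self-contained sketch (CheckpointStore and FlakyStore are illustrative names, not Volley's API):

```rust
use std::io::{Error, ErrorKind};

/// Minimal stand-in for a checkpoint persistence interface.
trait CheckpointStore {
    fn persist(&mut self, epoch: u64) -> std::io::Result<()>;
}

/// Store that succeeds for the first N writes, then fails every write.
struct FlakyStore {
    writes_before_failure: u32,
    writes: u32,
}

impl CheckpointStore for FlakyStore {
    fn persist(&mut self, _epoch: u64) -> std::io::Result<()> {
        self.writes += 1;
        if self.writes > self.writes_before_failure {
            Err(Error::new(ErrorKind::Other, "injected checkpoint failure"))
        } else {
            Ok(())
        }
    }
}

fn main() {
    let mut store = FlakyStore { writes_before_failure: 2, writes: 0 };
    assert!(store.persist(1).is_ok());
    assert!(store.persist(2).is_ok());
    // Third checkpoint hits the injected fault; a real test would then
    // assert the pipeline reports Degraded and recovers once it clears.
    assert!(store.persist(3).is_err());
}
```

The same pattern generalizes to injected Kafka commit failures by faulting the commit path instead of the persist path.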

(P1) Provide Chaos Mesh experiment templates for K8s-deployed Volley applications:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: volley-pod-kill
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app.kubernetes.io/managed-by: volley-operator
  scheduler:
    cron: "@every 30m"
```

(P2) Add toxiproxy integration to Kafka integration tests for network degradation simulation.


Pillar 5: Capacity Planning — B- (65%)

Current State

Autoscaling: KEDA support fully scaffolded in operator (AutoscalingSpec in CRD) with Kafka lag, Prometheus metric, and CPU-based triggers. ScaledObject is created during reconciliation.

Benchmarks: benchmark_throughput and benchmark_sustained examples demonstrate 9.4M rec/s sustained on a single node.

Resource management: CRD requires explicit resources.requests and resources.limits for CPU/memory.

Remaining Gaps

  • No tested KEDA trigger threshold examples.
  • No resource utilization dashboards.

What Exists

The capacity planning guide (docs/src/operations/capacity-planning.md) covers:

  • CPU/memory requirements per records/sec (based on benchmark data)
  • RocksDB storage growth formula: key_count * avg_value_size * checkpoint_retention_count
  • PVC sizing guidance based on state size and checkpoint interval
  • Key PromQL queries for capacity monitoring:
    • Throughput: rate(volley_source_records_polled_total[5m])
    • Backpressure: volley_channel_utilization > 0.8
    • Input backlog: volley_kafka_consumer_lag

Recommendations

(P2) Document tested KEDA trigger configurations with proven thresholds (e.g., scale at consumer lag > 10000).
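The storage growth formula reduces to straightforward arithmetic; a worked example with illustrative numbers (not measured Volley figures):

```rust
/// Estimated RocksDB footprint per the guide's formula:
/// key_count * avg_value_size * checkpoint_retention_count.
fn rocksdb_storage_bytes(key_count: u64, avg_value_size: u64, retention: u64) -> u64 {
    key_count * avg_value_size * retention
}

fn main() {
    // 10M keys x 512 B values x 3 retained checkpoints.
    let bytes = rocksdb_storage_bytes(10_000_000, 512, 3);
    assert_eq!(bytes, 15_360_000_000);
    // Roughly 15.4 GB (decimal); size the PVC with headroom above this.
    println!("{:.1} GB", bytes as f64 / 1e9);
}
```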


Pillar 6: Progressive Rollouts — F (5%)

Current State

The operator creates StatefulSets with a configurable replicas count but uses the default RollingUpdate strategy. No canary, blue-green, or partition-based rollout mechanism exists. imagePullPolicy is configurable but no image tag pinning is enforced.

Gaps

  • No updateStrategy field in the CRD spec.
  • No partition-based canary rollout support.
  • No automated rollback on health check failure post-update.
  • No Argo Rollouts or Flagger integration.

Recommendations

(P1) Add an updateStrategy field to VolleyApplicationSpec CRD with partition-based canary support:

```rust
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
pub struct UpdateStrategySpec {
    /// Partition value for canary rollouts. Pods with ordinal >= partition
    /// get the new version. Set to replicas-1 to canary a single pod.
    #[serde(default)]
    pub partition: Option<i32>,
}
```
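Rendered into the StatefulSet, this maps onto the standard Kubernetes rolling-update partition: with 3 replicas and a partition of 2, only pod ordinal 2 receives the new version until the partition is lowered.

```yaml
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    partition: 2
```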

(P2) Provide an Argo Rollouts AnalysisTemplate example that checks volley_errors_total rate and volley_stage_latency_ms p99 after each rollout step.
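An illustrative sketch of such a template (the Prometheus address, threshold, and argument name are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: volley-error-ratio
spec:
  args:
    - name: job-name
  metrics:
    - name: error-ratio
      interval: 1m
      failureLimit: 1
      # Fail the rollout step if the error ratio exceeds the 99.9% SLO budget.
      successCondition: result[0] < 0.001
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(volley_errors_total{job_name="{{args.job-name}}"}[5m]))
            /
            sum(rate(volley_records_processed_total{job_name="{{args.job-name}}"}[5m]))
```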


Pillar 7: Incident Response Primitives — C+ (55%)

Current State

Health states: PipelineHealth enum provides 5 states — Starting, Running, Degraded (with reason string), Failed (with reason string), ShuttingDown. These map to K8s liveness (healthy = Starting|Running) and readiness (ready = Running only).

Health gauge: volley.pipeline.health gauge is emitted on every state transition via HealthReporter, enabling Prometheus alerting on health state changes. Values: 0=Starting, 1=Running, 2=Degraded, 3=Failed, 4=ShuttingDown.

Operator status: CRD status tracks phases (Pending, Running, Degraded, Failed) with readyReplicas count and human-readable message.

Probes: Liveness at /health (port 8080), readiness at /ready, with configurable initialDelaySeconds and periodSeconds.

Alerting rules: deploy/prometheus/alerting-rules.yaml includes VolleyPipelineFailed (critical, health==3 for 1m) and VolleyPipelineDegraded (warning, health==2 for 5m).

Remaining Gaps

  • No documentation mapping severity levels to PipelineHealth states and SLO impact.
  • No runbook links in alert annotations (placeholder URLs only).

Recommendations

(P1) Document severity-to-health mapping for pipeline authors:

| Severity | PipelineHealth | SLO Impact | Response |
|---|---|---|---|
| SEV1 | Failed | SLO breached, pipeline stopped | Immediate: investigate root cause, restore from checkpoint |
| SEV2 | Degraded | Error budget burning above normal | Urgent: identify degradation source (e.g., checkpoint failures) |
| SEV3 | Running with elevated latency | Within budget but trending | Investigate: check channel.utilization, watermark.lag_ms |

(P1) Add runbook URLs to the existing alerting rules in deploy/prometheus/alerting-rules.yaml, linking to operational documentation for each failure mode.


Prioritized Implementation Roadmap

P0 — Completed

  1. SLO recording rule templates (deploy/prometheus/recording-rules.yaml)
  2. Burn rate alerting rule templates (deploy/prometheus/alerting-rules.yaml)
  3. Structured JSON logging (ObservabilityConfig.log_format with LogFormat::Json)
  4. Multi-stage Dockerfile (deploy/docker/Dockerfile)
  5. Helm chart (deploy/helm/volley/)
  6. Pipeline health gauge (volley.pipeline.health, emitted by HealthReporter)
  7. Health state alerting rules (VolleyPipelineFailed, VolleyPipelineDegraded)

P0 — Remaining

  1. Golden signals dashboard template (Grafana JSON)
  2. Severity-to-PipelineHealth mapping documentation

P1 — Framework maturity

  1. Runtime error budget gauge metric in volley-observability
  2. SLO definition template format for per-pipeline targets (deploy/prometheus/slo-template.yaml)
  3. Trace-log correlation, trace_id in structured log output (volley-core/src/observability/trace_context.rs with TraceContextFormat)
  4. Fault injection integration tests for checkpoint and RocksDB failures (volley-core/tests/fault_injection_test.rs, 9 tests)
  5. Chaos Mesh experiment templates
  6. Capacity planning guide covering sizing, storage growth, and key PromQL (docs/src/operations/capacity-planning.md)
  7. updateStrategy CRD field for partition-based canary rollouts
  8. Runbook URLs in alerting rule annotations
  9. CD workflow for framework releases
  10. Production-hardened Helm values examples

P2 — Excellence

  1. Per-connector error counters (volley.source.errors, volley.sink.errors) — constants in volley-observability/src/metrics.rs, wired up in volley-core/src/builder/runtime.rs with connector_id() on Source/Sink traits
  2. Benchmark regression detection in CI
  3. Toxiproxy integration for Kafka fault injection
  4. Argo Rollouts AnalysisTemplate example
  5. Tested KEDA trigger threshold documentation
  6. Error budget policy template documentation

Appendix A: Full Metric Inventory

All metrics defined in volley-observability/src/metrics.rs:

| Constant | Metric Name | Type | Labels | SRE Use |
|---|---|---|---|---|
| SOURCE_RECORDS_POLLED | volley.source.records_polled | Counter | node_id, job_name, source_id | Traffic golden signal |
| RECORDS_PROCESSED | volley.records.processed | Counter | node_id, job_name, operator_id | Throughput, SLI denominator |
| STAGE_LATENCY_MS | volley.stage.latency_ms | Histogram | node_id, job_name, stage | Latency golden signal, SLI |
| CHECKPOINT_DURATION_MS | volley.checkpoint.duration_ms | Histogram | node_id, job_name | Checkpoint health |
| CHECKPOINT_EPOCH | volley.checkpoint.epoch | Gauge | node_id, job_name | Progress tracking |
| CHECKPOINT_FAILURES | volley.checkpoint.failures | Counter | node_id, job_name | Reliability, alerting |
| CHANNEL_UTILIZATION | volley.channel.utilization | Gauge | node_id, job_name, stage | Saturation golden signal |
| WATERMARK_LAG_MS | volley.watermark.lag_ms | Gauge | node_id, job_name | Event-time lag |
| SINK_RECORDS_WRITTEN | volley.sink.records_written | Counter | node_id, job_name, sink_id | Output throughput |
| SINK_FLUSH_DURATION_MS | volley.sink.flush.duration_ms | Histogram | node_id, job_name, sink_id | Sink latency |
| PIPELINE_UPTIME_SECONDS | volley.pipeline.uptime_seconds | Gauge | node_id, job_name | Availability |
| PIPELINE_HEALTH | volley.pipeline.health | Gauge | node_id, job_name | Health state (0-4), alerting on Failed/Degraded |
| ERRORS_TOTAL | volley.errors | Counter | node_id, job_name, error_type | Error golden signal, SLI numerator |
| KAFKA_CONSUMER_LAG | volley.kafka.consumer_lag | Gauge | node_id, job_name | Backlog, autoscaling trigger |
| KAFKA_COMMIT_DURATION_MS | volley.kafka.commit.duration_ms | Histogram | node_id, job_name | Kafka commit latency |
| KAFKA_COMMIT_FAILURES | volley.kafka.commit.failures | Counter | node_id, job_name | Exactly-once reliability |
| BLOB_OBJECTS_READ | volley.blob.objects_read | Counter | node_id, job_name | Blob source throughput |
| BLOB_BYTES_READ | volley.blob.bytes_read | Counter | node_id, job_name | Blob I/O volume |

Label constants: LABEL_NODE_ID, LABEL_JOB_NAME, LABEL_STAGE, LABEL_OPERATOR_ID, LABEL_SOURCE_ID, LABEL_SINK_ID, LABEL_ERROR_TYPE

Appendix B: Reference Prometheus Rules

See Pillar 1 for SLO recording rules and burn rate alerts, and Pillar 7 for health state alerting rules.

Appendix C: Golden Signals PromQL Reference

```promql
# Latency (p50, p99)
histogram_quantile(0.5, sum(rate(volley_stage_latency_ms_bucket[5m])) by (le, job_name))
histogram_quantile(0.99, sum(rate(volley_stage_latency_ms_bucket[5m])) by (le, job_name))

# Traffic (records/sec by pipeline)
sum(rate(volley_source_records_polled_total[5m])) by (job_name)

# Errors (error rate by type)
sum(rate(volley_errors_total[5m])) by (job_name, error_type)

# Error ratio
sum(rate(volley_errors_total[5m])) by (job_name)
/
sum(rate(volley_records_processed_total[5m])) by (job_name)

# Saturation (channel utilization, higher = more backpressure)
avg(volley_channel_utilization) by (job_name, stage)

# Saturation (Kafka consumer lag)
sum(volley_kafka_consumer_lag) by (job_name)
```