
Volley SRE Gap Analysis

Date: 2026-03-30
Reference framework: msitarzewski/agency-agents engineering-sre.md


Important Context: Framework vs. Service

Volley is a stream processing framework (library + Kubernetes operator), not a deployed service. This analysis evaluates:

  1. Framework enablement — How well does Volley equip pipeline authors to follow SRE practices?
  2. Framework development practices — How well does Volley’s own CI/testing/release process follow SRE principles?

Org-level concerns (on-call rotations, blameless culture documentation, MTTR tracking) are out of scope — those belong to teams operating Volley-based pipelines, not the framework itself.


Executive Summary

| Pillar | Score | Verdict |
|---|---|---|
| SLOs & Error Budgets | C+ (55%) | SLI-ready metrics + Prometheus recording/alerting rule templates + SLO definition template; no runtime error budget gauge yet |
| Observability | A- (88%) | Strong metrics + tracing + structured JSON logging + alerting rule templates + trace-log correlation + per-connector error counters; missing dashboards |
| Toil Reduction | B- (65%) | CI + K8s operator + Dockerfile + Helm chart; no CD pipeline or benchmark regression tracking |
| Chaos Engineering | C- (40%) | Checkpoint recovery unit-tested + fault injection integration tests for checkpoint/RocksDB failures; no Kafka fault injection |
| Capacity Planning | B- (65%) | KEDA autoscaling scaffolded, benchmarks exist, capacity planning guide with sizing tables and PromQL queries |
| Progressive Rollouts | F (5%) | Operator uses default StatefulSet RollingUpdate; no canary or automated rollback |
| Incident Response Primitives | C+ (55%) | PipelineHealth enum + K8s probes + volley.pipeline.health gauge + alerting rules; missing severity-to-SLO mapping docs |

Overall: Volley has a strong observability foundation, solid Kubernetes integration, SLO recording rules, alerting templates, structured logging with trace-log correlation, a Dockerfile, Helm chart, SLO definition template, fault injection tests, and capacity planning guide. Remaining gaps are dashboard templates, CD pipeline, progressive rollout support, and Kafka fault injection.


Scoring Methodology

| Grade | Range | Meaning |
|---|---|---|
| A | 90-100% | Production-ready, matches SRE framework recommendations |
| B | 70-89% | Strong foundation, minor gaps |
| C | 50-69% | Partial coverage, significant gaps |
| D | 25-49% | Minimal coverage, major gaps |
| F | 0-24% | Absent or negligible |

Criteria per pillar: API completeness, built-in tooling, documentation for pipeline authors, integration with ecosystem tooling (Prometheus, Grafana, K8s).


Pillar 1: SLOs & Error Budgets — C+ (55%)

Current State

Volley exposes metrics that serve as natural SLI candidates (volley-observability/src/metrics.rs):

| Metric | Type | SLI Use |
|---|---|---|
| volley.records.processed | Counter | Availability — total processed records |
| volley.errors | Counter | Availability — error count for SLI ratio |
| volley.stage.latency_ms | Histogram | Latency — processing duration distribution |
| volley.pipeline.uptime_seconds | Gauge | Availability — pipeline uptime |
| volley.checkpoint.failures | Counter | Reliability — checkpoint failure rate |

The PipelineHealth enum (volley-observability/src/health.rs) provides binary health states (Starting, Running, Degraded, Failed, ShuttingDown) but these are probe-level signals, not SLO-aligned measurements.

What Exists

Prometheus recording rules are provided at deploy/prometheus/recording-rules.yaml computing:

  • volley:availability:ratio_rate5m and volley:availability:ratio_rate30d — availability SLI ratios
  • volley:latency_sli:ratio_rate5m — latency SLI (fraction under 100ms)
  • volley:error_budget:remaining — remaining error budget against 99.9% target
  • volley:throughput:rate5m and volley:saturation:avg — capacity signals
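The expressions behind these rules follow from the metric inventory; an illustrative sketch (the shipped deploy/prometheus/recording-rules.yaml may differ in exact label handling):

```yaml
groups:
  - name: volley-slo
    rules:
      - record: volley:availability:ratio_rate5m
        expr: |
          1 - (
            sum(rate(volley_errors_total[5m])) by (job_name)
            /
            sum(rate(volley_records_processed_total[5m])) by (job_name)
          )
      # Remaining budget against the 99.9% target over the 30-day window.
      - record: volley:error_budget:remaining
        expr: |
          1 - (1 - volley:availability:ratio_rate30d) / (1 - 0.999)
```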

Multi-window burn rate alerting rules are provided at deploy/prometheus/alerting-rules.yaml:

  • VolleyHighErrorBurnRate (critical) — 14.4x burn over 1h confirmed by 5m window; at 14.4x, a 30-day error budget is exhausted in about 50 hours, and catching it within an hour costs roughly 2% of the budget
  • VolleySlowErrorBurnRate (warning) — 1x burn over 3d confirmed by 6h window; a steady 1x burn consumes exactly the full budget over the 30-day window

An SLO definition template (deploy/prometheus/slo-template.yaml) lets pipeline authors declare per-pipeline availability and latency targets consumed by these rules.

Remaining Gaps

  • No runtime error budget gauge metric in volley-observability (the recording rule computes it in Prometheus, but no native Rust gauge)
  • No documentation connecting error budget exhaustion to operational decisions (e.g., deployment freeze policy)

Recommendations

(P1) Add an error_budget_remaining gauge to volley-observability that pipeline authors can enable to expose real-time budget consumption natively from the runtime, complementing the Prometheus recording rule.

(P2) Document an error budget policy template: when budget < 25%, freeze non-critical deployments; when budget < 10%, enter incident response mode.
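The gauge's core computation is simple ratio arithmetic; a minimal sketch, assuming the gauge is fed from the existing volley.records.processed and volley.errors counts (the function name and signature are illustrative, not Volley's API):

```rust
/// Fraction of the availability error budget remaining, in [0, 1].
/// budget = 1 - slo_target (e.g. 0.001 for a 99.9% target).
fn error_budget_remaining(processed: u64, errors: u64, slo_target: f64) -> f64 {
    if processed == 0 {
        return 1.0; // no traffic yet: budget untouched
    }
    let error_ratio = errors as f64 / processed as f64;
    let budget = 1.0 - slo_target;
    // Clamp at zero once the budget is fully burned.
    ((budget - error_ratio) / budget).max(0.0)
}

fn main() {
    // 99.9% target with a 0.05% observed error rate: half the budget burned.
    let remaining = error_budget_remaining(1_000_000, 500, 0.999);
    assert!((remaining - 0.5).abs() < 1e-9);
}
```

In the runtime this value would be published through the existing set_gauge helper on each metrics tick.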
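A per-pipeline SLO declaration might look like this (hypothetical schema for illustration; the actual template format may differ):

```yaml
slo:
  pipeline: orders-enrichment   # illustrative pipeline name
  window: 30d
  availability:
    target: 0.999               # at most 0.1% failed records
  latency:
    threshold_ms: 100
    target: 0.99                # 99% of records processed under 100ms
```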


Pillar 2: Observability — A- (88%)

Current State

Metrics (volley-observability/src/metrics.rs): 15+ named metric constants covering sources, operators, checkpoints, channels, watermarks, sinks, pipeline health, Kafka, and blob storage. All use node_id/job_name labels. Three helper functions (increment_counter, set_gauge, record_histogram) enforce consistent labeling.

Tracing (volley-observability/src/tracing_init.rs): OpenTelemetry integration via OTLP gRPC export with tracing-opentelemetry bridge. Resource attributes include service.name and node.id. Graceful fallback if OTLP endpoint unavailable.

Health (volley-observability/src/health.rs): HealthReporter with atomic state tracking, liveness/readiness semantics. K8s probes configured in operator’s StatefulSet builder.

ServiceMonitor (volley-operator/src/crd.rs): Auto-created by operator with configurable scrape interval and labels.

What Exists

  • Structured JSON logging: ObservabilityConfig supports a log_format field with Text (default) and Json modes. When set to Json, tracing_subscriber::fmt::json() is used for structured output compatible with log aggregation systems (Loki, Elasticsearch, CloudWatch Logs). Pipeline authors enable it via .with_json_logging().
  • Alerting rule templates: Prometheus alerting rules are provided at deploy/prometheus/alerting-rules.yaml covering error burn rate (fast + slow), pipeline health state, checkpoint failures, channel saturation, Kafka consumer lag, and p99 latency.

What Was Added

  • Trace-log correlation: TraceContextFormat in volley-core/src/observability/trace_context.rs safely injects trace_id and span_id fields into JSON log output via serde_json round-trip when an OpenTelemetry span is active, enabling cross-referencing between logs and distributed traces.
  • Per-connector error counters: volley.source.errors and volley.sink.errors metrics defined in volley-observability/src/metrics.rs with source_id/sink_id labels, and incremented in volley-core/src/builder/runtime.rs on poll/write/flush failures. connector_id() method added to Source, Sink, DynSource, and DynSink traits, implemented across Kafka, blob store, and Iceberg connectors.
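With JSON logging and trace-log correlation both enabled, a log line would carry the trace context inline; an illustrative example (field names beyond trace_id/span_id are assumptions):

```json
{
  "timestamp": "2026-03-30T12:00:00Z",
  "level": "WARN",
  "message": "checkpoint retry scheduled",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "job_name": "orders-enrichment"
}
```

The trace_id can then be pasted into the tracing backend to pull up the corresponding distributed trace.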

Remaining Gaps

  • No dashboard templates: No Grafana dashboard JSON mapping the four golden signals to Volley metrics.

Recommendations

(P0) Provide a golden signals Grafana dashboard template using Volley’s actual metrics:

| Golden Signal | Volley Metric | PromQL |
|---|---|---|
| Latency | volley.stage.latency_ms | `histogram_quantile(0.99, rate(volley_stage_latency_ms_bucket[5m]))` |
| Traffic | volley.source.records_polled | `sum(rate(volley_source_records_polled_total[5m])) by (job_name)` |
| Errors | volley.errors | `sum(rate(volley_errors_total[5m])) by (job_name, error_type)` |
| Saturation | volley.channel.utilization | `avg(volley_channel_utilization) by (job_name, stage)` |

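A minimal skeleton for such a dashboard (heavily trimmed; a real Grafana export also carries datasource, layout, and schema metadata):

```json
{
  "title": "Volley Golden Signals",
  "panels": [
    {
      "title": "p99 stage latency (ms)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(volley_stage_latency_ms_bucket[5m])) by (le, job_name))"
        }
      ]
    },
    {
      "title": "Error rate by type",
      "type": "timeseries",
      "targets": [
        { "expr": "sum(rate(volley_errors_total[5m])) by (job_name, error_type)" }
      ]
    }
  ]
}
```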


Pillar 3: Toil Reduction — B- (65%)

Current State

CI pipeline (.github/workflows/ci.yml): Format checking (cargo fmt), linting (cargo clippy with default + kafka features), testing (default + kafka variants), benchmark compilation check. Uses Rust cache for faster builds.

K8s operator (volley-operator/): Automates StatefulSet, headless Service, metrics Service, ServiceMonitor, PodDisruptionBudget, and KEDA ScaledObject creation from a single VolleyApplication CRD. Server-side apply with idempotent reconciliation.

Checkpoint cleanup: RecoveryCoordinator handles automated old checkpoint cleanup.

What Exists

  • Dockerfile: A multi-stage Dockerfile is provided at deploy/docker/Dockerfile with a Rust builder stage and minimal distroless runtime stage. Pipeline authors can extend or customize it for their specific binaries.
  • Helm chart: A Helm chart exists at deploy/helm/volley/ that templates the VolleyApplication CRD with values.yaml covering all CRD fields (resources, storage, observability, health, autoscaling, Kafka, isolation).
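The described builder/distroless structure implies roughly the following shape (the binary name my-pipeline is illustrative; the shipped deploy/docker/Dockerfile may differ in base images and build flags):

```dockerfile
# Builder stage: compile the pipeline binary in release mode.
FROM rust:1.77 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release --bin my-pipeline

# Runtime stage: minimal distroless image with only the binary.
FROM gcr.io/distroless/cc-debian12
COPY --from=builder /app/target/release/my-pipeline /usr/local/bin/my-pipeline
ENTRYPOINT ["/usr/local/bin/my-pipeline"]
```

Pipeline authors would swap in their own binary name and any extra runtime assets.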

Remaining Gaps

  • Dockerfile not referenced in documentation: The README and operator docs do not mention the Dockerfile or describe how pipeline authors should customize it for their applications.
  • Helm chart has no production-hardening overlays: No example values files for common deployment topologies (small/medium/large, multi-AZ).
  • No CD pipeline: CI exists but no automated release workflow (crate publishing, image building).
  • No benchmark regression tracking: Benchmarks compile-checked but results not tracked over time.

Recommendations

(P1) Add a CD workflow (.github/workflows/release.yml) triggered on git tags that builds container images and publishes to a registry.
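One possible shape for such a workflow, using the standard Docker build actions (the registry and tag scheme are placeholders, not project decisions):

```yaml
name: release
on:
  push:
    tags: ["v*"]
jobs:
  image:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          file: deploy/docker/Dockerfile
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.ref_name }}
```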

(P1) Add production-hardened example values files to the Helm chart (e.g., values-production.yaml with resource limits, anti-affinity, KEDA autoscaling).

(P2) Add benchmark regression detection in CI using criterion with stored baselines.


Pillar 4: Chaos Engineering — C- (40%)

Current State

Checkpoint recovery is covered by unit tests: the HealthReporter transitions through Starting -> Running -> Degraded -> Failed states are verified. Fault injection integration tests (volley-core/tests/fault_injection_test.rs) exercise checkpoint and RocksDB failure paths. The operator handles reconciliation errors with retry backoff (15s requeue on failure).

Remaining Gaps

  • No Kafka fault injection tests (e.g., broker disconnect, network degradation).
  • No integration tests exercising the Degraded -> Running recovery path under realistic failure.
  • No Chaos Mesh or LitmusChaos experiment definitions for K8s-deployed pipelines.

Recommendations

(P1) Extend the fault injection suite to Kafka failures, such as a broker disconnect during an exactly-once transaction commit.
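A fault injection test of this kind is typically driven by a store that fails after N writes; a self-contained sketch (CheckpointStore and FlakyStore are illustrative names, not Volley's API):

```rust
use std::io::{Error, ErrorKind};

/// Minimal stand-in for a checkpoint persistence interface.
trait CheckpointStore {
    fn persist(&mut self, epoch: u64) -> std::io::Result<()>;
}

/// Store that succeeds for the first N writes, then fails every write.
struct FlakyStore {
    writes_before_failure: u32,
    writes: u32,
}

impl CheckpointStore for FlakyStore {
    fn persist(&mut self, _epoch: u64) -> std::io::Result<()> {
        self.writes += 1;
        if self.writes > self.writes_before_failure {
            Err(Error::new(ErrorKind::Other, "injected checkpoint failure"))
        } else {
            Ok(())
        }
    }
}

fn main() {
    let mut store = FlakyStore { writes_before_failure: 2, writes: 0 };
    assert!(store.persist(1).is_ok());
    assert!(store.persist(2).is_ok());
    // Third checkpoint hits the injected fault; a real test would then
    // assert the pipeline reports Degraded and recovers once it clears.
    assert!(store.persist(3).is_err());
}
```

The same pattern generalizes to injected Kafka commit failures by faulting the commit path instead of the persist path.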

(P1) Provide Chaos Mesh experiment templates for K8s-deployed Volley applications:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: volley-pod-kill
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app.kubernetes.io/managed-by: volley-operator
  scheduler:
    cron: "@every 30m"
```

(P2) Add toxiproxy integration to Kafka integration tests for network degradation simulation.


Pillar 5: Capacity Planning — B- (65%)

Current State

Autoscaling: KEDA support fully scaffolded in operator (AutoscalingSpec in CRD) with Kafka lag, Prometheus metric, and CPU-based triggers. ScaledObject is created during reconciliation.

Benchmarks: benchmark_throughput and benchmark_sustained examples demonstrate 9.4M rec/s sustained on a single node.

Resource management: CRD requires explicit resources.requests and resources.limits for CPU/memory.

Remaining Gaps

  • No tested KEDA trigger threshold examples.
  • No resource utilization dashboards.

What Exists

The capacity planning guide (docs/src/operations/capacity-planning.md) covers:

  • CPU/memory requirements per records/sec (based on benchmark data)
  • RocksDB storage growth formula: key_count * avg_value_size * checkpoint_retention_count
  • PVC sizing guidance based on state size and checkpoint interval
  • Key PromQL queries for capacity monitoring:
    • Throughput: rate(volley_source_records_polled_total[5m])
    • Backpressure: volley_channel_utilization > 0.8
    • Input backlog: volley_kafka_consumer_lag

Recommendations

(P2) Document tested KEDA trigger configurations with proven thresholds (e.g., scale at consumer lag > 10000).
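The storage growth formula reduces to straightforward arithmetic; a worked example with illustrative numbers (not measured Volley figures):

```rust
/// Estimated RocksDB footprint per the guide's formula:
/// key_count * avg_value_size * checkpoint_retention_count.
fn rocksdb_storage_bytes(key_count: u64, avg_value_size: u64, retention: u64) -> u64 {
    key_count * avg_value_size * retention
}

fn main() {
    // 10M keys x 512 B values x 3 retained checkpoints.
    let bytes = rocksdb_storage_bytes(10_000_000, 512, 3);
    assert_eq!(bytes, 15_360_000_000);
    // Roughly 15.4 GB (decimal); size the PVC with headroom above this.
    println!("{:.1} GB", bytes as f64 / 1e9);
}
```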


Pillar 6: Progressive Rollouts — F (5%)

Current State

The operator creates StatefulSets with a configurable replicas count but uses the default RollingUpdate strategy. No canary, blue-green, or partition-based rollout mechanism exists. imagePullPolicy is configurable but no image tag pinning is enforced.

Gaps

  • No updateStrategy field in the CRD spec.
  • No partition-based canary rollout support.
  • No automated rollback on health check failure post-update.
  • No Argo Rollouts or Flagger integration.

Recommendations

(P1) Add an updateStrategy field to VolleyApplicationSpec CRD with partition-based canary support:

```rust
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
pub struct UpdateStrategySpec {
    /// Partition value for canary rollouts. Pods with ordinal >= partition
    /// get the new version. Set to replicas-1 to canary a single pod.
    #[serde(default)]
    pub partition: Option<i32>,
}
```
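Rendered into the StatefulSet, this maps onto the standard Kubernetes rolling-update partition: with 3 replicas and a partition of 2, only pod ordinal 2 receives the new version until the partition is lowered.

```yaml
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    partition: 2
```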

(P2) Provide an Argo Rollouts AnalysisTemplate example that checks volley_errors_total rate and volley_stage_latency_ms p99 after each rollout step.
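An illustrative sketch of such a template (the Prometheus address, threshold, and argument name are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: volley-error-ratio
spec:
  args:
    - name: job-name
  metrics:
    - name: error-ratio
      interval: 1m
      failureLimit: 1
      # Fail the rollout step if the error ratio exceeds the 99.9% SLO budget.
      successCondition: result[0] < 0.001
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(volley_errors_total{job_name="{{args.job-name}}"}[5m]))
            /
            sum(rate(volley_records_processed_total{job_name="{{args.job-name}}"}[5m]))
```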


Pillar 7: Incident Response Primitives — C+ (55%)

Current State

Health states: PipelineHealth enum provides 5 states — Starting, Running, Degraded (with reason string), Failed (with reason string), ShuttingDown. These map to K8s liveness (healthy = Starting|Running) and readiness (ready = Running only).

Health gauge: volley.pipeline.health gauge is emitted on every state transition via HealthReporter, enabling Prometheus alerting on health state changes. Values: 0=Starting, 1=Running, 2=Degraded, 3=Failed, 4=ShuttingDown.

Operator status: CRD status tracks phases (Pending, Running, Degraded, Failed) with readyReplicas count and human-readable message.

Probes: Liveness at /health (port 8080), readiness at /ready, with configurable initialDelaySeconds and periodSeconds.

Alerting rules: deploy/prometheus/alerting-rules.yaml includes VolleyPipelineFailed (critical, health==3 for 1m) and VolleyPipelineDegraded (warning, health==2 for 5m).

Remaining Gaps

  • No documentation mapping severity levels to PipelineHealth states and SLO impact.
  • No runbook links in alert annotations (placeholder URLs only).

Recommendations

(P1) Document severity-to-health mapping for pipeline authors:

| Severity | PipelineHealth | SLO Impact | Response |
|---|---|---|---|
| SEV1 | Failed | SLO breached, pipeline stopped | Immediate: investigate root cause, restore from checkpoint |
| SEV2 | Degraded | Error budget burning above normal | Urgent: identify degradation source (e.g., checkpoint failures) |
| SEV3 | Running with elevated latency | Within budget but trending | Investigate: check channel.utilization, watermark.lag_ms |

(P1) Add runbook URLs to the existing alerting rules in deploy/prometheus/alerting-rules.yaml, linking to operational documentation for each failure mode.


Prioritized Implementation Roadmap

P0 — Completed

  1. SLO recording rule templates (deploy/prometheus/recording-rules.yaml)
  2. Burn rate alerting rule templates (deploy/prometheus/alerting-rules.yaml)
  3. Structured JSON logging (ObservabilityConfig.log_format with LogFormat::Json)
  4. Multi-stage Dockerfile (deploy/docker/Dockerfile)
  5. Helm chart (deploy/helm/volley/)
  6. Pipeline health gauge (volley.pipeline.health, emitted by HealthReporter)
  7. Health state alerting rules (VolleyPipelineFailed, VolleyPipelineDegraded)

P0 — Remaining

  1. Golden signals dashboard template (Grafana JSON)
  2. Severity-to-PipelineHealth mapping documentation

P1 — Framework maturity

  1. Runtime error budget gauge metric in volley-observability
  2. SLO definition template format for per-pipeline targets (deploy/prometheus/slo-template.yaml)
  3. Trace-log correlation, trace_id in structured log output (volley-core/src/observability/trace_context.rs with TraceContextFormat)
  4. Fault injection integration tests for checkpoint and RocksDB failures (volley-core/tests/fault_injection_test.rs, 9 tests)
  5. Chaos Mesh experiment templates
  6. Capacity planning guide covering sizing, storage growth, and key PromQL (docs/src/operations/capacity-planning.md)
  7. updateStrategy CRD field for partition-based canary rollouts
  8. Runbook URLs in alerting rule annotations
  9. CD workflow for framework releases
  10. Production-hardened Helm values examples

P2 — Excellence

  1. Per-connector error counters (volley.source.errors, volley.sink.errors) — constants in volley-observability/src/metrics.rs, wired up in volley-core/src/builder/runtime.rs with connector_id() on Source/Sink traits
  2. Benchmark regression detection in CI
  3. Toxiproxy integration for Kafka fault injection
  4. Argo Rollouts AnalysisTemplate example
  5. Tested KEDA trigger threshold documentation
  6. Error budget policy template documentation

Appendix A: Full Metric Inventory

All metrics defined in volley-observability/src/metrics.rs:

| Constant | Metric Name | Type | Labels | SRE Use |
|---|---|---|---|---|
| SOURCE_RECORDS_POLLED | volley.source.records_polled | Counter | node_id, job_name, source_id | Traffic golden signal |
| RECORDS_PROCESSED | volley.records.processed | Counter | node_id, job_name, operator_id | Throughput, SLI denominator |
| STAGE_LATENCY_MS | volley.stage.latency_ms | Histogram | node_id, job_name, stage | Latency golden signal, SLI |
| CHECKPOINT_DURATION_MS | volley.checkpoint.duration_ms | Histogram | node_id, job_name | Checkpoint health |
| CHECKPOINT_EPOCH | volley.checkpoint.epoch | Gauge | node_id, job_name | Progress tracking |
| CHECKPOINT_FAILURES | volley.checkpoint.failures | Counter | node_id, job_name | Reliability, alerting |
| CHANNEL_UTILIZATION | volley.channel.utilization | Gauge | node_id, job_name, stage | Saturation golden signal |
| WATERMARK_LAG_MS | volley.watermark.lag_ms | Gauge | node_id, job_name | Event-time lag |
| SINK_RECORDS_WRITTEN | volley.sink.records_written | Counter | node_id, job_name, sink_id | Output throughput |
| SINK_FLUSH_DURATION_MS | volley.sink.flush.duration_ms | Histogram | node_id, job_name, sink_id | Sink latency |
| PIPELINE_UPTIME_SECONDS | volley.pipeline.uptime_seconds | Gauge | node_id, job_name | Availability |
| PIPELINE_HEALTH | volley.pipeline.health | Gauge | node_id, job_name | Health state (0-4), alerting on Failed/Degraded |
| ERRORS_TOTAL | volley.errors | Counter | node_id, job_name, error_type | Error golden signal, SLI numerator |
| KAFKA_CONSUMER_LAG | volley.kafka.consumer_lag | Gauge | node_id, job_name | Backlog, autoscaling trigger |
| KAFKA_COMMIT_DURATION_MS | volley.kafka.commit.duration_ms | Histogram | node_id, job_name | Kafka commit latency |
| KAFKA_COMMIT_FAILURES | volley.kafka.commit.failures | Counter | node_id, job_name | Exactly-once reliability |
| BLOB_OBJECTS_READ | volley.blob.objects_read | Counter | node_id, job_name | Blob source throughput |
| BLOB_BYTES_READ | volley.blob.bytes_read | Counter | node_id, job_name | Blob I/O volume |

Label constants: LABEL_NODE_ID, LABEL_JOB_NAME, LABEL_STAGE, LABEL_OPERATOR_ID, LABEL_SOURCE_ID, LABEL_SINK_ID, LABEL_ERROR_TYPE

Appendix B: Reference Prometheus Rules

See Pillar 1 for SLO recording rules and burn rate alerts, and Pillar 7 for health state alerting rules.

Appendix C: Golden Signals PromQL Reference

```promql
# Latency (p50, p99)
histogram_quantile(0.5, sum(rate(volley_stage_latency_ms_bucket[5m])) by (le, job_name))
histogram_quantile(0.99, sum(rate(volley_stage_latency_ms_bucket[5m])) by (le, job_name))

# Traffic (records/sec by pipeline)
sum(rate(volley_source_records_polled_total[5m])) by (job_name)

# Errors (error rate by type)
sum(rate(volley_errors_total[5m])) by (job_name, error_type)

# Error ratio
sum(rate(volley_errors_total[5m])) by (job_name)
/
sum(rate(volley_records_processed_total[5m])) by (job_name)

# Saturation (channel utilization, higher = more backpressure)
avg(volley_channel_utilization) by (job_name, stage)

# Saturation (Kafka consumer lag)
sum(volley_kafka_consumer_lag) by (job_name)
```