Volley SRE Gap Analysis
Date: 2026-03-30
Reference framework: msitarzewski/agency-agents engineering-sre.md
Important Context: Framework vs. Service
Volley is a stream processing framework (library + Kubernetes operator), not a deployed service. This analysis evaluates:
- Framework enablement — How well does Volley equip pipeline authors to follow SRE practices?
- Framework development practices — How well does Volley’s own CI/testing/release process follow SRE principles?
Org-level concerns (on-call rotations, blameless culture documentation, MTTR tracking) are out of scope — those belong to teams operating Volley-based pipelines, not the framework itself.
Executive Summary
| Pillar | Score | Verdict |
|---|---|---|
| SLOs & Error Budgets | C+ (55%) | SLI-ready metrics + Prometheus recording/alerting rule templates + SLO definition template; no runtime error budget gauge yet |
| Observability | A- (88%) | Strong metrics + tracing + structured JSON logging + alerting rule templates + trace-log correlation + per-connector error counters; missing dashboards |
| Toil Reduction | B- (65%) | CI + K8s operator + Dockerfile + Helm chart; no CD pipeline or benchmark regression tracking |
| Chaos Engineering | C- (40%) | Checkpoint recovery unit-tested + fault injection integration tests for checkpoint/RocksDB failures; no Kafka fault injection |
| Capacity Planning | B- (65%) | KEDA autoscaling scaffolded, benchmarks exist, capacity planning guide with sizing tables and PromQL queries |
| Progressive Rollouts | F (5%) | Operator uses default StatefulSet RollingUpdate; no canary or automated rollback |
| Incident Response Primitives | C+ (55%) | PipelineHealth enum + K8s probes + volley.pipeline.health gauge + alerting rules; missing severity-to-SLO mapping docs |
Overall: Volley has a strong observability foundation, solid Kubernetes integration, SLO recording rules, alerting templates, structured logging with trace-log correlation, a Dockerfile, Helm chart, SLO definition template, fault injection tests, and capacity planning guide. Remaining gaps are dashboard templates, CD pipeline, progressive rollout support, and Kafka fault injection.
Scoring Methodology
| Grade | Range | Meaning |
|---|---|---|
| A | 90-100% | Production-ready, matches SRE framework recommendations |
| B | 70-89% | Strong foundation, minor gaps |
| C | 50-69% | Partial coverage, significant gaps |
| D | 25-49% | Minimal coverage, major gaps |
| F | 0-24% | Absent or negligible |
Criteria per pillar: API completeness, built-in tooling, documentation for pipeline authors, integration with ecosystem tooling (Prometheus, Grafana, K8s).
Pillar 1: SLOs & Error Budgets — C+ (55%)
Current State
Volley exposes metrics that serve as natural SLI candidates (volley-observability/src/metrics.rs):
| Metric | Type | SLI Use |
|---|---|---|
| volley.records.processed | Counter | Availability — total processed records |
| volley.errors | Counter | Availability — error count for SLI ratio |
| volley.stage.latency_ms | Histogram | Latency — processing duration distribution |
| volley.pipeline.uptime_seconds | Gauge | Availability — pipeline uptime |
| volley.checkpoint.failures | Counter | Reliability — checkpoint failure rate |
The PipelineHealth enum (volley-observability/src/health.rs) provides coarse health states (Starting, Running, Degraded, Failed, ShuttingDown), but these are probe-level signals, not SLO-aligned measurements.
What Exists
Prometheus recording rules are provided at deploy/prometheus/recording-rules.yaml computing:
- `volley:availability:ratio_rate5m` and `volley:availability:ratio_rate30d` — availability SLI ratios
- `volley:latency_sli:ratio_rate5m` — latency SLI (fraction under 100ms)
- `volley:error_budget:remaining` — remaining error budget against a 99.9% target
- `volley:throughput:rate5m` and `volley:saturation:avg` — capacity signals
Multi-window burn rate alerting rules are provided at deploy/prometheus/alerting-rules.yaml:
- `VolleyHighErrorBurnRate` (critical) — 14.4x burn rate over 1h, confirmed by a 5m window
- `VolleySlowErrorBurnRate` (warning) — 1x burn rate over 3d, confirmed by a 6h window

An SLO definition template is provided at deploy/prometheus/slo-template.yaml, giving pipeline authors a YAML format for declaring per-pipeline availability and latency targets that the Prometheus rules consume.
Remaining Gaps
- No runtime error budget gauge metric in `volley-observability` (the recording rule computes it in Prometheus, but there is no native Rust gauge)
- No documentation connecting error budget exhaustion to operational decisions (e.g., a deployment freeze policy)
Recommendations
(P1) Add an error_budget_remaining gauge to volley-observability that pipeline authors can enable to expose real-time budget consumption natively from the runtime, complementing the Prometheus recording rule.
(P2) Document an error budget policy template: when the remaining budget drops below 25%, freeze non-critical deployments; below 10%, enter incident response mode.
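The budget arithmetic behind the `volley:error_budget:remaining` recording rule and the proposed runtime gauge is simple enough to sketch. The function below is illustrative only — the name and signature are hypothetical, not part of the volley-observability API:

```rust
/// Fraction of the error budget still unspent, given an observed success
/// ratio and an SLO target (e.g. 0.999 for a 99.9% availability SLO).
/// Illustrative sketch — not part of volley-observability.
fn error_budget_remaining(success_ratio: f64, slo_target: f64) -> f64 {
    let allowed_error = 1.0 - slo_target;     // e.g. 0.001 for 99.9%
    let observed_error = 1.0 - success_ratio; // observed error ratio
    1.0 - observed_error / allowed_error
}
```

Against a 99.9% target, a pipeline observed at 99.95% availability has consumed half of its allowed errors, so the function returns 0.5; a result at or below zero means the budget is exhausted.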
Pillar 2: Observability — A- (88%)
Current State
Metrics (volley-observability/src/metrics.rs): 15+ named metric constants covering sources, operators, checkpoints, channels, watermarks, sinks, pipeline health, Kafka, and blob storage. All use node_id/job_name labels. Three helper functions (increment_counter, set_gauge, record_histogram) enforce consistent labeling.
Tracing (volley-observability/src/tracing_init.rs): OpenTelemetry integration via OTLP gRPC export with tracing-opentelemetry bridge. Resource attributes include service.name and node.id. Graceful fallback if OTLP endpoint unavailable.
Health (volley-observability/src/health.rs): HealthReporter with atomic state tracking, liveness/readiness semantics. K8s probes configured in operator’s StatefulSet builder.
ServiceMonitor (volley-operator/src/crd.rs): Auto-created by operator with configurable scrape interval and labels.
What Exists
- Structured JSON logging: `ObservabilityConfig` supports a `log_format` field with `Text` (default) and `Json` modes. When set to `Json`, `tracing_subscriber::fmt::json()` is used for structured output compatible with log aggregation systems (Loki, Elasticsearch, CloudWatch Logs). Pipeline authors enable it via `.with_json_logging()`.
- Alerting rule templates: Prometheus alerting rules are provided at `deploy/prometheus/alerting-rules.yaml` covering error burn rate (fast + slow), pipeline health state, checkpoint failures, channel saturation, Kafka consumer lag, and p99 latency.
What Was Added
- Trace-log correlation: `TraceContextFormat` in `volley-core/src/observability/trace_context.rs` safely injects `trace_id` and `span_id` fields into JSON log output via a `serde_json` round-trip whenever an OpenTelemetry span is active, enabling cross-referencing between logs and distributed traces.
- Per-connector error counters: `volley.source.errors` and `volley.sink.errors` are defined in `volley-observability/src/metrics.rs` with `source_id`/`sink_id` labels and incremented in `volley-core/src/builder/runtime.rs` on poll/write/flush failures. A `connector_id()` method was added to the `Source`, `Sink`, `DynSource`, and `DynSink` traits and implemented across the Kafka, blob store, and Iceberg connectors.
Remaining Gaps
- No dashboard templates: No Grafana dashboard JSON mapping the four golden signals to Volley metrics.
Recommendations
(P0) Provide a golden signals Grafana dashboard template using Volley’s actual metrics:
| Golden Signal | Volley Metric | PromQL |
|---|---|---|
| Latency | volley.stage.latency_ms | histogram_quantile(0.99, rate(volley_stage_latency_ms_bucket[5m])) |
| Traffic | volley.source.records_polled | sum(rate(volley_source_records_polled_total[5m])) by (job_name) |
| Errors | volley.errors | sum(rate(volley_errors_total[5m])) by (job_name, error_type) |
| Saturation | volley.channel.utilization | avg(volley_channel_utilization) by (job_name, stage) |
Pillar 3: Toil Reduction — B- (65%)
Current State
CI pipeline (.github/workflows/ci.yml): Format checking (cargo fmt), linting (cargo clippy with default + kafka features), testing (default + kafka variants), benchmark compilation check. Uses Rust cache for faster builds.
K8s operator (volley-operator/): Automates StatefulSet, headless Service, metrics Service, ServiceMonitor, PodDisruptionBudget, and KEDA ScaledObject creation from a single VolleyApplication CRD. Server-side apply with idempotent reconciliation.
Checkpoint cleanup: RecoveryCoordinator handles automated old checkpoint cleanup.
What Exists
- Dockerfile: A multi-stage Dockerfile is provided at `deploy/docker/Dockerfile` with a Rust builder stage and a minimal distroless runtime stage. Pipeline authors can extend or customize it for their specific binaries.
- Helm chart: A Helm chart exists at `deploy/helm/volley/` that templates the `VolleyApplication` CRD, with `values.yaml` covering all CRD fields (resources, storage, observability, health, autoscaling, Kafka, isolation).
Remaining Gaps
- Dockerfile not referenced in documentation: The README and operator docs do not mention the Dockerfile or describe how pipeline authors should customize it for their applications.
- Helm chart has no production-hardening overlays: No example values files for common deployment topologies (small/medium/large, multi-AZ).
- No CD pipeline: CI exists but no automated release workflow (crate publishing, image building).
- No benchmark regression tracking: Benchmarks compile-checked but results not tracked over time.
Recommendations
(P1) Add a CD workflow (.github/workflows/release.yml) triggered on git tags that builds container images and publishes to a registry.
(P1) Add production-hardened example values files to the Helm chart (e.g., values-production.yaml with resource limits, anti-affinity, KEDA autoscaling).
(P2) Add benchmark regression detection in CI using criterion with stored baselines.
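A minimal sketch of the recommended release workflow, assuming images are published to GHCR via the shipped Dockerfile (the registry, tag scheme, and job layout are placeholders, not an existing workflow):

```yaml
# .github/workflows/release.yml — illustrative sketch
name: release
on:
  push:
    tags: ["v*"]
jobs:
  publish-image:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          file: deploy/docker/Dockerfile
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.ref_name }}
```

Crate publishing (`cargo publish`) could be added as a second job gated on the image build succeeding.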
Pillar 4: Chaos Engineering — C- (40%)
Current State
Checkpoint recovery is unit-tested, and fault injection integration tests (volley-core/tests/fault_injection_test.rs) exercise checkpoint and RocksDB failure paths, such as the checkpoint directory becoming read-only mid-pipeline and RocksDB write failures during state flush. The HealthReporter transitions through Starting -> Running -> Degraded -> Failed are verified, and the operator handles reconciliation errors with retry backoff (15s requeue on failure).
Remaining Gaps
- No Kafka fault injection: no tests simulating broker disconnect during an exactly-once transaction commit.
- No fault injection for disk-full or OOM conditions.
- No Chaos Mesh or LitmusChaos experiment definitions for K8s-deployed pipelines.
Recommendations
(P1) Add Kafka fault injection integration tests covering broker disconnect during exactly-once transaction commits.
(P1) Provide Chaos Mesh experiment templates for K8s-deployed Volley applications:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: volley-pod-kill
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app.kubernetes.io/managed-by: volley-operator
  scheduler:
    cron: "@every 30m"
(P2) Add toxiproxy integration to Kafka integration tests for network degradation simulation.
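Kafka network degradation can also be simulated at the cluster level with Chaos Mesh, complementing in-test toxiproxy. A hedged sketch — the selector, latency values, and duration are illustrative:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: volley-kafka-latency
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      app.kubernetes.io/managed-by: volley-operator
  delay:
    latency: "200ms"   # added one-way latency on pod egress
    jitter: "50ms"
  duration: "5m"
```

Running this while watching `volley.kafka.commit.duration_ms` and `volley.kafka.consumer_lag` would verify that degraded networking surfaces in the expected metrics.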
Pillar 5: Capacity Planning — B- (65%)
Current State
Autoscaling: KEDA support fully scaffolded in operator (AutoscalingSpec in CRD) with Kafka lag, Prometheus metric, and CPU-based triggers. ScaledObject is created during reconciliation.
Benchmarks: benchmark_throughput and benchmark_sustained examples demonstrate 9.4M rec/s sustained on a single node.
Resource management: CRD requires explicit resources.requests and resources.limits for CPU/memory.
What Exists
- Capacity planning guide: `docs/src/operations/capacity-planning.md` provides sizing tables mapping CPU/memory allocation to expected records/sec (based on benchmark data), a RocksDB storage growth formula (`key_count * avg_value_size * checkpoint_retention_count`), PVC sizing guidance based on state size and checkpoint interval, and key PromQL queries for capacity monitoring — throughput (`rate(volley_source_records_polled_total[5m])`), backpressure (`volley_channel_utilization > 0.8`), and input backlog (`volley_kafka_consumer_lag`).
Remaining Gaps
- No tested KEDA trigger threshold examples.
- No resource utilization dashboards.
Recommendations
(P2) Document tested KEDA trigger configurations with proven thresholds (e.g., scale out at consumer lag > 10000).
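Since the operator already generates the ScaledObject from AutoscalingSpec, the documentation only needs to show the thresholds it should render. An illustrative target (broker address, consumer group, topic, and replica bounds are placeholders):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-pipeline-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: my-pipeline        # StatefulSet created by the operator
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: my-pipeline
        topic: events
        lagThreshold: "10000"   # scale out when per-partition lag exceeds 10000
```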
Pillar 6: Progressive Rollouts — F (5%)
Current State
The operator creates StatefulSets with a configurable replicas count but uses the default RollingUpdate strategy. No canary, blue-green, or partition-based rollout mechanism exists. imagePullPolicy is configurable but no image tag pinning is enforced.
Gaps
- No `updateStrategy` field in the CRD spec.
- No partition-based canary rollout support.
- No automated rollback on health check failure post-update.
- No Argo Rollouts or Flagger integration.
Recommendations
(P1) Add an updateStrategy field to VolleyApplicationSpec CRD with partition-based canary support:
#[derive(Debug, Clone, Serialize, Deserialize, JsonSchema)]
pub struct UpdateStrategySpec {
    /// Partition value for canary rollouts. Pods with ordinal >= partition
    /// receive the new version. Set to replicas - 1 to canary a single pod.
    #[serde(default)]
    pub partition: Option<i32>,
}
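For context, this maps directly onto the standard StatefulSet update strategy the operator would render. With `partition: 2` on a 3-replica pipeline, only the pod with ordinal 2 receives the new revision:

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2   # pods with ordinal >= 2 are updated; ordinals 0-1 stay on the old revision
```

Promoting the canary is then a matter of lowering the partition to 0 once the health gauge and error rate look clean.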
(P2) Provide an Argo Rollouts AnalysisTemplate example that checks volley_errors_total rate and volley_stage_latency_ms p99 after each rollout step.
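A hedged sketch of such an AnalysisTemplate — the Prometheus address, thresholds, and template name are placeholders, not shipped artifacts:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: volley-canary-analysis
spec:
  args:
    - name: job-name
  metrics:
    - name: error-ratio
      interval: 1m
      failureLimit: 1
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(volley_errors_total{job_name="{{args.job-name}}"}[5m]))
            / sum(rate(volley_records_processed_total{job_name="{{args.job-name}}"}[5m]))
    - name: p99-latency-ms
      interval: 1m
      failureLimit: 1
      successCondition: result[0] < 100
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99, sum(rate(volley_stage_latency_ms_bucket{job_name="{{args.job-name}}"}[5m])) by (le))
```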
Pillar 7: Incident Response Primitives — C+ (55%)
Current State
Health states: PipelineHealth enum provides 5 states — Starting, Running, Degraded (with reason string), Failed (with reason string), ShuttingDown. These map to K8s liveness (healthy = Starting|Running) and readiness (ready = Running only).
Health gauge: volley.pipeline.health gauge is emitted on every state transition via HealthReporter, enabling Prometheus alerting on health state changes. Values: 0=Starting, 1=Running, 2=Degraded, 3=Failed, 4=ShuttingDown.
Operator status: CRD status tracks phases (Pending, Running, Degraded, Failed) with readyReplicas count and human-readable message.
Probes: Liveness at /health (port 8080), readiness at /ready, with configurable initialDelaySeconds and periodSeconds.
Alerting rules: deploy/prometheus/alerting-rules.yaml includes VolleyPipelineFailed (critical, health==3 for 1m) and VolleyPipelineDegraded (warning, health==2 for 5m).
Remaining Gaps
- No documentation mapping severity levels to PipelineHealth states and SLO impact.
- No runbook links in alert annotations (placeholder URLs only).
Recommendations
(P1) Document severity-to-health mapping for pipeline authors:
| Severity | PipelineHealth | SLO Impact | Response |
|---|---|---|---|
| SEV1 | Failed | SLO breached, pipeline stopped | Immediate: investigate root cause, restore from checkpoint |
| SEV2 | Degraded | Error budget burning above normal | Urgent: identify degradation source (e.g., checkpoint failures) |
| SEV3 | Running with elevated latency | Within budget but trending | Investigate: check channel.utilization, watermark.lag_ms |
(P1) Add runbook URLs to the existing alerting rules in deploy/prometheus/alerting-rules.yaml, linking to operational documentation for each failure mode.
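For example, the existing VolleyPipelineFailed rule could carry the link in its annotations (the URL is a placeholder for the operating team's own documentation):

```yaml
- alert: VolleyPipelineFailed
  expr: volley_pipeline_health == 3
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Volley pipeline {{ $labels.job_name }} is in the Failed state"
    runbook_url: https://example.com/runbooks/volley-pipeline-failed
```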
Prioritized Implementation Roadmap
P0 — Completed
- SLO recording rule templates — `deploy/prometheus/recording-rules.yaml`
- Burn rate alerting rule templates — `deploy/prometheus/alerting-rules.yaml`
- Structured JSON logging — `ObservabilityConfig.log_format` with `LogFormat::Json`
- Multi-stage Dockerfile — `deploy/docker/Dockerfile`
- Helm chart — `deploy/helm/volley/`
- Pipeline health gauge — `volley.pipeline.health` emitted by `HealthReporter`
- Health state alerting rules — `VolleyPipelineFailed`, `VolleyPipelineDegraded` in alerting rules
P0 — Remaining
- Golden signals dashboard template (Grafana JSON)
- Severity-to-PipelineHealth mapping documentation
P1 — Completed
- SLO definition template format for per-pipeline targets — `deploy/prometheus/slo-template.yaml`
- Trace-log correlation (`trace_id` in structured log output) — `volley-core/src/observability/trace_context.rs` with `TraceContextFormat`
- Fault injection integration tests (checkpoint, RocksDB failures) — `volley-core/tests/fault_injection_test.rs` (9 tests)
- Capacity planning guide (sizing, storage growth, key PromQL) — `docs/src/operations/capacity-planning.md`
P1 — Remaining
- Runtime error budget gauge metric in `volley-observability`
- Chaos Mesh experiment templates
- `updateStrategy` CRD field for partition-based canary rollouts
- Runbook URLs in alerting rule annotations
- CD workflow for framework releases
- Production-hardened Helm values examples
P2 — Completed
- Per-connector error counters (`volley.source.errors`, `volley.sink.errors`) — constants in `volley-observability/src/metrics.rs`, wired up in `volley-core/src/builder/runtime.rs` with `connector_id()` on the Source/Sink traits
P2 — Remaining
- Benchmark regression detection in CI
- Toxiproxy integration for Kafka fault injection
- Argo Rollouts AnalysisTemplate example
- Tested KEDA trigger threshold documentation
- Error budget policy template documentation
Appendix A: Full Metric Inventory
All metrics defined in volley-observability/src/metrics.rs:
| Constant | Metric Name | Type | Labels | SRE Use |
|---|---|---|---|---|
| SOURCE_RECORDS_POLLED | volley.source.records_polled | Counter | node_id, job_name, source_id | Traffic golden signal |
| RECORDS_PROCESSED | volley.records.processed | Counter | node_id, job_name, operator_id | Throughput, SLI denominator |
| STAGE_LATENCY_MS | volley.stage.latency_ms | Histogram | node_id, job_name, stage | Latency golden signal, SLI |
| CHECKPOINT_DURATION_MS | volley.checkpoint.duration_ms | Histogram | node_id, job_name | Checkpoint health |
| CHECKPOINT_EPOCH | volley.checkpoint.epoch | Gauge | node_id, job_name | Progress tracking |
| CHECKPOINT_FAILURES | volley.checkpoint.failures | Counter | node_id, job_name | Reliability, alerting |
| CHANNEL_UTILIZATION | volley.channel.utilization | Gauge | node_id, job_name, stage | Saturation golden signal |
| WATERMARK_LAG_MS | volley.watermark.lag_ms | Gauge | node_id, job_name | Event-time lag |
| SINK_RECORDS_WRITTEN | volley.sink.records_written | Counter | node_id, job_name, sink_id | Output throughput |
| SINK_FLUSH_DURATION_MS | volley.sink.flush.duration_ms | Histogram | node_id, job_name, sink_id | Sink latency |
| PIPELINE_UPTIME_SECONDS | volley.pipeline.uptime_seconds | Gauge | node_id, job_name | Availability |
| PIPELINE_HEALTH | volley.pipeline.health | Gauge | node_id, job_name | Health state (0-4), alerting on Failed/Degraded |
| ERRORS_TOTAL | volley.errors | Counter | node_id, job_name, error_type | Error golden signal, SLI numerator |
| KAFKA_CONSUMER_LAG | volley.kafka.consumer_lag | Gauge | node_id, job_name | Backlog, autoscaling trigger |
| KAFKA_COMMIT_DURATION_MS | volley.kafka.commit.duration_ms | Histogram | node_id, job_name | Kafka commit latency |
| KAFKA_COMMIT_FAILURES | volley.kafka.commit.failures | Counter | node_id, job_name | Exactly-once reliability |
| BLOB_OBJECTS_READ | volley.blob.objects_read | Counter | node_id, job_name | Blob source throughput |
| BLOB_BYTES_READ | volley.blob.bytes_read | Counter | node_id, job_name | Blob I/O volume |
Label constants: LABEL_NODE_ID, LABEL_JOB_NAME, LABEL_STAGE, LABEL_OPERATOR_ID, LABEL_SOURCE_ID, LABEL_SINK_ID, LABEL_ERROR_TYPE
Appendix B: Reference Prometheus Rules
See Pillar 1 for SLO recording rules and burn rate alerts, and Pillar 7 for health state alerting rules.
Appendix C: Golden Signals PromQL Reference
# Latency (p50, p99)
histogram_quantile(0.5, sum(rate(volley_stage_latency_ms_bucket[5m])) by (le, job_name))
histogram_quantile(0.99, sum(rate(volley_stage_latency_ms_bucket[5m])) by (le, job_name))
# Traffic (records/sec by pipeline)
sum(rate(volley_source_records_polled_total[5m])) by (job_name)
# Errors (error rate by type)
sum(rate(volley_errors_total[5m])) by (job_name, error_type)
# Error ratio
sum(rate(volley_errors_total[5m])) by (job_name)
/
sum(rate(volley_records_processed_total[5m])) by (job_name)
# Saturation (channel utilization, higher = more backpressure)
avg(volley_channel_utilization) by (job_name, stage)
# Saturation (Kafka consumer lag)
sum(volley_kafka_consumer_lag) by (job_name)