ML Inference
The volley-ml crate adds machine learning inference operators to Volley pipelines. Run models embedded in-process or call external model servers — all as standard stream operators.
Backend Configuration
Configure the ML backend once and share it across operators:
#![allow(unused)]
fn main() {
use std::sync::Arc;
use volley_ml::prelude::*;
// CPU-only ONNX (simplest)
let ml = Arc::new(MlBackendConfig::onnx_cpu());
// ONNX on CUDA GPU with limits
let ml = Arc::new(MlBackendConfig::onnx_cuda(0)
.with_num_threads(4)
.with_memory_limit(4 * 1024 * 1024 * 1024));
// Candle on CPU (pure Rust, no C++ toolchain)
let ml = Arc::new(MlBackendConfig::candle_cpu());
// Candle on Metal (macOS GPU)
let ml = Arc::new(MlBackendConfig::candle_metal());
// ONNX on macOS Metal / Neural Engine via CoreML
let ml = Arc::new(MlBackendConfig::onnx_coreml());
// CoreML with specific compute units and model format
let ml = Arc::new(MlBackendConfig::onnx_coreml()
.with_coreml_compute_units(CoreMLComputeUnits::CpuAndGpu)
.with_coreml_model_format(CoreMLModelFormat::MLProgram));
// Candle with explicit architecture and HF revision
let ml = Arc::new(MlBackendConfig::candle_cpu()
.with_architecture(CandleArchitecture::Bert)
.with_hf_revision("v1.0"));
}
Loading ONNX Models from HuggingFace Hub
Pass a HuggingFace repo ID as the model path — the backend downloads and caches the ONNX model and tokenizer automatically:
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_cpu());
stream
.embed(EmbeddingConfig::new(
"sentence-transformers/all-MiniLM-L6-v2", // HF repo ID
"text",
ml.clone(),
))
.await?
}
Models are cached in ~/.cache/huggingface/hub/ and reused across runs.
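The cache layout follows the Hub convention of encoding the repo ID into a directory name. A minimal sketch of how the cache path is derived (the exact layout is an implementation detail of the HuggingFace cache, and `hub_cache_dir` is an illustrative name, not a volley-ml API):

```rust
use std::path::PathBuf;

/// Illustrative sketch: build the Hub cache directory for a repo ID,
/// following the `models--{org}--{name}` naming convention used by the
/// HuggingFace cache. Not a volley-ml function.
fn hub_cache_dir(home: &str, repo_id: &str) -> PathBuf {
    let encoded = format!("models--{}", repo_id.replace('/', "--"));
    PathBuf::from(home)
        .join(".cache/huggingface/hub")
        .join(encoded)
}

fn main() {
    let dir = hub_cache_dir("/home/user", "sentence-transformers/all-MiniLM-L6-v2");
    println!("{}", dir.display());
    // /home/user/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2
}
```

Deleting a model's directory under the cache root forces a fresh download on the next run.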
Revision Pinning
Pin a specific revision with @:
#![allow(unused)]
fn main() {
EmbeddingConfig::new("user/model@v1.0", "text", ml.clone())
}
Or set it on the config:
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_cpu()
.with_hf_revision("abc123"));
}
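The `@` syntax splits the model path into a repo ID and an optional revision. A sketch of how such parsing could work (`split_revision` is an illustrative name, not the crate's actual internals):

```rust
/// Illustrative sketch: split a `repo@revision` model path into its
/// repo ID and optional revision. Not volley-ml's actual parser.
fn split_revision(model_path: &str) -> (&str, Option<&str>) {
    match model_path.rsplit_once('@') {
        Some((repo, rev)) if !rev.is_empty() => (repo, Some(rev)),
        _ => (model_path, None),
    }
}

fn main() {
    assert_eq!(split_revision("user/model@v1.0"), ("user/model", Some("v1.0")));
    // No `@` means no inline revision; a config-level revision
    // (with_hf_revision) would apply instead.
    assert_eq!(split_revision("user/model"), ("user/model", None));
}
```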
ONNX Model Variants
Many repos include optimized variants. Select one with with_onnx_model_file():
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_cpu()
.with_onnx_model_file("onnx/model_O2.onnx")); // O2 optimized
}
Common variants:
- onnx/model.onnx — standard (default)
- onnx/model_O2.onnx — O2 optimized
- onnx/model_qint8_arm64.onnx — quantized for ARM64
Operators
Embedding (.embed())
Generate vector embeddings from a text column:
#![allow(unused)]
fn main() {
stream
.embed(EmbeddingConfig::new("models/e5-small.onnx", "text", ml.clone())
.with_output_column("embedding")
.with_embedding_dim(384))
.await?
}
Appends a FixedSizeList<Float32> column to the output.
Classification (.classify())
Classify records and get label + confidence score:
#![allow(unused)]
fn main() {
stream
.classify(ClassifyConfig::new("models/fraud.onnx", "features", ml.clone())
.with_label_column("is_fraud")
.with_score_column("fraud_score")
.with_labels(vec!["legitimate", "fraud"]))
.await?
}
Appends Utf8 label and Float32 score columns.
Generic Inference (.infer())
Run any ONNX model on selected columns:
#![allow(unused)]
fn main() {
stream
.infer(InferenceConfig::new("models/custom.onnx", ml.clone())
.with_input_columns(vec!["feature_1", "feature_2"])
.with_output_columns(vec![
("prediction", DataType::Float32),
]))
.await?
}
External Model Server (.infer_remote())
Call OpenAI-compatible APIs (vLLM, TGI, LiteLLM):
#![allow(unused)]
fn main() {
stream
.infer_remote(RemoteInferenceConfig::new(
"http://localhost:8000/v1/embeddings",
ApiFormat::OpenAI,
)
.with_input_columns(vec!["text"])
.with_output_columns(vec![("embedding", DataType::Float32)])
.with_max_concurrent(32)
.with_timeout(Duration::from_secs(10)))
.await?
}
Candle Backend
The Candle backend provides pure-Rust inference using HuggingFace’s Candle library. No C++ toolchain required.
Loading from HuggingFace Hub
Pass a HuggingFace repo ID as the model path — the backend downloads and caches model weights, config, and tokenizer automatically:
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::candle_cpu());
stream
.embed(EmbeddingConfig::new(
"sentence-transformers/all-MiniLM-L6-v2", // HF repo ID
"text",
ml.clone(),
))
.await?
}
Pin a specific revision with @:
#![allow(unused)]
fn main() {
EmbeddingConfig::new("user/model@v1.0", "text", ml.clone())
}
Or set it on the config:
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::candle_cpu()
.with_hf_revision("abc123"));
}
Loading from Local Files
Point to a local directory containing model.safetensors, config.json, and optionally tokenizer.json:
#![allow(unused)]
fn main() {
stream
.embed(EmbeddingConfig::new("/models/my-bert", "text", ml.clone()))
.await?
}
Supported Architectures
The backend auto-detects the architecture from the model’s config.json (model_type field):
| model_type | Architecture | Use Cases |
|---|---|---|
| bert, distilbert, roberta | BERT | Embeddings, classification |
Override auto-detection with with_architecture():
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::candle_cpu()
.with_architecture(CandleArchitecture::Bert));
}
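The auto-detection step can be sketched as a match on the `model_type` string (illustrative only; the crate's actual detection logic and enum variants may differ):

```rust
/// Illustrative sketch of architecture auto-detection from the
/// `model_type` field of config.json. Not volley-ml's actual code.
#[derive(Debug, PartialEq)]
enum CandleArchitecture {
    Bert,
}

fn detect_architecture(model_type: &str) -> Option<CandleArchitecture> {
    match model_type {
        // BERT-family models share one implementation
        "bert" | "distilbert" | "roberta" => Some(CandleArchitecture::Bert),
        // Unknown: caller must set with_architecture() explicitly
        _ => None,
    }
}

fn main() {
    assert_eq!(detect_architecture("distilbert"), Some(CandleArchitecture::Bert));
    assert_eq!(detect_architecture("gpt2"), None);
}
```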
Tokenizer Integration
When the tokenizers feature is enabled and the model has a tokenizer.json, text columns are automatically tokenized before the forward pass. No manual tokenization needed.
volley-ml = { version = "0.8.2", features = ["candle", "tokenizers"] }
GPU Acceleration
GPU inference is 10–100× faster for transformer models. Volley supports CUDA (NVIDIA) and Metal (macOS) GPUs.
CUDA (NVIDIA GPUs)
ONNX Runtime: CUDA is handled at runtime via execution providers — no
compile-time feature flag needed. Use MlBackendConfig::onnx_cuda(device_id):
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_cuda(0)); // GPU 0
}
If CUDA is unavailable, the ONNX Runtime silently falls back to CPU.
Candle: Requires the cuda compile-time feature (pulls in cudarc CUDA
bindings):
volley-ml = { version = "0.8.2", features = ["candle", "tokenizers", "cuda"] }
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::candle_cuda(0));
}
Prerequisites: NVIDIA driver and CUDA toolkit must be installed. Verify
with nvidia-smi or volley doctor.
Metal (macOS GPUs)
Requires the metal compile-time feature:
volley-ml = { version = "0.8.2", features = ["candle", "tokenizers", "metal"] }
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::candle_metal());
}
CoreML (ONNX on macOS)
The CoreML execution provider accelerates ONNX models on macOS using Metal GPU
and/or the Apple Neural Engine. Requires the coreml compile-time feature:
volley-ml = { version = "0.8.3", features = ["onnx", "coreml"] }
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_coreml());
}
If CoreML is unavailable (e.g., on Linux), the ONNX Runtime silently falls back to CPU.
Partial operator coverage: CoreML does not support every ONNX operation. Unsupported ops automatically fall back to CPU within the same session. For example, DistilBERT has 387 graph nodes but only 17 are CoreML-eligible, so most computation still runs on CPU, and CoreML compilation overhead can make the "GPU" path slower than pure CPU for small models or small batch sizes. Use GPU=cpu to benchmark both paths. Convolution-heavy architectures (vision models) tend to have better CoreML coverage than attention-heavy transformers.
Compute units control which hardware CoreML dispatches to:
| Value | Hardware |
|---|---|
| All (default) | CPU + GPU + Neural Engine |
| CpuAndGpu | CPU + Metal GPU only |
| CpuAndNeuralEngine | CPU + Neural Engine only |
| CpuOnly | CPU only (debugging) |
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_coreml()
.with_coreml_compute_units(CoreMLComputeUnits::CpuAndGpu));
}
Model format controls the internal CoreML representation:
| Format | Requirement | Notes |
|---|---|---|
| NeuralNetwork (default) | macOS 10.15+ | Broadest compatibility |
| MLProgram | macOS 12+ | More operators, potentially faster |
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_coreml()
.with_coreml_model_format(CoreMLModelFormat::MLProgram));
}
GPU Detection
Check GPU availability at runtime without loading a model:
#![allow(unused)]
fn main() {
let info = volley_ml::gpu::detect_gpu();
println!("{}", info.summary);
// "CUDA: disabled (enable `cuda` feature); Metal: available; CoreML: available"
if info.cuda_available {
println!("Using GPU with {} CUDA device(s)", info.cuda_device_count);
}
if info.coreml_available {
println!("CoreML available for ONNX Metal acceleration");
}
}
The volley doctor CLI command also reports GPU status.
Feature Flags
Add volley-ml to your Cargo.toml with the features you need:
[dependencies]
# ONNX Runtime backend (CPU)
volley-ml = { version = "0.8.2", features = ["onnx"] }
# ONNX on CUDA GPU (runtime EP registration, no compile-time flag)
volley-ml = { version = "0.8.2", features = ["onnx"] }
# Candle backend with tokenizers (pure Rust, CPU)
volley-ml = { version = "0.8.2", features = ["candle", "tokenizers"] }
# Candle on CUDA GPU
volley-ml = { version = "0.8.2", features = ["candle", "tokenizers", "cuda"] }
# Candle on macOS Metal GPU
volley-ml = { version = "0.8.2", features = ["candle", "tokenizers", "metal"] }
# ONNX on macOS CoreML (Metal GPU + Neural Engine)
volley-ml = { version = "0.8.3", features = ["onnx", "coreml"] }
# External model servers
volley-ml = { version = "0.8.2", features = ["remote-http"] }
Backend Comparison
| | ONNX Runtime (onnx) | Candle (candle) |
|---|---|---|
| Language | C++ via FFI | Pure Rust |
| Model format | ONNX (universal) | SafeTensors / HF Hub |
| Model loading | Local file or HuggingFace Hub | Local file or HuggingFace Hub |
| Tokenizer | External (user handles) | Built-in (tokenizers feature) |
| GPU | CUDA, TensorRT, CoreML (coreml feature) | CUDA, Metal |
| Best for | Any ONNX-exportable model | HF transformer models |
| Build | Needs C++ toolchain (auto-downloaded) | Pure cargo build |
Performance Tuning
Session Pooling (ONNX)
By default, a single ONNX session is shared across all inference calls. For concurrent pipelines, pool multiple sessions to eliminate lock contention:
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_cuda(0)
.with_session_pool_size(4)); // 4 sessions, round-robin
}
Each session is an independent model copy. A pool of 2–4 is usually sufficient.
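Round-robin dispatch over a fixed-size pool can be sketched with an atomic counter (illustrative; the crate's actual pooling code may differ):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative sketch of round-robin selection over a pool of N
/// sessions. Not volley-ml's actual implementation.
struct SessionPool {
    size: usize,
    next: AtomicUsize,
}

impl SessionPool {
    fn new(size: usize) -> Self {
        Self { size, next: AtomicUsize::new(0) }
    }

    /// Each call returns the next session index, wrapping around.
    /// Lock-free, so concurrent callers never contend on a mutex.
    fn pick(&self) -> usize {
        self.next.fetch_add(1, Ordering::Relaxed) % self.size
    }
}

fn main() {
    let pool = SessionPool::new(4);
    let picks: Vec<usize> = (0..6).map(|_| pool.pick()).collect();
    assert_eq!(picks, vec![0, 1, 2, 3, 0, 1]);
}
```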
Embedding Cache
Enable an LRU cache to skip inference for repeated text:
#![allow(unused)]
fn main() {
stream
.embed(EmbeddingConfig::new("model.onnx", "text", ml.clone())
.with_cache(10_000)) // Cache up to 10,000 unique embeddings
.await?
}
Identical input strings are served from cache. This can eliminate 50–90% of model calls in pipelines with repeated product names, log messages, etc.
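The cache behavior can be sketched as a bounded map keyed by input text. For brevity this sketch evicts in FIFO order; a real LRU also reorders entries on hit, and the crate's implementation may differ:

```rust
use std::collections::{HashMap, VecDeque};

/// Illustrative bounded text -> embedding cache. Eviction here is
/// FIFO for brevity; a true LRU would also move entries to the back
/// on each cache hit. Not volley-ml's actual cache.
struct EmbeddingCache {
    capacity: usize,
    map: HashMap<String, Vec<f32>>,
    order: VecDeque<String>, // front = oldest insertion
}

impl EmbeddingCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    fn get(&self, text: &str) -> Option<&Vec<f32>> {
        self.map.get(text)
    }

    fn insert(&mut self, text: String, embedding: Vec<f32>) {
        if self.map.len() >= self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest); // evict the oldest entry
            }
        }
        self.order.push_back(text.clone());
        self.map.insert(text, embedding);
    }
}

fn main() {
    let mut cache = EmbeddingCache::new(2);
    cache.insert("a".into(), vec![1.0]);
    cache.insert("b".into(), vec![2.0]);
    cache.insert("c".into(), vec![3.0]); // evicts "a"
    assert!(cache.get("a").is_none());
    assert!(cache.get("c").is_some());
}
```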
Half-Precision Inference (FP16 / BF16)
Load Candle model weights in half precision for 2× memory savings and faster GPU inference on hardware with tensor cores:
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::candle_cuda(0)
.with_dtype(ModelDType::F16));
}
On CPU, F16/BF16 are automatically promoted to F32 since most CPUs lack native half-precision arithmetic.
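The promotion rule can be sketched as a function of the requested dtype and the target device (illustrative; the enum here mirrors the `ModelDType` name from the example above but is not the crate's actual type):

```rust
/// Illustrative sketch of the dtype promotion rule: half-precision
/// types are kept on GPU but promoted to F32 on CPU. Not volley-ml's
/// actual logic.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ModelDType { F16, BF16, F32 }

fn effective_dtype(requested: ModelDType, on_gpu: bool) -> ModelDType {
    match (requested, on_gpu) {
        // CPU lacks native half-precision arithmetic: promote to F32
        (ModelDType::F16, false) | (ModelDType::BF16, false) => ModelDType::F32,
        // GPU (or already F32): keep as requested
        (dtype, _) => dtype,
    }
}

fn main() {
    assert_eq!(effective_dtype(ModelDType::F16, true), ModelDType::F16);
    assert_eq!(effective_dtype(ModelDType::F16, false), ModelDType::F32);
    assert_eq!(effective_dtype(ModelDType::F32, false), ModelDType::F32);
}
```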
Remote Batch Accumulation
When upstream batches are small, buffer them before sending to the model server:
#![allow(unused)]
fn main() {
RemoteInferenceConfig::new("http://model-server:8000/v1/embeddings", ApiFormat::OpenAI)
.with_batch_accumulation(64, Duration::from_millis(50))
.with_max_concurrent(16)
}
This sends a single HTTP request once 64 rows accumulate or 50ms elapse, reducing round trips.
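The size-or-timeout flush policy can be sketched as follows (illustrative only; `Accumulator` is a hypothetical name, and the real operator runs the timeout on an async timer rather than checking it on push):

```rust
use std::time::{Duration, Instant};

/// Illustrative sketch of size-or-timeout batch accumulation: flush
/// when `max_rows` rows are buffered or `max_wait` has elapsed since
/// the first buffered row. Not volley-ml's actual operator.
struct Accumulator {
    buf: Vec<String>,
    first_at: Option<Instant>,
    max_rows: usize,
    max_wait: Duration,
}

impl Accumulator {
    fn new(max_rows: usize, max_wait: Duration) -> Self {
        Self { buf: Vec::new(), first_at: None, max_rows, max_wait }
    }

    /// Push a row; returns a full batch to send when a flush triggers.
    fn push(&mut self, row: String, now: Instant) -> Option<Vec<String>> {
        if self.first_at.is_none() {
            self.first_at = Some(now);
        }
        self.buf.push(row);
        let timed_out = now.duration_since(self.first_at.unwrap()) >= self.max_wait;
        if self.buf.len() >= self.max_rows || timed_out {
            self.first_at = None;
            Some(std::mem::take(&mut self.buf)) // hand off the batch
        } else {
            None
        }
    }
}

fn main() {
    let mut acc = Accumulator::new(3, Duration::from_millis(50));
    let t0 = Instant::now();
    assert!(acc.push("r1".into(), t0).is_none());
    assert!(acc.push("r2".into(), t0).is_none());
    assert_eq!(acc.push("r3".into(), t0).unwrap().len(), 3); // size trigger
}
```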
Running the Example
The ML inference example works out of the box with zero configuration. It auto-downloads a sentiment classification model from HuggingFace Hub and auto-detects Metal (CoreML) on macOS:
# Just works — downloads model from Hub, auto-detects Metal on macOS
cargo run --example ml_inference_pipeline -p volley-examples --features ml
Override defaults with environment variables:
- MODEL_PATH=./my-model.onnx — use a local ONNX model instead of downloading from Hub
- GPU=cpu — force CPU backend (disables Metal/CoreML auto-detection)
Recent Bug Fixes
- Hub tokenizer resolution — tokenizer download now checks sibling paths (e.g., onnx/tokenizer.json) when tokenizer.json is not at the repo root.
- Softmax normalization — classification scores now use softmax normalization, producing proper [0, 1] probabilities instead of raw logits.
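The softmax normalization described above maps raw logits to probabilities that sum to 1. A minimal sketch, using the standard max-subtraction trick for numerical stability (not the crate's actual code):

```rust
/// Illustrative softmax: maps raw logits to [0, 1] probabilities that
/// sum to 1. Subtracting the max logit first avoids exp() overflow.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    let probs = softmax(&[2.0, 0.5]); // raw logits in, probabilities out
    assert!((probs.iter().sum::<f32>() - 1.0).abs() < 1e-6);
    assert!(probs[0] > probs[1]); // larger logit -> larger probability
}
```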
CLI Template
Scaffold an ML pipeline project with one command:
volley new --template ml-pipeline my-classifier
cd my-classifier
This generates a Kafka-to-Kafka pipeline with ONNX classification:
- src/main.rs — reads text from Kafka, classifies with .classify(), writes enriched records to the output topic
- Cargo.toml — includes volley-ml with the onnx and tokenizers features
- .env.example — Kafka connection and MODEL_PATH configuration
Edit src/main.rs to customize:
- Change the model: update MODEL_PATH and the label list in .classify()
- Swap operators: replace .classify() with .embed() for embeddings or .infer() for generic inference
- Enable GPU: change MlBackendConfig::onnx_cpu() to MlBackendConfig::onnx_cuda(0)
- Use Candle: switch to MlBackendConfig::candle_cpu() for pure-Rust inference
- Add remote inference: chain .infer_remote(RemoteInferenceConfig::new(...))