ML Inference
The volley-ml crate adds machine learning inference operators to Volley pipelines. Run models embedded in-process or call external model servers — all as standard stream operators.
Backend Configuration
Configure the ML backend once and share it across operators:
#![allow(unused)]
fn main() {
use std::sync::Arc;
use volley_ml::prelude::*;
// CPU-only ONNX (simplest)
let ml = Arc::new(MlBackendConfig::onnx_cpu());
// ONNX on CUDA GPU with limits
let ml = Arc::new(MlBackendConfig::onnx_cuda(0)
.with_num_threads(4)
.with_memory_limit(4 * 1024 * 1024 * 1024));
// Candle on CPU (pure Rust, no C++ toolchain)
let ml = Arc::new(MlBackendConfig::candle_cpu());
// Candle on Metal (macOS GPU)
let ml = Arc::new(MlBackendConfig::candle_metal());
// ONNX on macOS Metal / Neural Engine via CoreML
let ml = Arc::new(MlBackendConfig::onnx_coreml());
// CoreML with specific compute units and model format
let ml = Arc::new(MlBackendConfig::onnx_coreml()
.with_coreml_compute_units(CoreMLComputeUnits::CpuAndGpu)
.with_coreml_model_format(CoreMLModelFormat::MLProgram));
// Candle with explicit architecture and HF revision
let ml = Arc::new(MlBackendConfig::candle_cpu()
.with_architecture(CandleArchitecture::Bert)
.with_hf_revision("v1.0"));
}
Loading ONNX Models from HuggingFace Hub
Pass a HuggingFace repo ID as the model path — the backend downloads and caches the ONNX model and tokenizer automatically:
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_cpu());
stream
.embed(EmbeddingConfig::new(
"sentence-transformers/all-MiniLM-L6-v2", // HF repo ID
"text",
ml.clone(),
))
.await?
}
Models are cached in ~/.cache/huggingface/hub/ and reused across runs.
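The cache layout follows the Hub convention of encoding the repo ID into a directory name. A minimal sketch of how the cache path is derived (the exact layout is an implementation detail of the HuggingFace cache, and `hub_cache_dir` is an illustrative name, not a volley-ml API):

```rust
use std::path::PathBuf;

/// Illustrative sketch: build the Hub cache directory for a repo ID,
/// following the `models--{org}--{name}` naming convention used by the
/// HuggingFace cache. Not a volley-ml function.
fn hub_cache_dir(home: &str, repo_id: &str) -> PathBuf {
    let encoded = format!("models--{}", repo_id.replace('/', "--"));
    PathBuf::from(home)
        .join(".cache/huggingface/hub")
        .join(encoded)
}

fn main() {
    let dir = hub_cache_dir("/home/user", "sentence-transformers/all-MiniLM-L6-v2");
    println!("{}", dir.display());
    // /home/user/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2
}
```

Deleting a model's directory under the cache root forces a fresh download on the next run.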
Revision Pinning
Pin a specific revision with @:
#![allow(unused)]
fn main() {
EmbeddingConfig::new("user/model@v1.0", "text", ml.clone())
}
Or set it on the config:
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_cpu()
.with_hf_revision("abc123"));
}
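The `@` syntax splits the model path into a repo ID and an optional revision. A sketch of how such parsing could work (`split_revision` is an illustrative name, not the crate's actual internals):

```rust
/// Illustrative sketch: split a `repo@revision` model path into its
/// repo ID and optional revision. Not volley-ml's actual parser.
fn split_revision(model_path: &str) -> (&str, Option<&str>) {
    match model_path.rsplit_once('@') {
        Some((repo, rev)) if !rev.is_empty() => (repo, Some(rev)),
        _ => (model_path, None),
    }
}

fn main() {
    assert_eq!(split_revision("user/model@v1.0"), ("user/model", Some("v1.0")));
    // No `@` means no inline revision; a config-level revision
    // (with_hf_revision) would apply instead.
    assert_eq!(split_revision("user/model"), ("user/model", None));
}
```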
ONNX Model Variants
Many repos include optimized variants. Select one with with_onnx_model_file():
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_cpu()
.with_onnx_model_file("onnx/model_O2.onnx")); // O2 optimized
}
Common variants:
- onnx/model.onnx — standard (default)
- onnx/model_O2.onnx — O2 optimized
- onnx/model_qint8_arm64.onnx — quantized for ARM64
Operators
Embedding (.embed())
Generate vector embeddings from a text column:
#![allow(unused)]
fn main() {
stream
.embed(EmbeddingConfig::new("models/e5-small.onnx", "text", ml.clone())
.with_output_column("embedding")
.with_embedding_dim(384))
.await?
}
Appends a FixedSizeList<Float32> column to the output.
Classification (.classify())
Classify records and get label + confidence score:
#![allow(unused)]
fn main() {
stream
.classify(ClassifyConfig::new("models/fraud.onnx", "features", ml.clone())
.with_label_column("is_fraud")
.with_score_column("fraud_score")
.with_labels(vec!["legitimate", "fraud"]))
.await?
}
Appends Utf8 label and Float32 score columns.
Generic Inference (.infer())
Run any ONNX model on selected columns:
#![allow(unused)]
fn main() {
stream
.infer(InferenceConfig::new("models/custom.onnx", ml.clone())
.with_input_columns(vec!["feature_1", "feature_2"])
.with_output_columns(vec![
("prediction", DataType::Float32),
]))
.await?
}
External Model Server (.infer_remote())
Call OpenAI-compatible APIs (vLLM, TGI, LiteLLM):
#![allow(unused)]
fn main() {
stream
.infer_remote(RemoteInferenceConfig::new(
"http://localhost:8000/v1/embeddings",
ApiFormat::OpenAI,
)
.with_input_columns(vec!["text"])
.with_output_columns(vec![("embedding", DataType::Float32)])
.with_max_concurrent(32)
.with_timeout(Duration::from_secs(10)))
.await?
}
Candle Backend
The Candle backend provides pure-Rust inference using HuggingFace’s Candle library. No C++ toolchain required.
Loading from HuggingFace Hub
Pass a HuggingFace repo ID as the model path — the backend downloads and caches model weights, config, and tokenizer automatically:
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::candle_cpu());
stream
.embed(EmbeddingConfig::new(
"sentence-transformers/all-MiniLM-L6-v2", // HF repo ID
"text",
ml.clone(),
))
.await?
}
Pin a specific revision with @:
#![allow(unused)]
fn main() {
EmbeddingConfig::new("user/model@v1.0", "text", ml.clone())
}
Or set it on the config:
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::candle_cpu()
.with_hf_revision("abc123"));
}
Loading from Local Files
Point to a local directory containing model.safetensors, config.json, and optionally tokenizer.json:
#![allow(unused)]
fn main() {
stream
.embed(EmbeddingConfig::new("/models/my-bert", "text", ml.clone()))
.await?
}
Supported Architectures
The backend auto-detects the architecture from the model’s config.json (model_type field):
| model_type | Architecture | Use Cases |
|---|---|---|
| bert, distilbert, roberta | BERT | Embeddings, classification |
Override auto-detection with with_architecture():
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::candle_cpu()
.with_architecture(CandleArchitecture::Bert));
}
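The auto-detection step can be sketched as a match on the `model_type` string (illustrative only; the crate's actual detection logic and enum variants may differ):

```rust
/// Illustrative sketch of architecture auto-detection from the
/// `model_type` field of config.json. Not volley-ml's actual code.
#[derive(Debug, PartialEq)]
enum CandleArchitecture {
    Bert,
}

fn detect_architecture(model_type: &str) -> Option<CandleArchitecture> {
    match model_type {
        // BERT-family models share one implementation
        "bert" | "distilbert" | "roberta" => Some(CandleArchitecture::Bert),
        // Unknown: caller must set with_architecture() explicitly
        _ => None,
    }
}

fn main() {
    assert_eq!(detect_architecture("distilbert"), Some(CandleArchitecture::Bert));
    assert_eq!(detect_architecture("gpt2"), None);
}
```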
Tokenizer Integration
When the tokenizers feature is enabled and the model has a tokenizer.json, text columns are automatically tokenized before the forward pass. No manual tokenization needed.
volley-ml = { version = "0.8.2", features = ["candle", "tokenizers"] }
GPU Acceleration
GPU inference is 10–100× faster for transformer models. Volley supports CUDA (NVIDIA) and Metal (macOS) GPUs.
CUDA (NVIDIA GPUs)
ONNX Runtime: CUDA is handled at runtime via execution providers — no
compile-time feature flag needed. Use MlBackendConfig::onnx_cuda(device_id):
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_cuda(0)); // GPU 0
}
If CUDA is unavailable, the ONNX Runtime silently falls back to CPU.
Candle: Requires the cuda compile-time feature (pulls in cudarc CUDA
bindings):
volley-ml = { version = "0.8.2", features = ["candle", "tokenizers", "cuda"] }
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::candle_cuda(0));
}
Prerequisites: NVIDIA driver and CUDA toolkit must be installed. Verify
with nvidia-smi or volley doctor.
Metal (macOS GPUs)
Requires the metal compile-time feature:
volley-ml = { version = "0.8.2", features = ["candle", "tokenizers", "metal"] }
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::candle_metal());
}
CoreML (ONNX on macOS)
The CoreML execution provider accelerates ONNX models on macOS using Metal GPU
and/or the Apple Neural Engine. Requires the coreml compile-time feature:
volley-ml = { version = "0.8.3", features = ["onnx", "coreml"] }
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_coreml());
}
If CoreML is unavailable (e.g., on Linux), the ONNX Runtime silently falls back to CPU.
Partial operator coverage: CoreML does not support every ONNX operation. Unsupported ops automatically fall back to CPU within the same session. For example, DistilBERT has 387 graph nodes but only 17 are CoreML-eligible, so most computation still runs on CPU, and CoreML compilation overhead can make the "GPU" path slower than pure CPU for small models or small batch sizes. Use GPU=cpu to benchmark both paths. Convolution-heavy architectures (vision models) tend to have better CoreML coverage than attention-heavy transformers.
Compute units control which hardware CoreML dispatches to:
| Value | Hardware |
|---|---|
| All (default) | CPU + GPU + Neural Engine |
| CpuAndGpu | CPU + Metal GPU only |
| CpuAndNeuralEngine | CPU + Neural Engine only |
| CpuOnly | CPU only (debugging) |
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_coreml()
.with_coreml_compute_units(CoreMLComputeUnits::CpuAndGpu));
}
Model format controls the internal CoreML representation:
| Format | Requirement | Notes |
|---|---|---|
| NeuralNetwork (default) | macOS 10.15+ | Broadest compatibility |
| MLProgram | macOS 12+ | More operators, potentially faster |
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_coreml()
.with_coreml_model_format(CoreMLModelFormat::MLProgram));
}
GPU Detection
Check GPU availability at runtime without loading a model:
#![allow(unused)]
fn main() {
let info = volley_ml::gpu::detect_gpu();
println!("{}", info.summary);
// "CUDA: disabled (enable `cuda` feature); Metal: available; CoreML: available"
if info.cuda_available {
println!("Using GPU with {} CUDA device(s)", info.cuda_device_count);
}
if info.coreml_available {
println!("CoreML available for ONNX Metal acceleration");
}
}
The volley doctor CLI command also reports GPU status.
Feature Flags
Add volley-ml to your Cargo.toml with the features you need:
[dependencies]
# ONNX Runtime backend (CPU)
volley-ml = { version = "0.8.2", features = ["onnx"] }
# ONNX on CUDA GPU (runtime EP registration, no compile-time flag)
volley-ml = { version = "0.8.2", features = ["onnx"] }
# Candle backend with tokenizers (pure Rust, CPU)
volley-ml = { version = "0.8.2", features = ["candle", "tokenizers"] }
# Candle on CUDA GPU
volley-ml = { version = "0.8.2", features = ["candle", "tokenizers", "cuda"] }
# Candle on macOS Metal GPU
volley-ml = { version = "0.8.2", features = ["candle", "tokenizers", "metal"] }
# ONNX on macOS CoreML (Metal GPU + Neural Engine)
volley-ml = { version = "0.8.3", features = ["onnx", "coreml"] }
# External model servers
volley-ml = { version = "0.8.2", features = ["remote-http"] }
Backend Comparison
| | ONNX Runtime (onnx) | Candle (candle) |
|---|---|---|
| Language | C++ via FFI | Pure Rust |
| Model format | ONNX (universal) | SafeTensors / HF Hub |
| Model loading | Local file or HuggingFace Hub | Local file or HuggingFace Hub |
| Tokenizer | External (user handles) | Built-in (tokenizers feature) |
| GPU | CUDA, TensorRT, CoreML (coreml feature) | CUDA, Metal |
| Best for | Any ONNX-exportable model | HF transformer models |
| Build | Needs C++ toolchain (auto-downloaded) | Pure cargo build |
Performance Tuning
Session Pooling (ONNX)
By default, a single ONNX session is shared across all inference calls. For concurrent pipelines, pool multiple sessions to eliminate lock contention:
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::onnx_cuda(0)
.with_session_pool_size(4)); // 4 sessions, round-robin
}
Each session is an independent model copy. A pool of 2–4 is usually sufficient.
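Round-robin dispatch over a fixed-size pool can be sketched with an atomic counter (illustrative; the crate's actual pooling code may differ):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative sketch of round-robin selection over a pool of N
/// sessions. Not volley-ml's actual implementation.
struct SessionPool {
    size: usize,
    next: AtomicUsize,
}

impl SessionPool {
    fn new(size: usize) -> Self {
        Self { size, next: AtomicUsize::new(0) }
    }

    /// Each call returns the next session index, wrapping around.
    /// Lock-free, so concurrent callers never contend on a mutex.
    fn pick(&self) -> usize {
        self.next.fetch_add(1, Ordering::Relaxed) % self.size
    }
}

fn main() {
    let pool = SessionPool::new(4);
    let picks: Vec<usize> = (0..6).map(|_| pool.pick()).collect();
    assert_eq!(picks, vec![0, 1, 2, 3, 0, 1]);
}
```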
Embedding Cache
Enable an LRU cache to skip inference for repeated text:
#![allow(unused)]
fn main() {
stream
.embed(EmbeddingConfig::new("model.onnx", "text", ml.clone())
.with_cache(10_000)) // Cache up to 10,000 unique embeddings
.await?
}
Identical input strings are served from cache. This can eliminate 50–90% of model calls in pipelines with repeated product names, log messages, etc.
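The cache behavior can be sketched as a bounded map keyed by input text. For brevity this sketch evicts in FIFO order; a real LRU also reorders entries on hit, and the crate's implementation may differ:

```rust
use std::collections::{HashMap, VecDeque};

/// Illustrative bounded text -> embedding cache. Eviction here is
/// FIFO for brevity; a true LRU would also move entries to the back
/// on each cache hit. Not volley-ml's actual cache.
struct EmbeddingCache {
    capacity: usize,
    map: HashMap<String, Vec<f32>>,
    order: VecDeque<String>, // front = oldest insertion
}

impl EmbeddingCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    fn get(&self, text: &str) -> Option<&Vec<f32>> {
        self.map.get(text)
    }

    fn insert(&mut self, text: String, embedding: Vec<f32>) {
        if self.map.len() >= self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest); // evict the oldest entry
            }
        }
        self.order.push_back(text.clone());
        self.map.insert(text, embedding);
    }
}

fn main() {
    let mut cache = EmbeddingCache::new(2);
    cache.insert("a".into(), vec![1.0]);
    cache.insert("b".into(), vec![2.0]);
    cache.insert("c".into(), vec![3.0]); // evicts "a"
    assert!(cache.get("a").is_none());
    assert!(cache.get("c").is_some());
}
```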
Half-Precision Inference (FP16 / BF16)
Load Candle model weights in half precision for 2× memory savings and faster GPU inference on hardware with tensor cores:
#![allow(unused)]
fn main() {
let ml = Arc::new(MlBackendConfig::candle_cuda(0)
.with_dtype(ModelDType::F16));
}
On CPU, F16/BF16 are automatically promoted to F32 since most CPUs lack native half-precision arithmetic.
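The promotion rule can be sketched as a function of the requested dtype and the target device (illustrative; the enum here mirrors the `ModelDType` name from the example above but is not the crate's actual type):

```rust
/// Illustrative sketch of the dtype promotion rule: half-precision
/// types are kept on GPU but promoted to F32 on CPU. Not volley-ml's
/// actual logic.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ModelDType { F16, BF16, F32 }

fn effective_dtype(requested: ModelDType, on_gpu: bool) -> ModelDType {
    match (requested, on_gpu) {
        // CPU lacks native half-precision arithmetic: promote to F32
        (ModelDType::F16, false) | (ModelDType::BF16, false) => ModelDType::F32,
        // GPU (or already F32): keep as requested
        (dtype, _) => dtype,
    }
}

fn main() {
    assert_eq!(effective_dtype(ModelDType::F16, true), ModelDType::F16);
    assert_eq!(effective_dtype(ModelDType::F16, false), ModelDType::F32);
    assert_eq!(effective_dtype(ModelDType::F32, false), ModelDType::F32);
}
```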
Remote Batch Accumulation
When upstream batches are small, buffer them before sending to the model server:
#![allow(unused)]
fn main() {
RemoteInferenceConfig::new("http://model-server:8000/v1/embeddings", ApiFormat::OpenAI)
.with_batch_accumulation(64, Duration::from_millis(50))
.with_max_concurrent(16)
}
This sends a single HTTP request once 64 rows accumulate or 50ms elapse, reducing round trips.
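The size-or-timeout flush policy can be sketched as follows (illustrative only; `Accumulator` is a hypothetical name, and the real operator runs the timeout on an async timer rather than checking it on push):

```rust
use std::time::{Duration, Instant};

/// Illustrative sketch of size-or-timeout batch accumulation: flush
/// when `max_rows` rows are buffered or `max_wait` has elapsed since
/// the first buffered row. Not volley-ml's actual operator.
struct Accumulator {
    buf: Vec<String>,
    first_at: Option<Instant>,
    max_rows: usize,
    max_wait: Duration,
}

impl Accumulator {
    fn new(max_rows: usize, max_wait: Duration) -> Self {
        Self { buf: Vec::new(), first_at: None, max_rows, max_wait }
    }

    /// Push a row; returns a full batch to send when a flush triggers.
    fn push(&mut self, row: String, now: Instant) -> Option<Vec<String>> {
        if self.first_at.is_none() {
            self.first_at = Some(now);
        }
        self.buf.push(row);
        let timed_out = now.duration_since(self.first_at.unwrap()) >= self.max_wait;
        if self.buf.len() >= self.max_rows || timed_out {
            self.first_at = None;
            Some(std::mem::take(&mut self.buf)) // hand off the batch
        } else {
            None
        }
    }
}

fn main() {
    let mut acc = Accumulator::new(3, Duration::from_millis(50));
    let t0 = Instant::now();
    assert!(acc.push("r1".into(), t0).is_none());
    assert!(acc.push("r2".into(), t0).is_none());
    assert_eq!(acc.push("r3".into(), t0).unwrap().len(), 3); // size trigger
}
```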
Running the Example
The ML inference example works out of the box with zero configuration. It auto-downloads a sentiment classification model from HuggingFace Hub and auto-detects Metal (CoreML) on macOS:
# Just works — downloads model from Hub, auto-detects Metal on macOS
cargo run --example ml_inference_pipeline -p volley-examples --features ml
Override defaults with environment variables:
- MODEL_PATH=./my-model.onnx — use a local ONNX model instead of downloading from Hub
- GPU=cpu — force CPU backend (disables Metal/CoreML auto-detection)
Recent Bug Fixes
- Hub tokenizer resolution — tokenizer download now checks sibling paths (e.g., onnx/tokenizer.json) when tokenizer.json is not at the repo root.
- Softmax normalization — classification scores now use softmax normalization, producing proper [0, 1] probabilities instead of raw logits.
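The softmax normalization described above maps raw logits to probabilities that sum to 1. A minimal sketch, using the standard max-subtraction trick for numerical stability (not the crate's actual code):

```rust
/// Illustrative softmax: maps raw logits to [0, 1] probabilities that
/// sum to 1. Subtracting the max logit first avoids exp() overflow.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    let probs = softmax(&[2.0, 0.5]); // raw logits in, probabilities out
    assert!((probs.iter().sum::<f32>() - 1.0).abs() < 1e-6);
    assert!(probs[0] > probs[1]); // larger logit -> larger probability
}
```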
CLI Template
Scaffold an ML pipeline project with one command:
volley new --template ml-pipeline my-classifier
cd my-classifier
This generates a Kafka-to-Kafka pipeline with ONNX classification:
- src/main.rs — reads text from Kafka, classifies with .classify(), writes enriched records to the output topic
- Cargo.toml — includes volley-ml with the onnx and tokenizers features
- .env.example — Kafka connection and MODEL_PATH configuration
Edit src/main.rs to customize:
- Change the model: update MODEL_PATH and the label list in .classify()
- Swap operators: replace .classify() with .embed() for embeddings or .infer() for generic inference
- Enable GPU: change MlBackendConfig::onnx_cpu() to MlBackendConfig::onnx_cuda(0)
- Use Candle: switch to MlBackendConfig::candle_cpu() for pure-Rust inference
- Add remote inference: chain .infer_remote(RemoteInferenceConfig::new(...))