

Rust in Python: How We Achieved 150× with PyO3 & WebAssembly

1. Why Rust for Quantitative Finance

Python dominates quantitative finance thanks to its ecosystem (NumPy, pandas, scikit-learn) and rapid prototyping. But when the inner loop is a Monte Carlo simulation with \(10^6\) paths, or an order-book processor handling \(10^5\) events/second, CPython interpreter overhead becomes the bottleneck.
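To see that overhead in isolation, here is a minimal stdlib-only micro-benchmark (illustrative, not our production measurement): the same reduction written as a Python-level loop and as the C-implemented `sum()` builtin. Both do identical arithmetic, so the ratio between the two timings is pure interpreter overhead.

```python
import timeit

# A pure-Python loop pays bytecode dispatch and float boxing on every
# iteration; the C-implemented sum() builtin pays them once per call.
data = [0.1] * 1_000_000

def loop_sum(xs):
    total = 0.0
    for x in xs:          # one bytecode dispatch + one boxed float per element
        total += x
    return total

t_loop = timeit.timeit(lambda: loop_sum(data), number=5)
t_builtin = timeit.timeit(lambda: sum(data), number=5)

print(f"loop: {t_loop:.3f}s  builtin sum: {t_builtin:.3f}s  "
      f"ratio: {t_loop / t_builtin:.1f}x")
```

The exact ratio depends on the machine and CPython version; the point is that the arithmetic itself is a small fraction of the loop's cost.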

💡 Key finding: only 8% of Python execution time is useful computation. The remaining 92% is interpreter overhead (bytecode dispatch, object boxing, function calls).
Where CPU time goes in Python: CPython bytecode dispatch 45%, object boxing (malloc) 28%, np.random calls 18%, actual computation 8%, GC and cache effects 1%. Rust eliminates the 92% of overhead: Rust + PyO3 runs 150× faster (0.8 s vs 120 s).
Figure 1: CPU time breakdown in a Python Monte Carlo simulation. Rust eliminates interpreter overhead.

Comparison of Acceleration Approaches

| Approach | Speedup | Pros | Cons |
|---|---|---|---|
| Vectorised NumPy | 5–20× | Easy, no build step | Memory-hungry |
| Numba JIT | 30–80× | Simple decorator | Fragile compilation, limited types |
| Cython | 20–50× | Mature | Verbose syntax, manual memory management |
| C++ + pybind11 | 100–200× | Maximum performance | ⚠️ Segfaults, UB, complex builds |
| Rust + PyO3 | 100–200× | ✅ Safe, parallel, clean | Rust learning curve |

2. Concrete Example: Monte Carlo Rough Heston

Here is the classic Python implementation of a Monte Carlo Rough Heston pricer. With 5,000 paths and 100 time steps, it takes ~120 seconds:

import numpy as np

def rh_mc_put_python(S, K, T, r, H, nu, rho, kappa, theta, v0,
                     n_paths=5000, n_steps=100):
    dt = T / n_steps
    sqrt_dt = np.sqrt(dt)
    
    payoffs = np.zeros(n_paths)
    for p in range(n_paths):
        S_t, V_t = S, v0
        for i in range(n_steps):
            Z1 = np.random.standard_normal()
            Z2 = rho * Z1 + np.sqrt(1 - rho**2) * np.random.standard_normal()
            V_t = max(V_t + kappa * (theta - V_t) * dt
                      + nu * np.sqrt(max(V_t, 0)) * sqrt_dt * Z2, 1e-8)
            S_t *= np.exp((r - 0.5 * V_t) * dt
                          + np.sqrt(max(V_t, 0)) * sqrt_dt * Z1)
        payoffs[p] = max(K - S_t, 0)
    
    return np.exp(-r * T) * np.mean(payoffs)
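For comparison with the "Vectorised NumPy" rung of the table above, here is a hedged sketch of the same scheme with the path loop replaced by array operations, so the interpreter executes n_steps iterations instead of n_paths × n_steps. The function name `rh_mc_put_numpy` and its `rng` parameter are our illustrative additions; the update order mirrors the reference loop exactly (including the fact that H is accepted but unused).

```python
import numpy as np

def rh_mc_put_numpy(S, K, T, r, H, nu, rho, kappa, theta, v0,
                    n_paths=5000, n_steps=100, rng=None):
    # Vectorised over paths: each time step updates all paths at once.
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    sqrt_dt = np.sqrt(dt)
    S_t = np.full(n_paths, float(S))
    V_t = np.full(n_paths, float(v0))
    for _ in range(n_steps):
        Z1 = rng.standard_normal(n_paths)
        Z2 = rho * Z1 + np.sqrt(1 - rho**2) * rng.standard_normal(n_paths)
        # Same update order as the loop version: V_t first, then S_t uses it.
        V_t = np.maximum(V_t + kappa * (theta - V_t) * dt
                         + nu * np.sqrt(np.maximum(V_t, 0)) * sqrt_dt * Z2,
                         1e-8)
        S_t *= np.exp((r - 0.5 * V_t) * dt
                      + np.sqrt(np.maximum(V_t, 0)) * sqrt_dt * Z1)
    return np.exp(-r * T) * np.mean(np.maximum(K - S_t, 0.0))
```

This buys an order of magnitude, at the cost of allocating full `n_paths`-sized arrays at every step; it is the memory-hungry middle ground before dropping to Rust.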

And here is the same logic in Rust with PyO3 and Rayon for parallelisation:

use pyo3::prelude::*;
use rayon::prelude::*;
use rand::prelude::*;
use rand_distr::StandardNormal;

// Euler scheme for a single path, mirroring the Python loop body.
fn simulate_path(
    spot: f64, t: f64, r: f64, _h: f64, nu: f64, rho: f64,
    kappa: f64, theta: f64, v0: f64, n_steps: usize,
) -> (f64, f64) {
    let mut rng = thread_rng();
    let dt = t / n_steps as f64;
    let sqrt_dt = dt.sqrt();
    let (mut s_t, mut v_t) = (spot, v0);
    for _ in 0..n_steps {
        let z1: f64 = rng.sample(StandardNormal);
        let z2: f64 = rho * z1
            + (1.0 - rho * rho).sqrt() * rng.sample::<f64, _>(StandardNormal);
        v_t = (v_t + kappa * (theta - v_t) * dt
            + nu * v_t.max(0.0).sqrt() * sqrt_dt * z2)
            .max(1e-8);
        s_t *= ((r - 0.5 * v_t) * dt + v_t.max(0.0).sqrt() * sqrt_dt * z1).exp();
    }
    (s_t, v_t)
}

#[pyfunction]
fn rh_mc_put(
    spot: f64, k: f64, t: f64, r: f64, h: f64, nu: f64, rho: f64,
    kappa: f64, theta: f64, v0: f64, n_paths: usize, n_steps: usize,
) -> f64 {
    let payoffs: Vec<f64> = (0..n_paths)
        .into_par_iter()  // ← Rayon: automatic parallelisation
        .map(|_| {
            let (s_t, _) = simulate_path(spot, t, r, h, nu, rho,
                                         kappa, theta, v0, n_steps);
            (k - s_t).max(0.0)
        })
        .collect();

    let mean = payoffs.iter().sum::<f64>() / n_paths as f64;
    (-r * t).exp() * mean
}
✅ Result: The Rust version executes in 0.8 seconds, 150× faster than Python, and matches the Python price to within Monte Carlo error.

3. Production Benchmarks

Here are the speedups measured on our production infrastructure (Apple M2 Pro, 12 cores):

Figure: execution time per module (log scale). For the Rough Heston MC: Python 120 s, Numba 36 s, Rust 0.8 s ⚡

| Module | Python | Rust | Speedup |
|---|---|---|---|
| Monte Carlo Rough Heston | 120 s | 0.8 s | 150× |
| Order Book Aggregation | 45 ms | 0.23 ms | 196× |
| Greeks (bump & reprice) | 8.2 s | 52 ms | 158× |
| Regime Detection (HMM) | 2.1 s | 18 ms | 117× |
| Portfolio Optimization | 340 ms | 3.8 ms | 89× |
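The "Greeks (bump & reprice)" row refers to finite-difference Greeks: reprice with a bumped input and take a difference quotient. Here is a minimal stdlib-only illustration using a closed-form Black–Scholes put (the helpers `bs_put` and `delta_bump_reprice` are ours, not from the production codebase, where the bumped pricer would be the Monte Carlo one above):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    # Standard normal CDF via the error function (stdlib only).
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_put(S, K, T, r, sigma):
    # Closed-form Black-Scholes European put.
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return K * exp(-r * T) * norm_cdf(-d2) - S * norm_cdf(-d1)

def delta_bump_reprice(pricer, S, *args, h=1e-4):
    # Central difference: reprice at S+h and S-h, divide by the bump width.
    return (pricer(S + h, *args) - pricer(S - h, *args)) / (2 * h)

delta = delta_bump_reprice(bs_put, 100.0, 100.0, 1.0, 0.02, 0.2)
```

Each Greek costs two extra repricings, which is why a 150× faster pricer translates almost directly into the 158× speedup in the table.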

4. Railway-Oriented Programming

Beyond raw performance, Rust brings a fundamental functional programming pattern for critical systems: Railway-Oriented Programming (ROP). This pattern, popularised by Scott Wlaschin, uses the Result<T, E> type for composable error handling.

Figure: the two-track railway: 1 validate() → 2 transform() → 3 compute() → 4 persist(). Each function returns Result<T, E>; errors divert onto the Err track.

Example of an ROP pipeline in Rust:

use anyhow::{Result, Context};

fn process_order(raw: &str) -> Result<ExecutedOrder> {
    let order = parse_order(raw)
        .context("Failed to parse order")?;           // ← Switch 1
    
    let validated = validate_order(order)
        .context("Order validation failed")?;          // ← Switch 2
    
    let priced = compute_price(&validated)
        .context("Pricing engine error")?;             // ← Switch 3
    
    let executed = execute_order(priced)
        .context("Execution failed")?;                 // ← Switch 4
    
    Ok(executed)
}

// Call site: no try/catch, the error lives in the type
match process_order(raw_data) {
    Ok(order) => log::info!("Executed: {:?}", order),
    Err(e) => log::error!("Pipeline failed: {:?}", e),
}
🚃 Railway-Oriented Programming: Each step can fail and "derail" to the error track. Rust's ? operator automatically propagates errors without verbosity or unhandled exceptions.
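For readers coming from Python, the two-track idea can be approximated without Rust. This is an illustrative sketch (the `Ok`/`Err` classes and the order helpers are hypothetical, not a library API); `and_then` plays the role of Rust's `?` operator:

```python
from dataclasses import dataclass

@dataclass
class Ok:
    value: object
    def and_then(self, f):
        return f(self.value)   # success track: run the next step

@dataclass
class Err:
    error: str
    def and_then(self, f):
        return self            # error track: skip every later step

def parse_qty(raw):
    return Ok(int(raw)) if raw.isdigit() else Err(f"not a number: {raw!r}")

def validate(qty):
    return Ok(qty) if qty > 0 else Err("quantity must be positive")

def price(qty):
    return Ok(qty * 101.5)     # hypothetical fill price

def process(raw):
    # The whole pipeline is one expression; any Err short-circuits the rest.
    return parse_qty(raw).and_then(validate).and_then(price)
```

Unlike Rust, nothing forces the caller to check the result; the `Result<T, E>` type plus the `?` operator is what makes the pattern safe rather than merely tidy.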

Polarway: Our High-Performance Data Engine

Polarway is our data engine built on Polars, the 100% Rust DataFrame library that is 31–72× faster than pandas. Polarway adds features specific to high-frequency finance:

import polarway as pw

# Lazy pipeline with automatic query optimisation
pipeline = (
    pw.scan_parquet("trades/*.parquet")
    .filter(pw.col("volume") > 1000)
    .with_columns([
        pw.col("price").rolling_mean(window_size=60).alias("vwap_60s"),
        pw.col("price").pct_change().alias("returns"),
    ])
    .group_by_dynamic("timestamp", every="1m")
    .agg([
        pw.col("price").last().alias("close"),
        pw.col("volume").sum().alias("volume"),
        pw.col("returns").std().alias("realized_vol"),
    ])
)

# Parallel execution across all cores
df = pipeline.collect()  # ~50× faster than the equivalent pandas code

📚 Polarway Documentation

Explore the complete documentation with examples and API reference.


5. Rust + WebAssembly: The Browser-Native Future

Beyond PyO3, Rust also compiles to WebAssembly (WASM) — enabling quantitative calculations to run directly in the browser with near-native performance:

The 4 KB hft_wasm_compute.wasm module runs at near-native speed on any WebAssembly runtime: Chrome V8, Firefox SpiderMonkey, Safari JavaScriptCore, Edge, and Node.js. Exposed functions: black_scholes(), implied_vol(), monte_carlo(), greeks().
Figure 2: WASM architecture — the same Rust code runs in the browser with native performance.

6. Polarway: Architecture and Performance

Polarway combines the strengths of Polars (Apache Arrow query engine) with HFT-specific extensions:

The stack, top to bottom: 🐍 Python API (polarway, polars) → PyO3 FFI bridge → 🦀 Polars query engine in Rust (lazy evaluation, predicate pushdown, parallel execution, SIMD vectorisation, memory-mapped I/O) → Apache Arrow columnar format (Parquet, Arrow IPC, Delta Lake, DuckDB). Overall: 31–72× faster than pandas.
Figure 3: Polarway stack — from Python to optimised I/O via Rust and Apache Arrow.
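To make "lazy evaluation" and "predicate pushdown" concrete, here is a toy Python sketch (our illustration, not Polars internals): operations are recorded into a plan rather than executed, and filters are moved ahead of the other steps when the plan finally runs.

```python
class LazyFrame:
    """Toy lazy pipeline: records operations, runs them only on collect()."""
    def __init__(self, rows):
        self.rows = rows
        self.plan = []                      # recorded, not executed

    def filter(self, pred):
        self.plan.append(("filter", pred))
        return self

    def select(self, *cols):
        self.plan.append(("select", cols))
        return self

    def collect(self):
        # "Pushdown": stable sort puts filters first, so later (more
        # expensive) steps only touch the surviving rows.
        plan = sorted(self.plan, key=lambda op: op[0] != "filter")
        rows = self.rows
        for kind, arg in plan:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows

trades = [{"price": 101.0, "volume": 2000},
          {"price": 99.5, "volume": 10}]
out = (LazyFrame(trades)
       .select("price", "volume")
       .filter(lambda r: r["volume"] > 1000)
       .collect())
```

In the real engine the optimiser does far more (projection pushdown, common-subplan caching, streaming), but the principle is the same: because nothing executes until `collect()`, the whole query is visible and can be rewritten before any I/O happens.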

7. Conclusion & Resources

By migrating the critical 20% of our codebase to Rust with PyO3, we achieved speedups of 50× to 200× while keeping a clean Python API. Adding WebAssembly allows running the same computations in the browser.

🎯 Takeaways:
  • Rust + PyO3 = C++ performance + memory safety
  • Rayon = trivial loop parallelisation
  • Railway-Oriented Programming = composable error handling
  • WASM = browser portability without recompilation
  • Polarway/Polars = DataFrames 31-72× faster than pandas

📚 Resources

🚀 Try HFThot Lab

Experiment with our Rust-accelerated labs: Monte Carlo, Greeks, Portfolio Optimization...
