Back to blog

Zero-Cost Async : De Python asyncio au runtime Tokio de Rust

Zero-Cost Async: From Python asyncio to Rust's Tokio Runtime

How Polarway replaces the GIL with a multi-threaded Tokio runtime — with Result/Option monads and zero exceptions in production.

🇫🇷 This article contains Rust and Python code. Examples are identical in both languages.

1. The problem: asyncio's hidden costs

Python asyncio is designed for concurrent I/O. It's an excellent tool for web applications, but it hides a truth every data engineer encounters sooner or later: the GIL (Global Interpreter Lock) prevents any real parallelism. And you don't discover this truth by reading documentation — you discover it at 2AM when PagerDuty alerts are firing and the Grafana dashboard shows a latency curve climbing like a staircase.

We learned this the hard way. Our first HFT ingestion pipeline was beautiful asyncio code — gather() everywhere, elegant coroutines, an architecture that looked like a textbook diagram. In staging, everything worked. In production, with 50 WebSockets open simultaneously and Polars DataFrames to deserialize between each tick, throughput plateaued at 70-80 queries per second. We added CPU cores, nothing changed. The profiler showed us the reality: 94% of execution time was sequential. The GIL turned our beautiful concurrency into a queue.

I remember the exact moment. We had just spent 3 weeks refactoring the pipeline to make it "properly asynchronous" — we followed every guide, split every operation into coroutines, added Python Semaphores for rate limiting. The code was almost beautiful. We deployed on a Friday evening (yes, I know), and Monday morning the trading team's Slack was on fire: "Why are ticks arriving with 200ms delay? We missed 3 arbitrage opportunities this morning." Three arbitrage opportunities. At $500-2000 each. In one morning. The cost of asyncio was no longer abstract — it had a price in euros.

An asyncio event loop runs one coroutine at a time on one thread. The illusion of concurrency comes from I/O releasing the GIL — but the moment your code touches a DataFrame, serializes JSON, or filters Parquet, you're back to sequential:

Python asyncio

Single-threaded event loop
GIL blocks CPU-bound work
Heap-allocated coroutines
No backpressure built-in
Exception-based errors
Saturates at ~70 QPS (GIL)

Rust + Tokio

Multi-threaded work-stealing
No GIL — true parallelism
Zero-cost state machines
Built-in backpressure (mpsc)
Monadic Result<T,E>
Linear scaling to 1200+ QPS

The difference isn't about optimization. You can't "optimize" asyncio to work around the GIL — it's an architectural limitation. Every workaround (ProcessPoolExecutor, multiprocessing, uvloop) adds complexity without solving the fundamental problem: Python executes one bytecode at a time. We tried each of these approaches. uvloop gives ~15% on scheduling, but the GIL remains. ProcessPoolExecutor requires serializing DataFrames between processes — which almost cancels out all the gains. The only real solution: escape the GIL entirely.

Let me be more specific about these workarounds, because each one cost us weeks of work before we understood we were wasting our time. uvloop was our first hope — a drop-in replacement that promises 2-4x on event loop scheduling. We installed it, benchmarked it: 15% gain on dispatch, zero gain on DataFrame serialization. The GIL doesn't care — it doesn't distinguish "fast" code from "slow" code, it locks everything the same. Next, ProcessPoolExecutor: we launch Parquet reads in separate processes. Sounds great, except each 50MB DataFrame needs to be pickle-serialized to cross the inter-process boundary. At 50 files, serialization time exceeds read time. We literally tried to parallelize a problem and created a worse one. Finally, multiprocessing.shared_memory for DataFrames: it works, but the code looks like 90s C++ — manual offsets, byte sizes, silent segfaults when you're off by one byte. That's not Python anymore, that's suffering with f-strings.

The fundamental difference is mathematical. Python's throughput complexity is bounded:

$$ T_{\text{asyncio}}(n) = O(n) \quad \text{(sequential, GIL-bound)} $$ $$ T_{\text{tokio}}(n, c) = O\!\left(\frac{n}{c}\right) \quad \text{where } c = \text{CPU cores} $$

Tokio divides work by the number of cores — asyncio cannot.

This asymmetry is what drove us to build Polarway. Not out of love for Rust (although, let's be honest, pattern matching on Result values is addictive), but because no amount of Python cleverness can transform $O(n)$ into $O(n/c)$. It's a language invariant, not a bug to fix. And once you accept that — truly, viscerally, not just intellectually — everything else follows naturally. You stop looking for the magic workaround. You stop reading Reddit threads promising that "Python 3.13 will change everything with sub-interpreter GIL removal". You make the hard decision: rewrite the critical layer in a language capable of real parallelism, and keep Python where it excels — user interfaces, interactive data science, prototyping. Each to its own.

2. The 4 pillars of Polarway

Polarway was designed around four non-negotiable performance invariants. I use the word "invariant" deliberately — not "goal", not "aspiration". An invariant, in mathematics, is a property that remains true regardless of the transformation applied. The four numbers below must hold true regardless of the dataset, regardless of the query pattern, regardless of the time of day. If any single one breaks, the system has regressed — and we treat it as a bug, not as acceptable degradation.

Why these four specifically? Because they correspond to the four failure modes we observed in production. Latency kills high-frequency trading — a tick arriving 2ms late is a useless tick, plain and simple. In a market that moves 0.5% in 100ms, 2ms is an eternity. Throughput limits how many feeds a single node can absorb: when you go from 10 crypto pairs to 40, you need 4x the throughput, not an excuse about why "it's enough for the demo". Memory determines whether your pipeline survives a volatile trading day without an OOM-kill at 3AM — and nothing is more depressing than getting a PagerDuty call because your server ate 64GB of RAM during a flash crash. And CPU scalability dictates your infrastructure cost — if every doubling of load requires a doubling of machines, your cloud bill diverges like a harmonic series.

⏱️
< 1ms
End-to-end latency
p50 = 0.8ms, p99 = 1.2ms
🚀
100k+/core
Throughput (ticks/second)
120k measured on c6i.4xlarge
💾
O(batch)
Constant memory
500MB @ 50GB dataset
⚙️
Linear
CPU scalability
17.1x @ 500 concurrent queries

Key invariant: memory stays at $O(\text{batch\_size})$ regardless of dataset size. Polarway processes 50 GB with 500 MB of RAM — where Polars fails with OOM.

Each pillar is measurable and verifiable. These are not aspirational goals — they're contracts. Our CI runs benchmarks on every PR and fails if any of the four invariants regresses. A commit that improves throughput but degrades p99 latency beyond 1.2ms is automatically rejected. It's rigid, but it's what allows us to promise these numbers to users rather than hope for them.

I'll be honest: this rigidity has a cost. Three times we've blocked a PR for over a week because a memory optimization degraded p99 by 0.1ms. Three times the developer wanted to argue that "0.1ms is within noise". Three times we refused. Because today's noise is tomorrow's regression. If you accept 0.1ms this week, you'll accept 0.2ms next week, and in six months you're at 3ms wondering "how did we get here?". CI as a gatekeeper isn't popular. But stable pipelines are very popular with teams that sleep at night.

3. Tokio: what Rust brings to the table

Tokio's runtime transforms Rust async functions into compiled state machines — no heap allocation, no interpreter scheduler overhead. That's what "zero-cost" means. And it's not a marketing slogan — it's a verifiable property: disassemble the binary, look for allocations, count them. Zero.

When we chose Tokio, we also evaluated Go (goroutines + channels), Zig (async stack-allocated frames), and even Glommio's io_uring engine for Tokio-free Rust. Each option had its merits. Go is easier to learn — no lifetimes, no borrow checker, a garbage collector that "just works". But goroutines are heap-allocated (8KB each minimum), the GC introduces unpredictable pauses, and Go channels don't have typing as strict as Tokio channels. Zig was tempting for total control — but the ecosystem in 2025 was too immature for production, and the lack of a standard package manager made the supply chain risky. Tokio won by default: it's the most production-tested async runtime in the industry (Cloudflare, Discord, Fly.io all use it), with a community that ships security fixes in hours, not weeks.

To understand why this matters, you need to look at what *actually* happens when you write await in Python vs Rust. In Python, each coroutine is a heap-allocated object with its own frame and stack. The asyncio scheduler — written in pure Python (yes, even with uvloop the dispatch is interpreted) — must choose which coroutine to wake on each loop iteration. In Rust, the compiler generates an enum where each variant represents a state of the future. No allocation, no dynamic dispatch, no indirection. It's strictly equivalent to a hand-written state machine — the compiler does the work for you.

This is a point people underestimate. When I say "the compiler generates an enum", it means your 50-line async function with 3 await points is transformed into a 4-variant enum (initial state + 3 suspension points) that fits in 128 bytes on the stack. In Python, the same function allocates a coroutine object (~200 bytes), a frame object (~400 bytes), and potentially a task wrapper (~100 bytes). Multiply by 10,000 concurrent coroutines and the difference is: Rust uses 1.2MB on the stack, Python uses 7MB on the heap — with allocations, fragmentation, and the garbage collector having to clean all of it up.

Python asyncio (Heap) Coroutine Object ~200 bytes (heap) Frame Object ~400 bytes (heap) Task Wrapper ~100 bytes (heap) GC Pressure fragmentation ×10,000 = 7 MB heap ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ allocated + fragmented Rust Tokio (Stack) enum FutureState { ... } 4 variants — compiled at build time State 0 State 1 State 2 Done 128 bytes on stack — zero alloc ×10,000 = 1.2 MB stack ▓▓▓▓▓▓▓ compact 5.8× less memory — zero GC

Memory representation: Python coroutine (heap, fragmented) vs Rust future (stack, compact).

Work-stealing: concurrent file reads

Let's look at the actual code. Tokio's JoinSet API is the centerpiece of our concurrent read strategy. Instead of reading Parquet files sequentially, we spawn each read in a separate Tokio task. The scheduler handles distributing these tasks across all available cores — and if one core finishes early, it steals work from its neighbors.

use tokio::task::JoinSet;

async fn concurrent_parquet_reads(paths: Vec<String>) -> Result<Vec<DataFrame>, Error> {
    let mut set = JoinSet::new();

    for path in paths {
        set.spawn(async move {
            // Each file read runs on a separate Tokio task
            // Work-stealing ensures optimal CPU utilization
            read_parquet(&path).await
        });
    }

    let mut results = Vec::new();
    while let Some(res) = set.join_next().await {
        results.push(res??);
    }

    Ok(results)
}

Notice the double ? in res??. The first ? unwraps the JoinError (did the task panic?), the second unwraps our business Error. This is quintessentially Rust: errors are explicit at every layer, and the compiler forces you to handle them. In Python with asyncio, a gather() with return_exceptions=True gives you a list of Union[result, Exception] — and it's up to you to sort good from bad with isinstance() checks. Here, the type system does the work.

Tokio's JoinSet is also smarter than asyncio.gather() on a subtle point: work-stealing. When you spawn 100 tasks, Tokio doesn't put them all in the same queue. Each worker thread has its own queue, and when a worker finishes early, it *steals* a task from another worker's queue. Result: CPU utilization is near-optimal without any intervention on your part.

Work-stealing is an elegant algorithm formalized by Blumofe and Leiserson at MIT in the 90s, and Tokio implements it with lock-free queues. The intuition is simple: imagine 4 cooks in a kitchen. Without work-stealing, each cook has their stack of orders and processes them sequentially — if cook 1 has 20 orders and cook 4 has only 2, cook 4 stands idle while cook 1 is overwhelmed. With work-stealing, cook 4 takes orders from cook 1's stack. Everyone works all the time. The beauty is that the stealing is nearly free: queues are implemented as lock-free deques, stealing happens from the opposite end of insertion (LIFO for the local worker, FIFO for the thief), and contention is virtually nonexistent. It's elegant, and it shows in our benchmarks: CPU utilization is consistently above 92% even with variable-duration tasks.

Tokio Work-Stealing Scheduler Worker Thread 1 Task A (running) Task B (queued) Task C (queued) Task D (queued) Task E (queued) Worker Thread 2 Task F (running) Task G (queued) Worker Thread 3 idle — stealing... Worker Thread 4 Task H (running) steal Task E

Work-stealing: thread 3 (idle) steals task E from thread 1's queue (overloaded).

On the Python side, the client code is nearly identical to a classic asyncio pipeline — but under the hood, Tokio distributes work across all cores. This is Polarway's central promise: the data scientist doesn't need to know Rust. They write standard Python, and the gRPC server does the rest. Notice the API returns Result values — each read can fail independently, and errors are filtered explicitly.

from polarway.async_client import AsyncPolarwayClient

async with AsyncPolarwayClient("localhost:50051") as client:
    # Read 100 files concurrently — Tokio work-steals across all cores
    results = await client.batch_read([
        f"data/file_{i:03d}.parquet" for i in range(100)
    ])

    handles = [r.unwrap() for r in results if r.is_ok()]

    # Concurrent collect
    tables = await client.batch_collect(handles)

    print(f"Processed {len(tables)} files concurrently")

Key point: the Python code barely changes. You replace polars.read_parquet() with client.batch_read() and gain 5x. No threads, no multiprocessing, no concurrent.futures.

This was our number one design goal: the data scientist sees *nothing* change in their workflow. Their code remains normal Python async/await. They don't need to know Rust, Tokio, or mpsc channels. The complexity is absorbed by the gRPC layer between the Python client and the Rust server. For them, it's just "Polars, but faster". And that's exactly what we wanted. A tool people *use* rather than a tool they *admire*.

There's a philosophy I borrowed from Rich Hickey (Clojure's creator): "Simple is not Easy." Building a tool that's *simple* to use (the Python API is 12 methods, not 120) requires *hard* work under the hood. The gRPC server does protocol buffering, automatic batching, connection pooling, Arrow zero-copy serialization. The Python client sees none of it. And that's hundreds of hours of design work: every time we added a parameter to the API, we asked "does the data scientist *need* to know this?". If the answer was no, we made it automatic. The best code is the code the user never has to write.

Built-in backpressure with Semaphore

The previous pattern spawns as many tasks as files — which is perfect when you have 100 files and 16 cores. But what happens when you have 10,000? You don't want 10,000 file descriptors open simultaneously. That's where Tokio's Semaphore comes in: it caps the number of concurrent tasks to a configurable maximum, and excess tasks wait cleanly for their turn.

use tokio::sync::Semaphore;
use std::sync::Arc;

async fn batch_with_backpressure(paths: Vec<String>) -> Result<(), Error> {
    let semaphore = Arc::new(Semaphore::new(10));  // Max 10 concurrent

    let mut handles = vec![];

    for path in paths {
        let sem = semaphore.clone();

        handles.push(tokio::spawn(async move {
            let _permit = sem.acquire().await?;  // Block if at capacity
            read_parquet(&path).await
        }));
    }

    for handle in handles {
        handle.await??;
    }

    Ok(())
}

Backpressure is a topic most asyncio tutorials completely ignore. In asyncio, if your producer sends messages faster than the consumer processes them, the asyncio.Queue grows without bound until OOM. With Tokio, the Semaphore above caps at 10 concurrent tasks. If an 11th arrives, it waits — no panic, no exception, just a clean suspension. It's the same pattern as hydraulic circuits: a pressure relief valve rather than a pipe that bursts.

I insist on backpressure because it was the number one cause of production incidents *before* Polarway. The classic scenario: the market panics (an Elon Musk tweet, a regulatory announcement), WebSockets send 10x their normal throughput, the asyncio.Queue swells from 1,000 to 100,000 messages in seconds, RAM jumps from 2GB to 30GB, and the Linux kernel kills the process. Recovery time: 3-5 minutes counting the restart, reconnecting to exchanges, and re-synchronizing state. During those 3-5 minutes, you're not trading, and the market continues without you. With a bounded Semaphore, the worst that happens is messages accumulate *on the producer side* (the WebSocket), which has its own TCP buffer. The consumer never sees more messages than it can handle. The system *slows down* instead of *crashing*. That's the difference between emergency braking and a car crash.

4. Monadic error handling — zero exceptions

Polarway uses zero exceptions. Every operation returns a Result<T, E> or Option<T>, exactly like Rust. This isn't just a convention — it's a type system property that makes errors impossible to ignore. And before you raise an eyebrow — yes, we're aware this is an unpopular position in Python. We stand by it.

This is probably the most controversial design decision in Polarway. In Python, exceptions are idiomatic — it's the EAFP pattern ("Easier to Ask Forgiveness than Permission"). The entire Python community is built on it. So why abandon them? Because EAFP is optimized for prototyping, not production. In a Jupyter notebook, try/except is perfect: you try something, it crashes, you fix it, you re-run. The feedback loop is 2 seconds. In production, the feedback loop is 3 hours — that's how long it takes for the bug to reach the right traffic percentile, trigger the alert, wake the on-call engineer, who opens the laptop, reads the logs, and figures out what happened. By then, the original exception is buried under 47 lines of traceback and 3 layers of retry logic.

Because in production, exceptions *lie*. A try/except gives you the illusion of handling errors, but in reality it catches everything — including errors you didn't anticipate. How many times have you seen except Exception as e: logger.error(e) silently swallowing a TypeError caused by a bug in *your* code, not in the I/O? Monads make every failure point visible and composable. You can't "forget" to handle an error — it's there, in the return type, and Pyright flags it if you ignore it.

Let me tell you about the incident that sealed our decision. June 2025. One of our ingestion pipelines had a try/except Exception that logged errors and continued. For 6 days, an AttributeError — caused by a schema change on Binance's side — was silently swallowed. The pipeline kept running, health metrics were green, but the data being written was corrupted: a field had been renamed, and our code was deserializing None where it expected a float. Six days of corrupted data. Six days of false backtests. And nobody noticed because the exception wasn't surfacing. With a Result[DataFrame, SchemaError], the problem would have been visible on the first query: schema doesn't match → Err(SchemaError::FieldMissing("price")) → pipeline stops, alert fires, and you lose 5 minutes, not 6 days.

Result monad

Here is what Polarway's Result looks like in practice. The API is designed to feel familiar to Python developers: no exotic syntax, no macros, just chained methods. You compose operations with .map(), handle errors with .or_else(), and chain fallible transformations with .and_then(). If this reminds you of Java streams or Rust's Option — that's exactly the idea.

from polarway.async_client import Result

# Chain operations with map (Functor)
result: Result[int, str] = Result.ok(42)
doubled = result.map(lambda x: x * 2)  # Ok(84)

# Handle errors with or_else
result = Result.err("Failed to read file")
recovered = result.or_else(lambda e: Result.ok(0))  # Ok(0)

# Compose with and_then (flatMap / Monad bind)
def safe_divide(x: int) -> Result[float, str]:
    if x == 0:
        return Result.err("Division by zero")
    return Result.ok(100 / x)

result = Result.ok(5).and_then(safe_divide)  # Ok(20.0)

If you come from Haskell or Scala, these patterns seem obvious. But making them ergonomic in Python was a challenge. Python has no do-notation syntax. We opted for fluent chaining (.map().and_then().or_else()) which remains readable even for a Python developer who has never heard of monads. The goal wasn't to turn people into functionalists — it was to make code more reliable without requiring them to understand category theory.

We made design mistakes along the way. Our first API exposed methods like bind() and fmap() — the academic names. One of our junior developers spent 20 minutes understanding what .bind(lambda x: Result.ok(x + 1)) did. My fault. We renamed everything: bindand_then, fmapmap, pureok. Same concepts, same mathematical guarantees, but a vocabulary any Python developer can understand in 30 seconds. Haskell purists will resent us. But our target isn't Haskell purists — it's the data scientist who wants robust pipelines without having to read "Learn You a Haskell for Great Good".

Option monad

Option follows the same philosophy as Result, but for a different case: the absence of a value. Instead of returning None and crossing your fingers that the caller checks, Option makes absence explicit and composable. Every operation on an Option.nothing() is a silent no-op — no crash, no surprise NoneType.

from polarway.async_client import Option

# Handle nullable values
opt = Option.some(42)
doubled = opt.map(lambda x: x * 2)  # Some(84)

# Provide defaults
opt = Option.nothing()
value = opt.unwrap_or(0)  # 0

# Chain with and_then
opt = (Option.some("data.parquet")
       .and_then(lambda path: read_file(path))
       .and_then(lambda df: filter_df(df)))

Option eliminates an entire class of bugs: NoneType has no attribute. In standard Python, None is an accident waiting to happen — it propagates silently through 15 layers of code before exploding in a completely unexpected place. With Option, the absence of a value is encoded in the type. You're *forced* to decide what to do when there's nothing: provide a default with unwrap_or(), propagate with and_then(), or fail explicitly with unwrap().

Tony Hoare, the inventor of null, called it his "billion dollar mistake". Since 2009, every programming language conference cites that quote. And yet in 2026, the majority of production Python code still uses None as a return value with zero guardrails. Optional[int] in type hints is a step in the right direction, but Pyright won't stop you from calling .upper() on an Optional[str] at runtime — it *warns* you, and you ignore the warning because "it works in dev". Polarway's Option is different: opt.map(str.upper) never crashes, even if opt is Nothing. Safety isn't on the developer's honor — it's in the structure.

Functor: the categorical foundation

Result and Option are functors in the category-theoretic sense: they implement fmap which preserves structure. In Rust:

$$ \text{fmap} : (A \to B) \to F(A) \to F(B) $$ $$ \text{fmap}(f, \text{Ok}(x)) = \text{Ok}(f(x)) \qquad \text{fmap}(f, \text{Err}(e)) = \text{Err}(e) $$

Functor laws: fmap id = id and fmap (g . f) = fmap g . fmap f

Here is the concrete implementation of the Functor trait in Rust, applied to Result. The fmap method takes a function f: T → U and applies it to the contained value — if and only if the Result is Ok. In case of error, it returns the error untouched. This is the very definition of a functor: applying a transformation without changing the structure.

trait Functor<T> {
    type Output<U>;
    fn fmap<U, F>(self, f: F) -> Self::Output<U>
    where
        F: FnOnce(T) -> U;
}

impl<T, E> Functor<T> for Result<T, E> {
    type Output<U> = Result<U, E>;

    fn fmap<U, F>(self, f: F) -> Result<U, E>
    where
        F: FnOnce(T) -> U,
    {
        self.map(f)
    }
}

And here is the Python equivalent. Our implementation of Result.map() is a 1:1 mirror of the Rust Functor above. The difference is subtle but important: in Rust, the compiler *guarantees* that fmap respects the functor laws. In Python, it's our test suite that plays that role — 180 unit tests verify the algebraic properties of each monad.

class Result(Generic[T, E]):
    def map(self, f: Callable[[T], U]) -> Result[U, E]:
        """Functor: fmap for Result"""
        if self.is_ok():
            return Result.ok(f(self._value))
        return Result.err(self._error)

The beauty of the Functor is its composability. You can chain as many .map() calls as you want without ever checking whether the result is Ok or Err. If it's an error, it passes through the entire chain untouched — like water flowing through a closed pipe. This isn't magic, it's algebra: the functor laws guarantee that fmap preserves structure. And since Rust verifies these laws at compile time, we have mathematical certainty that our error pipeline is correct.

This is where category theory stops being an academic discussion topic and becomes an engineering tool. When I read a pipeline that does result.map(normalize).map(filter).map(aggregate), I know — with *mathematical* certainty, not just intuition — that if result is an error, all three operations are short-circuited and the original error is preserved intact. I don't need to read the code of normalize, filter, or aggregate. The functor laws guarantee it. This is a form of trust that try/except can never offer — with exceptions, you have to read *every* called function to know if it can throw, and *what type* of exception. The type system does the code review's job.

Monadic Pipeline — Error Short-Circuiting Ok path: Ok(data) .map() Ok(normalize) .map() Ok(filter) .map() Ok(aggregate) ✓ Result Err path: Err(e) skip — skipped — skip — skipped — skip — skipped — Err(e) Error propagates untouched through entire chain — zero runtime checks needed

Monadic pipeline: error passes through the chain untouched, each .map() is automatically short-circuited.

In practice: file processing without exceptions

Enough theory — here is what a *complete* pipeline looks like without a single exception. This code reads files in batch, filters successful results, logs errors functionally, and collects DataFrames. Not a single try/except in sight. Every failure point is visible in the types.

async def process_files(paths: list[str]) -> list[pd.DataFrame]:
    """Process files without exceptions — pure monadic pipeline"""

    async with AsyncPolarwayClient("localhost:50051") as client:
        # Read files — returns list[Result[Handle, Error]]
        results = await client.batch_read(paths)

        # Filter successful reads (no try/except!)
        handles = [r.unwrap() for r in results if r.is_ok()]

        # Log errors functionally
        for r in results:
            r.map_err(lambda e: print(f"⚠️ Read failed: {e}"))

        # Collect DataFrames
        tables = await client.batch_collect(handles)

        return [t.unwrap() for t in tables if t.is_ok()]

Zero try/except. Errors compose with map, and_then, or_else. If an error is ignored, the code won't compile (Rust) or the type checker flags it (Python + Pyright).

5. Real-world: WebSocket ingestion at 100k ticks/s

Here is the complete architecture of a real-time ingestion pipeline used in production at HFThot. This isn't a documentation example — it's code extracted (and simplified) from our actual system. If you work on market data ingestion, this pattern is probably directly applicable to your infrastructure.

This pipeline ingests crypto market data (BTC, ETH, and 30+ altcoins) from Binance WebSockets, transforms them into Polars DataFrames, and persists them as Parquet partitioned by day. It all runs on a single node with 4 cores. Before Polarway, this same pipeline required 3 nodes with Kafka in between to absorb the load. Today, infrastructure cost is divided by 3 and latency by 10.

The migration story deserves to be told in detail, because it contains lessons that benchmarks don't show. The old system had 3 components: a Python asyncio ingester that read WebSockets and pushed to Kafka, a Python consumer that read from Kafka and wrote Parquet, and an Airflow scheduler that orchestrated hourly batches. Three services, three deployments, three log stores, three Grafana dashboards. When something broke (and believe me, something broke at least once a week), you had to triangulate across 3 systems to understand whether the problem was the ingester, Kafka, or the consumer. Mean time to diagnose an incident was 45 minutes. With Polarway, it's a single process, a single log file, and mean time to diagnose dropped to 8 minutes.

BEFORE — 3 services, 3 failure points Python Ingester asyncio + WS Kafka 3 brokers + ZK Consumer Python → Parquet ✗ 3 deployments · 3 log stores · 3 Grafana dashboards ✗ MTTD: 45 min · Weekly restarts · OOM under spike Migration AFTER — 1 process, 0 Kafka Python Client Tokio mpsc bounded(1000) Rust Worker ✓ 1 deployment · 1 log file · MTTD: 8 min ✓ 4 months zero incidents · Infra cost ÷3 · Latency ÷10 Migration Impact Services 3 1 MTTD 45 min 8 min Restarts/week ~1 0 Kafka brokers 3 + ZK 0 Log stores 3 1 Nodes (CPU) 3 nodes 1 node Infra cost: ÷3 Latency: ÷10

Before/After: migration from 3 services + Kafka to a single Polarway process.

Python Client
await client.stream()
Tokio mpsc
Bounded channel (1000)
Rust Worker
serde + DataFrame
Parquet / Delta
Persistent storage

Rust Server: WebSocket → Tokio Channel → Storage

Here is the heart of the ingestion server. The architecture follows a classic producer-consumer pattern, but with a fundamental difference: the bounded mpsc channel creates natural backpressure between incoming WebSockets and the storage engine. Each WebSocket connection is handled in a separate Tokio task, and ticks are deserialized via serde_json before being sent into the channel.

use tokio::sync::mpsc;
use tokio_tungstenite::accept_async;

async fn websocket_ingest() -> Result<(), Error> {
    let (tx, mut rx) = mpsc::channel(1000); // Bounded = backpressure

    let listener = TcpListener::bind("0.0.0.0:8080").await?;

    // Accept WebSocket connections
    while let Ok((stream, _)) = listener.accept().await {
        let tx = tx.clone();

        tokio::spawn(async move {
            let ws = accept_async(stream).await?;
            let (_, mut receiver) = ws.split();

            while let Some(Ok(msg)) = receiver.next().await {
                let tick: MarketTick = serde_json::from_str(&msg.to_text()?)?;
                tx.send(tick).await?; // Backpressure if channel full
            }

            Ok::<_, Error>(())
        });
    }

    // Process ticks — separate Tokio task, parallel to receivers
    while let Some(tick) = rx.recv().await {
        let df = tick_to_dataframe(tick)?;
        store_to_polarway(df).await?;
    }

    Ok(())
}

The detail that changes everything: the mpsc channel is *bounded* at 1000. When the consumer (Parquet storage) is slow, the channel fills up. When it's full, the tx.send(tick).await call *suspends* the producer instead of dropping the message. The producer automatically slows down to the consumer's pace. This is native backpressure, with no configuration, no tuning, no Kafka. In asyncio, this same pattern requires an asyncio.Queue(maxsize=1000) — except the asyncio Queue is single-threaded and only supports a single efficient producer.

For people who've used Kafka, this simplification might seem suspicious. "How can you replace Kafka with an in-memory channel?" The answer is: we don't replace Kafka in the general case. Kafka is indispensable for distributed multi-consumer systems with replay. What we replace is the *misuse* of Kafka — when you use it as a buffer between two components *on the same machine* because asyncio has no backpressure. That's like taking a plane to cross the street. The mpsc channel is the right tool for intra-process communication, and Kafka remains the right tool for inter-cluster communication. Both have their place. But when you can solve the problem with 3 lines of Tokio instead of a Kafka cluster to administer, you sleep better at night.

Python Client: streaming with auto-reconnect

On the client side, the Python code is deliberately minimal. The polarway.stream() API returns an async iterator that you consume with a standard async for. Batching is handled by the user (here, every 1,000 ticks), and each write returns a Result rather than an exception — which lets you log errors without interrupting the flow.

from polarway.async_client import AsyncPolarwayClient, Result

async def ingest_market_data():
    async with AsyncPolarwayClient("localhost:50051") as polarway:
        batch: list[dict] = []

        async for tick in polarway.stream("btcusdt@trade"):
            batch.append(tick)

            if len(batch) >= 1000:
                # Batch write — returns Result, not exception
                result: Result = await polarway.write_batch(batch)
                result.map(lambda n: print(f"✅ Stored {n} ticks"))
                result.map_err(lambda e: print(f"⚠️ Write failed: {e}"))
                batch.clear()

Notice the complete absence of try/except in this production code. Each write returns a Result: success or failure, explicitly. Errors are logged via map_err without interrupting the flow. If a batch fails, the next one takes over. The stream never stops. This is both simpler and more robust than the equivalent asyncio version with its nested try/except ConnectionError blocks and finally: await cleanup() handlers.

This production code has been running 24/7 for 4 months. Zero unplanned restarts. Zero data loss. Zero on-call incidents related to ingestion. Before Polarway, the same pipeline required a weekly restart on average — sometimes due to OOM (asyncio Queue without bounds), sometimes due to an uncaught exception killing the event loop, sometimes due to a deadlock in the connection pool. The most frustrating were the "ghost" incidents: those resolved by systemctl restart that left no trace in the logs because the exception had been swallowed. With monads, every error is visible, every failover is explicit, and the log tells the complete story. The on-call engineer (often me, at 3AM) no longer needs to *guess* — they read.

6. Benchmarks with commentary

All benchmarks are run on AWS c6i.4xlarge (16 vCPU, 32 GB RAM) with a dataset of 50 Parquet files × 100 MB = 5 GB. Source code is available on GitHub and detailed results on ReadTheDocs. We publish benchmarks not to impress — anyone can publish favorable numbers — but to be held accountable. If you reproduce our tests and get different results, open an issue.

An important note on our methodology: each benchmark is run 10 times, the first 2 runs are discarded (warmup), and we report the median of the remaining 8. The dataset is generated from real market data (BTC/USDT ticks) replayed into Parquet files with Snappy compression. This isn't a synthetic benchmark with random integers — it's the same type of data our pipelines process in production.

Why the median and not the mean? Because the mean lies as much as exceptions do. A single outlier run (a Python GC triggering at the wrong time, a noisy neighbor on the AWS hypervisor) can shift the mean by 15%. The median is robust to outliers. Why discard the first 2 runs? Because CPU caches (L1/L2/L3), kernel page tables, and Python's potential JIT are cold on the first run. Comparing a cold first run to a warm run is misleading. This is methodological honesty, not cheating — and it's a nuance that most "Rust is 10x faster" blog posts completely ignore.

6.1 Batch read: speedup grows with load

Files Polars (sync) Polarway (async) Speedup
102.3s0.6s3.8x
5011.5s2.8s4.1x
10023.0s4.5s5.1x

Why does speedup increase? Tokio's work-stealing distributes I/O tasks across $c$ cores. With more files, Python's GIL saturation becomes the dominant bottleneck.

What's striking is the *non-linearity* on the Polars side. 10 files in 2.3s, 100 files in 23s — that's perfectly linear, $O(n)$. Polars doesn't parallelize file reads even if your machine has 16 cores. On the Polarway side, 10 files in 0.6s, 100 files in 4.5s — that's *not* linear. Speedup increases because the more tasks there are, the better Tokio can distribute the work. That's the theoretical advantage of work-stealing: the parallelism factor converges to the number of cores as $n \to \infty$.

An important nuance: Polars is an *extraordinary* tool. Ritchie Vink and his team have done remarkable work. The problem isn't Polars — it's that Polars lives in Python's universe, and that universe has a glass ceiling called the GIL. Polars parallelizes *within* a single operation (the Apache Arrow engine uses rayon for parallel scans), but it doesn't parallelize *between* operations when orchestrated from Python. That's the crucial distinction: intra-operation vs inter-operation parallelism. Polarway solves the latter by moving orchestration into Tokio. These aren't competing tools — they're complementary. Polarway uses Polars under the hood for DataFrame operations. We added the missing parallelism, we didn't replace the library.

6.2 Memory: constant vs linear

Dataset Polars (mem) Polarway (mem) Status
1 GB1.2 GB0.5 GB✅ / ✅
5 GB5.8 GB0.5 GB✅ / ✅
10 GB11.5 GB0.5 GB⚠️ / ✅
50 GBOOM ❌0.5 GB❌ / ✅

$$ M_{\text{polars}}(n) = O(n) \qquad M_{\text{polarway}}(n) = O(\text{batch\_size}) = O(1) $$

Polarway memory is independent of dataset size.

This is the benchmark that convinced our trading team. The "50 GB → OOM" line is the reality they face every week: a backtest on 3 months of tick-by-tick data for 40 crypto pairs easily exceeds 32 GB of RAM. The standard solution was to rent r7i.8xlarge machines (256 GB RAM, ~$1.60/h). With Polarway in streaming mode, a c6i.4xlarge (32 GB RAM, ~$0.68/h) is enough. Over a year of daily backtests, that's thousands of euros in savings — without changing a single strategy.

The technical trick is simple but powerful: Polarway never loads the entire dataset into memory. It reads Parquet files by row group (typically 50,000-100,000 rows), processes them, emits the result, and *frees the memory*. Python's garbage collector has nothing to do — Arrow buffers are allocated and freed on the Rust side with a deterministic allocator (jemalloc). Memory is a flat plateau, not an ascending ramp. It's the difference between reading a book page by page (O(1) memory) and photocopying the entire book before starting to read (O(n)). The metaphor is obvious, but it took 6 months of Rust to implement correctly — especially the zero-copy between row groups and the gRPC layer. The devil is in the buffers.

6.3 Concurrent queries: linear scalability

Queries per second (QPS)

1 query
Polars
10
Polarway
10
10 queries
Polars
25
Polarway
95
100 queries
Polars
60
Polarway
650
500 queries
Polars
70
Polarway 🔥
1200

The GIL wall: Polars saturates at ~70 QPS regardless of query count. Polarway scales linearly: $\text{QPS} \approx 2.4 \times n_{\text{queries}}$.

This chart is what we show skeptics. The Polars plateau at ~70 QPS is visible to the naked eye — that's the GIL wall. No matter how many queries you throw at it, Python can't process more than 70 per second because each query requires a minimum of CPU work (deserialization, filtering) that cannot be parallelized. On the Polarway side, the curve is quasi-linear. At 500 concurrent queries, we're at 1200 QPS — 17.1x above Python's ceiling. And the curve doesn't yet show saturation, suggesting the real limit is well beyond.

I want to be honest about what this chart does *not* show. At 1 sequential query, Polars and Polarway are both at 10 QPS. Polarway isn't faster on an isolated task — it's faster when your load increases. This is a crucial point that "Rust vs Python" articles often omit: the speedup isn't free, it manifests under load. If you have a weekly backtest script that processes one file at a time, Polarway won't help you. If you have a real-time dashboard with 50 concurrent users querying a 500GB lakehouse, Polarway is a game changer. Understanding *when* a tool is useful is as important as understanding *how* it works.

6.4 WebSocket streaming

Metric Value
Latency (p50)0.8 ms
Latency (p99)1.2 ms
Throughput120k ticks/sec
Memory (1M tick window)500 MB (constant)

7. What's next: Distributed Mode & Bus

Polarway doesn't stop at a single-process runtime. Here's what we're building — and why each piece of the puzzle is essential.

Today's Polarway is a single gRPC server exploiting a multi-threaded Tokio runtime. That's already enough to crush asyncio on a single node. But our vision goes further: we want the same client code await client.batch_read(paths) to work transparently across a cluster of nodes, without the user knowing (or needing to know) how many machines are processing their data. That's the transition from $O(n/c)$ to $O(n/(c \cdot k))$ where $k$ is the number of nodes.

This isn't just an engineer's dream. It's a market necessity. Crypto datasets grow exponentially: tick-by-tick for 500+ pairs across 3+ exchanges, that's several hundred GB per month. Today, a single Polarway node handles our current load. But we see the curve, and we know that by 2027, a single node won't be enough. The question isn't *if* we'll need distributed mode — it's *when*. And we'd rather have the infrastructure ready before demand arrives, not after.

🌐 Distributed Mode

Automatic distribution of Tokio tasks across a cluster of nodes. A distributed JoinSet that scales horizontally, with the same client code.

🚌 Event Bus

A pub/sub event bus integrated into Polarway. Publish a DataFrame, subscribe from any node. Backpressure and exactly-once guarantees.

📊 Query Federation

Federated SQL queries across heterogeneous sources (Parquet, Delta Lake, DuckDB, PostgreSQL) via a single gRPC endpoint.

🧪 REPL interactif

A Python REPL connected to the Tokio runtime for interactive exploration of distributed DataFrames with auto-completion.

The Bus is perhaps the feature we're most excited about. Imagine a pipeline where one node publishes DataFrames of raw ticks, another subscribes to compute real-time indicators, and a third persists everything to Delta Lake — all with exactly-once guarantees and end-to-end backpressure. It's Kafka, but without Kafka: no ZooKeeper cluster, no topic management, no consumer groups to configure. Just Rust and distributed Tokio channels.

If this sounds ambitious, it's because it is. We don't claim the Bus will replace Kafka for LinkedIn-scale use cases. But for a 5-50 person team processing financial data in real-time that has neither the expertise nor the budget to administer a Kafka cluster (3 brokers + ZooKeeper + Schema Registry + monitoring = minimum 6 machines), the Bus offers 80% of the functionality for 10% of the operational complexity. It's the Unix philosophy: do one thing, do it well, and let people compose. Kafka is an operating system. The Bus is a tool.

A personal word to close this section. When I started working on Polarway 18 months ago, I didn't know how to write Rust. I came from 10 years of Python. The borrow checker made me cry for the first 3 weeks. I couldn't understand why the compiler refused code that was "obviously correct". Then, slowly, something shifted. I started seeing errors not as obstacles but as guides. The compiler wasn't telling me "you're wrong" — it was telling me "you might have a data race here, and I can't prove it safe". It's a collaborator, not an adversary. And the day I compiled the first WebSocket pipeline without a single unsafe and it ran without correction for 30 consecutive days in production — that day, I understood why people fall in love with Rust. Not for the syntax (it's brutalist). For the *confidence*. When it compiles, it works. Truly.

$$ \text{Polarway}_{v2} : \underbrace{T_{\text{tokio}}(n, c)}_{\text{single-node}} \to \underbrace{T_{\text{distributed}}(n, c, k)}_{\text{k nodes}} = O\!\left(\frac{n}{c \cdot k}\right) $$

The next level: linear scalability with the number of nodes.

Try Polarway

Async pipeline, Result/Option monads, reproducible benchmarks. Everything is open-source.

📖 Documentation 💻 GitHub