
Why Ray? From Python Scripts to Distributed Clusters
Part 1 of an 8-part deep dive into Ray's architecture. How Ray transforms simple Python code into distributed execution, and why it succeeds where Celery, Spark, and other tools struggle.
I first encountered Ray at Mechademy, where we were building ML models for turbomachinery monitoring—not one model but thousands. Each turbine or compressor behaved differently, so a single model wouldn't cut it.
Scaling to hundreds of clients meant training and maintaining thousands of models. Our Python code worked for a handful of models, but it couldn't scale while keeping client data secure and training times short.
That's when we rebuilt our ML lifecycle around Ray, Dask for data processing, and MLflow for experiment tracking. It took real work to architect, but the payoff changed how we thought about the problem.
The "click" moment came when we trained an XGBoost model on data larger than any single machine's RAM. The dataset was distributed, the training parallel—and it just worked. No manual partitioning or orchestration code. Ray and Dask handled the distribution; we focused on the ML.
And we weren't alone. Teams across industries—retail, finance, design—were hitting the same wall.
Instacart, for example, trained thousands of regional ML models. Their Celery setup ran at 10–15% CPU utilization, with training runs taking 4 hours. After moving to Ray, CPU utilization hit 80% and runs dropped to 20 minutes—same hardware, same models, better execution.
Coinbase saw similar gains: their 120-minute data-transform pipeline fell to 15 minutes, letting them run 15× more ML jobs at the same cost.
Traditional tools like Celery, Spark, or single-machine workflows worked fine at small scale but broke down when you needed to:
- Train many models in parallel
- Handle data larger than a single machine's memory
- Iterate quickly on experiments
- Actually use compute efficiently
Why did these tools all hit the same wall—and what was Ray doing differently?
Let's look at Ray in action, then explore why it works where others stumble.
A Pattern Emerges
Across companies, the story repeated: different domains, same scaling pain.
Coinbase had a monolithic ML platform—slow pipelines, redundant hyperparameter runs, hours-long iteration cycles.
Uber needed massive hyperparameter searches but couldn't dynamically allocate heterogeneous compute.
Canva trained hundreds of ML models, yet single-machine limits kept GPU utilization low and training expensive.
What They Tried Before Ray
Most teams tried familiar tools:
Celery — mature, Python-native, great for async queues—but not built for compute-heavy work. Workers were overprovisioned, often 10–15% busy.
Spark — excellent for ETL and batch analytics, but its JVM base and batch execution model made iterative ML training awkward.
Single-machine multiprocessing — fine for prototypes, but bounded by one machine's RAM and manual coordination.
Ad-hoc distributed systems — possible, if you wanted to maintain your own distributed infrastructure.
Why They Fell Short
None of these tools were bad—they just optimized for narrow workloads. Celery excelled at queues, Spark at DataFrames, multiprocessing at local parallelism.
ML pipelines need all of it:
- Fine-grained parallelism for many small models
- Stateful services for custom loops or serving
- Efficient data sharing
- Dynamic execution graphs
- Heterogeneous resources (CPU, GPU, memory)
Existing frameworks made you pick one dimension. Modern ML needs them all.
So what makes Ray work where others fail? Let's see with a quick example.
Enter Ray: A Simple Example
Before diving into architecture, let's look at a simple parallel preprocessing example.
The Scenario
You have eight datasets to process: for each one, load the data, compute an SVD, and run some light preprocessing. Each takes about 1.25 seconds; run sequentially, that's around 10 seconds total.
Sequential Python
import time
import numpy as np

def extract_features(seed):
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((800, 800))
    _ = np.linalg.svd(data, full_matrices=False)
    time.sleep(0.25)
    return True

start = time.perf_counter()
for i in range(8):
    extract_features(i)
print(f"Time: {time.perf_counter() - start:.2f}s")
# Output: ~10 seconds

The Ray Version
import time
import numpy as np
import ray

@ray.remote
def extract_features(seed):
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((800, 800))
    _ = np.linalg.svd(data, full_matrices=False)
    time.sleep(0.25)
    return True

ray.init()
start = time.perf_counter()
futures = [extract_features.remote(i) for i in range(8)]
ray.get(futures)
print(f"Time: {time.perf_counter() - start:.2f}s")
ray.shutdown()
# Output: ~2.5 seconds on 4 cores

A 4-core laptop finishes in about 2.5 seconds—a 4× speedup.
What Just Happened?
Three simple changes make this work:
- @ray.remote marks the function as executable on any worker; the logic itself stays identical
- .remote() schedules execution asynchronously and returns a future (a promise of a result)
- ray.get() blocks until all tasks complete and returns the actual results
Ray starts worker processes on each core, distributes tasks, and tracks results—no manual thread management, process pools, or synchronization code needed.
Why This Works
When you call ray.init(), Ray starts worker processes (one per CPU core by default). Each .remote() call submits a task to Ray's scheduler, which distributes work across available workers.
In our example:
- Sequential version: 8 tasks × 1.25 seconds each = 10 seconds
- Ray version: 8 tasks across 4 workers = ~2.5 seconds (10 ÷ 4)
Your 4 cores finally work in parallel—no threads, pools, or locks needed.
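If you want to see or cap that parallelism yourself, you can pass an explicit CPU budget to ray.init() and inspect the resources Ray detected. A minimal sketch (the num_cpus value is just for illustration):

import ray

# Start Ray with an explicit CPU budget instead of the default
# (one worker process per detected core).
ray.init(num_cpus=4)

# Inspect what the local "cluster" offers; the CPU entry reflects
# the budget requested above.
print(ray.cluster_resources())    # e.g. {'CPU': 4.0, 'memory': ..., ...}
print(ray.available_resources())  # same view, minus resources currently in use

ray.shutdown()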
Why It Matters
You didn't rewrite your code; you just annotated it. That same pattern scales to:
- Hundreds of parallel model training jobs
- Hyperparameter sweeps across configurations
- Processing datasets larger than memory
- Building stateful distributed services
Now, let's see why Ray can do this when others can't.
Why Ray Is Different
If distributed execution can be that simple, why weren't Celery or Spark doing it?
The Design Insight
Around 2018, the landscape looked like this:
- Spark dominated ETL but forced DataFrame thinking
- Dask gave Python parallelism but lacked support for stateful workloads
- Celery handled queues but wasted compute
Each solved one dimension—task, data, or state—not all three.
ML pipelines, however, blur those lines: preprocess in parallel, train on distributed data, serve stateful models, orchestrate thousands of trials.
Ray's idea was bold yet simple: one framework unifying tasks, data, and state.
Three Key Innovations
1. Unified Task + Actor Model
Tasks are stateless functions; actors are stateful Python classes.
@ray.remote
class ModelServer:
    def __init__(self, path):
        self.model = load_model(path)

    def predict(self, data):
        return self.model.predict(data)

server = ModelServer.remote("model.pkl")
result = ray.get(server.predict.remote(new_data))

Actors stay alive, hold state, and can launch tasks—enabling reinforcement learning agents, hyperparameter coordinators, or custom training loops within one framework.
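To make "stateful" concrete, here is a toy sketch; the Counter class is purely illustrative and not from any of the case studies above. Its internal count survives across calls because the actor lives in a single long-running worker process:

import ray

ray.init()

@ray.remote
class Counter:
    def __init__(self):
        self.count = 0  # kept in the actor's worker process between calls

    def increment(self):
        self.count += 1
        return self.count

counter = Counter.remote()
# Calls from a single caller execute in submission order on the actor,
# so the state accumulates: prints [1, 2, 3]
print(ray.get([counter.increment.remote() for _ in range(3)]))

ray.shutdown()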
2. Python-Native Runtime
Unlike Spark's JVM bridge, Ray is built for Python. Your NumPy arrays and PyTorch tensors stay native, avoiding serialization overhead and Python-to-JVM translation penalties.
When Coinbase moved from Spark to Ray, that 8× speedup wasn't just better scheduling—it was eliminating the constant Python/JVM translation layer.
3. Fine-Grained, Flexible Scheduling
Ray's distributed scheduler handles millions of tasks per second. You control parallelism—spawn 10,000 small tasks, decide next steps dynamically, or assign custom resource needs per task ("2 CPUs here, 1 GPU there").
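As a rough sketch of what that per-task control looks like (function names here are illustrative; the GPU task is only defined, since it would wait for a GPU-equipped node before running):

import ray

ray.init()

# Ask the scheduler for two CPUs; the task is only placed where two are free.
@ray.remote(num_cpus=2)
def heavy_preprocess(batch):
    return len(batch)

# A GPU requirement is declared the same way; this task runs only on a node
# that actually has a GPU available.
@ray.remote(num_gpus=1)
def train_on_gpu(config):
    return config

# Resource requests can also be overridden per call with .options().
future = heavy_preprocess.options(num_cpus=1).remote(list(range(1000)))
print(ray.get(future))

ray.shutdown()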
This flexibility let teams like Uber build sophisticated systems like Autotune without fighting the framework.
The Results
Remember those metrics from earlier?
- Instacart: 10% → 80% CPU utilization
- Coinbase: 8× faster transforms, 15× more jobs
- Canva: 12× faster iteration cycles
These aren't mere speedups—they're evidence of an architecture built for ML from the ground up.
When you write @ray.remote, you're tapping into a runtime built for the dynamic reality of ML engineering.
So what happens under the hood when you call ray.init()?
How Ray Does It (Architecture Preview)
Calling ray.init() launches a distributed system, not just a library. Three core components coordinate to schedule tasks and manage data.
1. Raylet (The Node Manager)
Every machine runs a Raylet process that handles:
- Task scheduling on the node
- Resource tracking (CPU, GPU, memory)
- Worker process management
Each .remote() call becomes a task the Raylet assigns to a worker.
2. Global Control Store (GCS) — The Brain
The GCS maintains the cluster-wide view: alive nodes, object locations, task metadata, and resources. When data is needed across nodes, the GCS directs the transfer.
3. Object Store (Shared Memory)
Each node hosts an object store in shared memory. Task results go there; subsequent tasks read by reference without re-serialization. One 10GB array, many readers, no copies.
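A minimal sketch of that sharing pattern, scaled down to roughly 100 MB so it runs comfortably on a laptop:

import numpy as np
import ray

ray.init()

# Put a large array into the node's shared-memory object store once.
big_array = np.zeros(25_000_000, dtype=np.float32)  # ~100 MB
array_ref = ray.put(big_array)

@ray.remote
def array_sum(arr):
    # Workers on the same node read the array zero-copy from shared memory.
    return float(arr.sum())

# Eight tasks share one copy of the data; only the small ObjectRef is passed.
print(ray.get([array_sum.remote(array_ref) for _ in range(8)]))

ray.shutdown()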
How It All Flows
1. ray.init()
→ starts Raylet + Object Store, connects to GCS
2. extract_features.remote(0)
→ scheduled on worker, result stored in Object Store
3. ray.get()
→ fetches from local or remote store
Everything happens transparently. You write simple Python; Ray handles execution and data placement.
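If you want to see the handle itself, here is a tiny sketch; square() is just a stand-in function. The .remote() call returns an ObjectRef immediately, and ray.get() resolves it from the object store:

import ray

ray.init()

@ray.remote
def square(x):
    return x * x

ref = square.remote(3)  # returns immediately with an ObjectRef (a handle, not the value)
print(ref)
print(ray.get(ref))     # 9, fetched from the object store

ray.shutdown()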
Why It Scales
- Fast dispatch: Distributed scheduling achieves million-task-per-second rates
- Efficient data sharing: Shared memory minimizes copies
- Fault tolerance: GCS knows what failed and re-runs tasks
- Flexible execution: Tasks and actors are scheduling entries tagged with resource requirements, allowing smart placement decisions
What's Next
This is the 30,000-foot view. In Part 2, we'll trace ray.init() line by line—how Ray boots its runtime, registers nodes, and prepares to execute tasks.
The series roadmap:
- Part 3: Task and actor serialization & execution
- Part 4: Scheduling & resource management
- Part 5: Object store and data transfer
- Part 6: GCS and cluster coordination
- Part 7: Fault tolerance and recovery
- Part 8: Production patterns and best practices
Before continuing, try the example yourself:
pip install ray numpy
python ray_example.py

Watch your CPU cores light up and see parallelism in action. Use htop or Activity Monitor to visualize all cores working simultaneously—it's surprisingly satisfying.
This series shares what I've learned working on ML infrastructure at Mechademy, where we use Ray to train thousands of custom turbomachinery models. Part 2 coming soon.