Distributed Systems Posts
8 posts in this category

Production Ray: Patterns, Mistakes, and Lessons
The final part of our Ray deep dive: five production patterns that work, five mistakes that hurt, and a debugging playbook for distributed ML systems.

When Nodes Die: Ray's Fault Tolerance
How Ray detects failures, retries tasks, recovers actors, and reconstructs lost objects—plus production patterns for building resilient distributed ML pipelines.

Ray's GCS: The Brain Behind the Cluster
Deep dive into Ray's Global Control Store—how centralized metadata coordination powers a distributed system without becoming a bottleneck.

Shared Memory, Zero-Copy, and the Object Store
Deep dive into Ray's object store architecture: how shared memory enables zero-copy transfers, how the Plasma store is implemented, and patterns for efficient distributed data handling.

Scheduling and Resource Management
Understanding Ray's intelligent scheduler—from hybrid scheduling and resource control to coordination tuning and data locality optimization.

Tasks, Actors, and the Execution Model
Understanding Ray's two execution models—stateless tasks for parallel work and stateful actors for coordination—and when to use each.

Inside Ray: What Happens When You Hit Start
Ray isn't just an API; it runs like a distributed company with executives, managers, and workers. Learn the runtime components (GCS, raylet, object store) and practical debugging steps for Kubernetes-based Ray clusters.

Why Ray? From Python Scripts to Distributed Clusters
Part 1 of an 8-part deep dive into Ray's architecture. How Ray transforms simple Python code into distributed execution, and why it succeeds where Celery, Spark, and other tools struggle.