Akshay's Expedition Logs
Deep dives into distributed systems, ML infrastructure, and the architectures that make AI at scale actually work
Recent Posts

Distributed Systems
•2026-03-15
Production Ray: Patterns, Mistakes, and Lessons
The final part of our Ray deep dive: five production patterns that work, five mistakes that hurt, and a debugging playbook for distributed ML systems.

Distributed Systems
•2026-03-06
When Nodes Die: Ray's Fault Tolerance
How Ray detects failures, retries tasks, recovers actors, and reconstructs lost objects, plus production patterns for building resilient distributed ML pipelines.

Distributed Systems
•2026-03-01
Ray's GCS: The Brain Behind the Cluster
A deep dive into Ray's Global Control Store: how centralized metadata coordination powers a distributed system without becoming a bottleneck.