Chapter 1: Introduction

The Problem Space

Modern ML training at scale faces critical challenges:

GPU Failure Rates

  • A 0.1% per-GPU failure rate → roughly 10% throughput loss under 64-way tensor parallelism (TP64), because one failed GPU stalls its entire group (see the sketch after this list)
  • Mean Time Between Failures (MTBF): 26-56 hours for large clusters
  • Recovery requires full job restart, losing hours of progress
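
To see where those numbers come from: a back-of-the-envelope sketch, assuming independent failures, of the chance that a 64-GPU tensor-parallel group is stalled at any given moment. The gap between ~6% and the quoted 10% is presumably restart and replay overhead.

const std = @import("std");

pub fn main() void {
    // P(group stalled) = 1 - (1 - p)^64 for per-GPU failure probability p.
    const p_fail: f64 = 0.001; // 0.1% per-GPU failure rate
    const group_size: f64 = 64; // TP64: all 64 GPUs must be healthy
    const p_stalled = 1.0 - std.math.pow(f64, 1.0 - p_fail, group_size);
    std.debug.print("P(group stalled) = {d:.3}\n", .{p_stalled}); // ~0.062
}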

Current Solutions Fall Short

  • Checkpointing: 30-40 minute overhead, still loses recent work
  • Redundancy: Expensive, doesn't handle software failures
  • Static allocation: Can't adapt to changing resources

The BEAM Inspiration

Systems built on the Erlang/BEAM runtime famously achieve 99.999% uptime through:

Actor Model

  • Lightweight processes with isolated state
  • Communication only via message passing (see the sketch after this list)
  • No shared memory, no locks
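
A minimal sketch of these three properties with a hypothetical CounterActor (illustrative only, not zbmd API; std usage follows roughly Zig 0.13). State lives inside the actor, and the mailbox is the only way to affect it:

const std = @import("std");

// The message type is the actor's entire public surface.
const Msg = union(enum) {
    increment: u32,
    reset,
};

const CounterActor = struct {
    count: u64 = 0, // isolated state: no other code holds a pointer to it
    mailbox: std.ArrayList(Msg),

    fn handle(self: *CounterActor, msg: Msg) void {
        switch (msg) {
            .increment => |n| self.count += n,
            .reset => self.count = 0,
        }
    }

    // Drain the mailbox; this is the only place state changes.
    fn step(self: *CounterActor) void {
        for (self.mailbox.items) |msg| self.handle(msg);
        self.mailbox.clearRetainingCapacity();
    }
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    var actor = CounterActor{ .mailbox = std.ArrayList(Msg).init(gpa.allocator()) };
    defer actor.mailbox.deinit();
    try actor.mailbox.append(.{ .increment = 3 }); // "send" two messages
    try actor.mailbox.append(.reset);
    actor.step();
    std.debug.print("count = {}\n", .{actor.count}); // 0: reset arrived last
}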

Preemptive Scheduling

  • Reduction-based fair scheduling (sketched after this list)
  • No process monopolizes the system
  • Predictable latency
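
A sketch of the reduction idea, reusing the hypothetical CounterActor from the previous sketch; the budget value is arbitrary:

// Each actor gets a fixed budget of "reductions" (work units) per turn,
// so no single actor can monopolize a scheduler thread.
const reduction_budget: u32 = 2000;

fn runTurn(actor: *CounterActor) void {
    var reductions = reduction_budget;
    while (reductions > 0 and actor.mailbox.items.len > 0) {
        actor.handle(actor.mailbox.orderedRemove(0));
        reductions -= 1; // handling one message costs one reduction
    }
    // Budget spent or mailbox empty: control returns to the scheduler,
    // which picks the next runnable actor, keeping latency predictable.
}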

Supervision Trees

  • "Let it crash" philosophy
  • Automatic restart strategies
  • Hierarchical failure isolation
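
A sketch of both ideas; the strategy names mirror OTP's, and supervise is a hypothetical helper:

const RestartStrategy = enum {
    one_for_one, // restart only the failed child
    one_for_all, // restart every child when one fails
    rest_for_one, // restart the failed child and those started after it
};

// "Let it crash": the child does no defensive error handling; the
// supervisor observes the failure and restarts from a clean state.
fn supervise(comptime runChild: fn () anyerror!void, max_restarts: u32) !void {
    var restarts: u32 = 0;
    while (true) {
        runChild() catch {
            restarts += 1;
            if (restarts > max_restarts) return error.SupervisorGivesUp;
            continue; // restart with fresh state; failure is contained here
        };
        return; // child exited normally
    }
}

test "supervisor restarts a flaky child" {
    const Flaky = struct {
        var failures_left: u32 = 2;
        fn run() anyerror!void {
            if (failures_left > 0) {
                failures_left -= 1;
                return error.Crash; // let it crash
            }
        }
    };
    try supervise(Flaky.run, 5);
}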

Hot Code Swapping

  • Update running systems
  • Zero-downtime deployments (see the sketch after this list)
  • Gradual rollout
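
Ahead-of-time-compiled Zig cannot patch a running image the way BEAM does, so one plausible approximation is an atomic switch between compiled implementations: in-flight calls keep the version they loaded, new calls see the new one. A sketch with hypothetical handler names:

const std = @import("std");

fn handleV1(x: u64) u64 {
    return x + 1;
}

fn handleV2(x: u64) u64 {
    return x + 2; // the "new deployment"
}

const versions = [_]*const fn (u64) u64{ handleV1, handleV2 };
var active = std.atomic.Value(usize).init(0);

pub fn main() void {
    const before = versions[active.load(.acquire)](10); // 11: old version
    active.store(1, .release); // "deploy" v2 without stopping the process
    const after = versions[active.load(.acquire)](10); // 12: new version
    std.debug.print("{} -> {}\n", .{ before, after });
}

Gradual rollout then amounts to flipping the index for a fraction of callers at a time.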

zbmd Vision

Apply BEAM's proven principles to GPU/ML workloads:

Actors for Everything

pub const MLActor = union(enum) {
    tensor: TensorActor,        // Data containers
    operator: OperatorActor,    // Computations
    layer: LayerActor,          // Model components
    optimizer: OptimizerActor,  // Training logic
    supervisor: SupervisorActor, // Fault handling
};
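
One payoff of a single union: a scheduler can dispatch over every actor kind with one exhaustive switch. A sketch that assumes each payload type exposes a step method (an assumption of this sketch, not confirmed zbmd API):

// `inline else` expands to one prong per union member at compile time,
// so dispatch compiles to a direct call per kind: no vtable, no overhead.
fn stepActor(actor: *MLActor) void {
    switch (actor.*) {
        inline else => |*a| a.step(),
    }
}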

GPU Kernels as Actors

  • Each CUDA kernel is a supervised actor
  • Automatic retry on failure
  • Migration to healthy GPUs (sketched below)
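
A sketch of that retry-and-migrate policy; Device and the launch callback are hypothetical placeholders rather than real CUDA bindings:

const std = @import("std");

const Device = struct { id: u32, healthy: bool = true };

// Try the kernel on the first healthy device; on failure, quarantine
// that device and migrate to the next one.
fn launchSupervised(
    devices: []Device,
    comptime launch: fn (u32) anyerror!void,
    max_attempts: u32,
) !void {
    var attempts: u32 = 0;
    var i: usize = 0;
    while (attempts < max_attempts) : (attempts += 1) {
        while (i < devices.len and !devices[i].healthy) i += 1;
        if (i >= devices.len) return error.NoHealthyDevice;
        launch(devices[i].id) catch {
            devices[i].healthy = false; // quarantine the failed GPU
            continue; // retry on the next healthy device
        };
        return; // kernel ran successfully
    }
    return error.RetryBudgetExhausted;
}

test "migrates off a failing device" {
    var devices = [_]Device{ .{ .id = 0 }, .{ .id = 1 } };
    const Kernel = struct {
        fn launch(device_id: u32) anyerror!void {
            if (device_id == 0) return error.EccError; // device 0 is bad
        }
    };
    try launchSupervised(&devices, Kernel.launch, 3);
    try std.testing.expect(!devices[0].healthy and devices[1].healthy);
}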

Distributed by Default

  • Transparent multi-node operation
  • Automatic work distribution
  • Elastic scaling

Design Principles

1. Safety First (Tiger Style)

  • No undefined behavior
  • Fixed memory limits (see the sketch after this list)
  • Bounded operations
  • Fail fast with recovery
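
Fixed memory limits, for example, fall naturally out of Zig's allocator model; a sketch using std.heap.FixedBufferAllocator:

const std = @import("std");

pub fn main() !void {
    var buffer: [64 * 1024]u8 = undefined; // hard 64 KiB budget, set up front
    var fba = std.heap.FixedBufferAllocator.init(&buffer);
    const alloc = fba.allocator();

    const small = try alloc.alloc(f32, 1024); // 4 KiB: fits the budget
    defer alloc.free(small);

    // Exceeding the budget fails fast with a recoverable error,
    // instead of growing unboundedly and dying later.
    try std.testing.expectError(error.OutOfMemory, alloc.alloc(f32, 1024 * 1024));
}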

2. Zero-Cost Abstractions

  • Comptime dispatch for backends (sketched after this list)
  • No runtime overhead
  • Direct GPU memory access
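
A sketch of comptime backend dispatch; the Backend enum and function names are hypothetical:

const std = @import("std");

const Backend = enum { cpu, cuda };

// `backend` is comptime-known, so the switch is resolved during
// compilation: only the chosen prong is emitted, and the call is direct.
fn addInPlace(comptime backend: Backend, dst: []f32, src: []const f32) void {
    switch (backend) {
        .cpu => for (dst, src) |*d, s| {
            d.* += s;
        },
        .cuda => cudaAddStub(dst, src),
    }
}

fn cudaAddStub(dst: []f32, src: []const f32) void {
    // Placeholder: a real backend would enqueue a CUDA kernel here.
    for (dst, src) |*d, s| d.* += s;
}

test "cpu backend adds elementwise" {
    var dst = [_]f32{ 1, 2 };
    const src = [_]f32{ 3, 4 };
    addInPlace(.cpu, &dst, &src);
    try std.testing.expect(dst[0] == 4 and dst[1] == 6);
}

Selecting .cuda from a build option swaps implementations with no runtime branch.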

3. Test-Driven Development

  • Tests alongside implementation
  • Property-based testing
  • Fault injection testing (sketched below)
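
Zig's std.testing.FailingAllocator makes fault-injection tests cheap (its init signature has shifted slightly across Zig versions; this sketch follows the config-struct form):

const std = @import("std");

test "code under test survives allocation failure" {
    // Force the very first allocation to fail, simulating memory pressure.
    var failing = std.testing.FailingAllocator.init(
        std.testing.allocator,
        .{ .fail_index = 0 },
    );
    const alloc = failing.allocator();

    // Whatever is under test must turn the injected fault into a clean error.
    try std.testing.expectError(error.OutOfMemory, alloc.alloc(u8, 16));
}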

4. Domain-Driven Design

  • Clear bounded contexts
  • Actor boundaries match domains
  • Message contracts as APIs

Non-Goals

zbmd is NOT:

  • A Python wrapper (pure Zig)
  • A general actor framework (ML-focused)
  • A distributed database (though it could be)
  • A web framework (compute only)

Success Metrics

Performance

  • < 1% overhead vs native CUDA
  • Linear scaling to 10,000 GPUs
  • Sub-millisecond failure detection

Reliability

  • 99.999% uptime for training
  • Zero data loss on failures
  • Automatic recovery in < 1 second

Usability

  • Single binary deployment
  • Zero configuration defaults
  • Drop-in PyTorch replacement