Chapter 1: Introduction
The Problem Space
Modern ML training at scale faces critical challenges:
GPU Failure Rates
- A 0.1% per-GPU failure rate can translate into roughly 10% throughput loss at tensor parallelism of degree 64 (TP64), because one failed GPU stalls its entire group (see the worked estimate after this list)
- Mean Time Between Failures (MTBF): 26-56 hours for large clusters
- Recovery requires a full job restart, losing hours of progress
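The amplification follows from the fact that a tensor-parallel group only makes progress while every one of its GPUs is healthy. With per-GPU failure probability $p$ and group size $n$:

$$P_{\text{group}} = 1 - (1 - p)^{n} = 1 - (0.999)^{64} \approx 6.2\%$$

Restart and resynchronization overhead plausibly accounts for the remaining gap to the ~10% figure above.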
Current Solutions Fall Short
- Checkpointing: 30-40 minutes of overhead per cycle, and work since the last checkpoint is still lost
- Redundancy: expensive, and does not cover software failures
- Static allocation: cannot adapt to changing resources
The BEAM Inspiration
The Erlang/BEAM runtime has powered systems with 99.999%+ uptime through:
Actor Model
- Lightweight processes with isolated state
- Communication only via message passing
- No shared memory, no locks
Preemptive Scheduling
- Reduction-based fair scheduling (see the sketch after this list)
- No process monopolizes the system
- Predictable latency
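A minimal sketch of the idea in Zig; `Process`, its `step` callback, and the budget constant are illustrative assumptions, not zbmd's scheduler:

```zig
// Hypothetical sketch: each process gets a fixed reduction budget per
// scheduling slice; exhausting it forces a preemption point.
const reduction_budget: u32 = 2000; // order-of-magnitude, BEAM-style

const Process = struct {
    id: u32,
    // One unit of work; returns false when the process has nothing left to do.
    step: *const fn (self: *Process) bool,
};

// Run one process until it yields, finishes, or burns its budget.
fn runSlice(proc: *Process) void {
    var reductions: u32 = 0;
    while (reductions < reduction_budget) : (reductions += 1) {
        if (!proc.step(proc)) return; // yielded or finished early
    }
    // Budget exhausted: the scheduler moves on, so no process
    // monopolizes a core and latency stays predictable.
}

fn schedule(run_queue: []*Process) void {
    for (run_queue) |proc| runSlice(proc); // fair round-robin over runnable work
}
```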
Supervision Trees
- "Let it crash" philosophy
- Automatic restart strategies (sketched after this list)
- Hierarchical failure isolation
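A sketch of what BEAM-style restart strategies could look like in Zig; `Supervisor`, `Child`, and the limits are illustrative, not zbmd's actual types:

```zig
// Hypothetical supervisor with the three classic restart strategies.
const RestartStrategy = enum {
    one_for_one,  // restart only the failed child
    one_for_all,  // restart every child when one fails
    rest_for_one, // restart the failed child and those started after it
};

const Child = struct {
    fn restart(self: *Child) void {
        _ = self; // placeholder: reinitialize actor state here
    }
};

const Supervisor = struct {
    strategy: RestartStrategy,
    max_restarts: u32, // past this limit, escalate to the parent supervisor
    restarts: u32 = 0,

    fn onChildCrash(self: *Supervisor, child_index: usize, children: []Child) void {
        self.restarts += 1;
        if (self.restarts > self.max_restarts) {
            // "Let it crash" applies to supervisors too: escalate upward.
            @panic("supervisor restart limit exceeded");
        }
        switch (self.strategy) {
            .one_for_one => children[child_index].restart(),
            .one_for_all => {
                for (children) |*c| c.restart();
            },
            .rest_for_one => {
                for (children[child_index..]) |*c| c.restart();
            },
        }
    }
};
```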
Hot Code Swapping
- Update running systems (see the sketch after this list)
- Zero downtime deployments
- Gradual rollout
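As a simplified sketch (assuming Zig 0.12+ `std.atomic.Value`), hot swapping can be modeled as atomically replacing a function pointer, so new invocations pick up the new version without stopping the system:

```zig
const std = @import("std");

// Hypothetical sketch: swap a kernel implementation by atomically replacing
// a function pointer. In-flight calls finish on the old version; new calls
// see the new one.
const KernelFn = *const fn (data: []f32) void;

fn kernelV1(data: []f32) void {
    for (data) |*x| x.* += 1.0;
}

fn kernelV2(data: []f32) void {
    for (data) |*x| x.* += 2.0; // "deployed" without a restart
}

var current_kernel = std.atomic.Value(KernelFn).init(&kernelV1);

pub fn main() void {
    var buf = [_]f32{ 1.0, 2.0, 3.0 };
    current_kernel.load(.acquire)(&buf); // run whatever version is live
    current_kernel.store(&kernelV2, .release); // zero-downtime swap
    current_kernel.load(.acquire)(&buf);
}
```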
zbmd Vision
Apply BEAM's proven principles to GPU/ML workloads:
Actors for Everything
```zig
pub const MLActor = union(enum) {
    tensor: TensorActor, // Data containers
    operator: OperatorActor, // Computations
    layer: LayerActor, // Model components
    optimizer: OptimizerActor, // Training logic
    supervisor: SupervisorActor, // Fault handling
};
```
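Dispatch over this union is an exhaustive `switch`. The sketch below assumes each actor type exposes an `onMessage` handler and a shared `Message` type, neither of which is defined above:

```zig
// Illustrative only: `Message` and the per-actor `onMessage` handlers are
// assumed, not part of the MLActor definition above.
fn handleMessage(actor: *MLActor, msg: Message) void {
    switch (actor.*) {
        .tensor => |*a| a.onMessage(msg),
        .operator => |*a| a.onMessage(msg),
        .layer => |*a| a.onMessage(msg),
        .optimizer => |*a| a.onMessage(msg),
        .supervisor => |*a| a.onMessage(msg),
    }
}
```

The compiler rejects any unhandled variant, so adding a new actor kind forces every dispatch site to be updated.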
GPU Kernels as Actors
- Each CUDA kernel is a supervised actor
- Automatic retry on failure
- Migration to healthy GPUs (see the sketch after this list)
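A sketch of the retry-then-migrate policy under assumed names (`KernelActor`, `runSupervised`); a real `launch` would call into CUDA:

```zig
// Hypothetical kernel actor: bounded retries on the current GPU, then
// migration to the next healthy device. Names and limits are illustrative.
const KernelActor = struct {
    device: u32,
    max_retries: u32 = 3,

    fn launch(self: *KernelActor) error{KernelFailed}!void {
        _ = self; // placeholder for the real (e.g. CUDA) kernel launch
    }

    fn runSupervised(self: *KernelActor, healthy_devices: []const u32) error{KernelFailed}!void {
        var attempt: u32 = 0;
        while (attempt < self.max_retries) : (attempt += 1) {
            self.launch() catch continue; // bounded retry on the same device
            return;
        }
        // Retries exhausted: migrate to a healthy GPU and try once more.
        for (healthy_devices) |d| {
            if (d != self.device) {
                self.device = d;
                return self.launch();
            }
        }
        return error.KernelFailed;
    }
};
```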
Distributed by Default
- Transparent multi-node operation
- Automatic work distribution
- Elastic scaling
Design Principles
1. Safety First (Tiger Style)
- No undefined behavior
- Fixed memory limits
- Bounded operations
- Fail fast with recovery (example after this list)
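A flavor of what this looks like in Zig; the constant and the queue layout are made up for illustration:

```zig
const std = @import("std");

// Illustrative Tiger Style: every limit fixed at comptime, every operation
// bounded, assertions on the invariants. This constant is an example only.
const max_message_bytes: u32 = 64 * 1024;

fn enqueueMessage(queue: []u8, used: *u32, msg: []const u8) void {
    // Fail fast: violating a limit is a bug to surface, not an error to handle.
    std.debug.assert(msg.len <= max_message_bytes);
    std.debug.assert(used.* + msg.len <= queue.len);
    @memcpy(queue[used.*..][0..msg.len], msg);
    used.* += @intCast(msg.len);
}
```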
2. Zero-Cost Abstractions
- Comptime dispatch for backends (sketched after this list)
- No runtime overhead
- Direct GPU memory access
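For example, a backend parameter resolved at comptime leaves no runtime branch behind. The backend names and `Tensor` shape here are assumptions, not the zbmd API:

```zig
// Sketch: the backend is a comptime parameter, so dispatch compiles away.
const Backend = enum { cuda, cpu };

fn Tensor(comptime backend: Backend) type {
    return struct {
        data: []f32,

        fn scale(self: @This(), k: f32) void {
            switch (backend) { // resolved at compile time: dead arms are pruned
                .cpu => {
                    for (self.data) |*x| x.* *= k;
                },
                .cuda => {
                    // a real build would launch a GPU kernel here;
                    // CPU math stands in so the sketch stays runnable
                    for (self.data) |*x| x.* *= k;
                },
            }
        }
    };
}
```

`Tensor(.cpu)` and `Tensor(.cuda)` are distinct types with the dispatch baked in, so the abstraction costs nothing at runtime.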
3. Test-Driven Development
- Tests alongside implementation
- Property-based testing
- Fault injection testing (see the test after this list)
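A minimal fault-injection test using Zig's built-in test runner, with an assumed failpoint flag rather than zbmd's actual test harness:

```zig
const std = @import("std");

// Assumed failpoint: tests arm it to force exactly one launch failure.
var fail_next_launch: bool = false;

fn launchKernel() error{Injected}!void {
    if (fail_next_launch) {
        fail_next_launch = false;
        return error.Injected;
    }
}

fn launchWithRetry() error{Injected}!void {
    launchKernel() catch {
        try launchKernel(); // one bounded retry
    };
}

test "recovers from a single injected kernel failure" {
    fail_next_launch = true;
    try launchWithRetry(); // first attempt fails, retry succeeds
    try std.testing.expect(!fail_next_launch);
}
```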
4. Domain-Driven Design
- Clear bounded contexts
- Actor boundaries match domains
- Message contracts as APIs (example after this list)
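Concretely, a message contract can be a tagged union: the variants an actor accepts are its entire API. The variants below are invented for illustration:

```zig
// Illustrative contract: an optimizer actor accepts only these messages,
// so the bounded context is enforced by the type system.
const OptimizerMsg = union(enum) {
    apply_gradients: struct { step: u64 },
    set_learning_rate: f32,
    save_checkpoint: void,
};
```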
Non-Goals
zbmd is NOT:
- A Python wrapper (pure Zig)
- A general actor framework (ML-focused)
- A distributed database (though it could be)
- A web framework (compute only)
Success Metrics
Performance
- < 1% overhead vs native CUDA
- Linear scaling to 10,000 GPUs
- Sub-millisecond failure detection
Reliability
- 99.999% uptime for training
- Zero data loss on failures
- Automatic recovery < 1 second
Usability
- Single binary deployment
- Zero configuration defaults
- Drop-in PyTorch replacement