Chapter 1: Introduction
The Problem Space
Modern ML training at scale faces critical challenges:
GPU Failure Rates
- A 0.1% per-GPU failure rate can translate into roughly 10% throughput loss at tensor parallelism of degree 64 (TP64), because one failed GPU stalls its entire group (see the worked estimate after this list)
- Mean Time Between Failures (MTBF): 26-56 hours for large clusters
- Recovery requires a full job restart, losing hours of progress
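The amplification follows from the fact that a tensor-parallel group only makes progress while every one of its GPUs is healthy. With per-GPU failure probability $p$ and group size $n$:

$$P_{\text{group}} = 1 - (1 - p)^{n} = 1 - (0.999)^{64} \approx 6.2\%$$

Restart and resynchronization overhead plausibly accounts for the remaining gap to the ~10% figure above.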
Current Solutions Fall Short
- Checkpointing: 30-40 minutes of overhead per cycle, and work since the last checkpoint is still lost
- Redundancy: expensive, and does not cover software failures
- Static allocation: cannot adapt to changing resources
The BEAM Inspiration
The Erlang/BEAM runtime has powered systems with 99.999%+ uptime through:
Actor Model
- Lightweight processes with isolated state
- Communication only via message passing
- No shared memory, no locks
Preemptive Scheduling
- Reduction-based fair scheduling (see the sketch after this list)
- No process monopolizes the system
- Predictable latency
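A minimal sketch of the idea in Zig; `Process`, its `step` callback, and the budget constant are illustrative assumptions, not zbmd's scheduler:

```zig
// Hypothetical sketch: each process gets a fixed reduction budget per
// scheduling slice; exhausting it forces a preemption point.
const reduction_budget: u32 = 2000; // order-of-magnitude, BEAM-style

const Process = struct {
    id: u32,
    // One unit of work; returns false when the process has nothing left to do.
    step: *const fn (self: *Process) bool,
};

// Run one process until it yields, finishes, or burns its budget.
fn runSlice(proc: *Process) void {
    var reductions: u32 = 0;
    while (reductions < reduction_budget) : (reductions += 1) {
        if (!proc.step(proc)) return; // yielded or finished early
    }
    // Budget exhausted: the scheduler moves on, so no process
    // monopolizes a core and latency stays predictable.
}

fn schedule(run_queue: []*Process) void {
    for (run_queue) |proc| runSlice(proc); // fair round-robin over runnable work
}
```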
Supervision Trees
- "Let it crash" philosophy
- Automatic restart strategies (sketched after this list)
- Hierarchical failure isolation
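A sketch of what BEAM-style restart strategies could look like in Zig; `Supervisor`, `Child`, and the limits are illustrative, not zbmd's actual types:

```zig
// Hypothetical supervisor with the three classic restart strategies.
const RestartStrategy = enum {
    one_for_one,  // restart only the failed child
    one_for_all,  // restart every child when one fails
    rest_for_one, // restart the failed child and those started after it
};

const Child = struct {
    fn restart(self: *Child) void {
        _ = self; // placeholder: reinitialize actor state here
    }
};

const Supervisor = struct {
    strategy: RestartStrategy,
    max_restarts: u32, // past this limit, escalate to the parent supervisor
    restarts: u32 = 0,

    fn onChildCrash(self: *Supervisor, child_index: usize, children: []Child) void {
        self.restarts += 1;
        if (self.restarts > self.max_restarts) {
            // "Let it crash" applies to supervisors too: escalate upward.
            @panic("supervisor restart limit exceeded");
        }
        switch (self.strategy) {
            .one_for_one => children[child_index].restart(),
            .one_for_all => {
                for (children) |*c| c.restart();
            },
            .rest_for_one => {
                for (children[child_index..]) |*c| c.restart();
            },
        }
    }
};
```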
Hot Code Swapping
- Update running systems (see the sketch after this list)
- Zero downtime deployments
- Gradual rollout
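As a simplified sketch (assuming Zig 0.12+ `std.atomic.Value`), hot swapping can be modeled as atomically replacing a function pointer, so new invocations pick up the new version without stopping the system:

```zig
const std = @import("std");

// Hypothetical sketch: swap a kernel implementation by atomically replacing
// a function pointer. In-flight calls finish on the old version; new calls
// see the new one.
const KernelFn = *const fn (data: []f32) void;

fn kernelV1(data: []f32) void {
    for (data) |*x| x.* += 1.0;
}

fn kernelV2(data: []f32) void {
    for (data) |*x| x.* += 2.0; // "deployed" without a restart
}

var current_kernel = std.atomic.Value(KernelFn).init(&kernelV1);

pub fn main() void {
    var buf = [_]f32{ 1.0, 2.0, 3.0 };
    current_kernel.load(.acquire)(&buf); // run whatever version is live
    current_kernel.store(&kernelV2, .release); // zero-downtime swap
    current_kernel.load(.acquire)(&buf);
}
```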
zbmd Vision
Apply BEAM's proven principles to GPU/ML workloads:
Actors for Everything
```zig
pub const MLActor = union(enum) {
    tensor: TensorActor, // Data containers
    operator: OperatorActor, // Computations
    layer: LayerActor, // Model components
    optimizer: OptimizerActor, // Training logic
    supervisor: SupervisorActor, // Fault handling
};
```
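Dispatch over this union is an exhaustive `switch`. The sketch below assumes each actor type exposes an `onMessage` handler and a shared `Message` type, neither of which is defined above:

```zig
// Illustrative only: `Message` and the per-actor `onMessage` handlers are
// assumed, not part of the MLActor definition above.
fn handleMessage(actor: *MLActor, msg: Message) void {
    switch (actor.*) {
        .tensor => |*a| a.onMessage(msg),
        .operator => |*a| a.onMessage(msg),
        .layer => |*a| a.onMessage(msg),
        .optimizer => |*a| a.onMessage(msg),
        .supervisor => |*a| a.onMessage(msg),
    }
}
```

The compiler rejects any unhandled variant, so adding a new actor kind forces every dispatch site to be updated.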
GPU Kernels as Actors
- Each CUDA kernel is a supervised actor
- Automatic retry on failure
- Migration to healthy GPUs (see the sketch after this list)
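A sketch of the retry-then-migrate policy under assumed names (`KernelActor`, `runSupervised`); a real `launch` would call into CUDA:

```zig
// Hypothetical kernel actor: bounded retries on the current GPU, then
// migration to the next healthy device. Names and limits are illustrative.
const KernelActor = struct {
    device: u32,
    max_retries: u32 = 3,

    fn launch(self: *KernelActor) error{KernelFailed}!void {
        _ = self; // placeholder for the real (e.g. CUDA) kernel launch
    }

    fn runSupervised(self: *KernelActor, healthy_devices: []const u32) error{KernelFailed}!void {
        var attempt: u32 = 0;
        while (attempt < self.max_retries) : (attempt += 1) {
            self.launch() catch continue; // bounded retry on the same device
            return;
        }
        // Retries exhausted: migrate to a healthy GPU and try once more.
        for (healthy_devices) |d| {
            if (d != self.device) {
                self.device = d;
                return self.launch();
            }
        }
        return error.KernelFailed;
    }
};
```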
Distributed by Default
- Transparent multi-node operation
- Automatic work distribution
- Elastic scaling
Design Principles
1. Safety First (Tiger Style)
- No undefined behavior
- Fixed memory limits
- Bounded operations
- Fail fast with recovery (example after this list)
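A flavor of what this looks like in Zig; the constant and the queue layout are made up for illustration:

```zig
const std = @import("std");

// Illustrative Tiger Style: every limit fixed at comptime, every operation
// bounded, assertions on the invariants. This constant is an example only.
const max_message_bytes: u32 = 64 * 1024;

fn enqueueMessage(queue: []u8, used: *u32, msg: []const u8) void {
    // Fail fast: violating a limit is a bug to surface, not an error to handle.
    std.debug.assert(msg.len <= max_message_bytes);
    std.debug.assert(used.* + msg.len <= queue.len);
    @memcpy(queue[used.*..][0..msg.len], msg);
    used.* += @intCast(msg.len);
}
```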
2. Zero-Cost Abstractions
- Comptime dispatch for backends (sketched after this list)
- No runtime overhead
- Direct GPU memory access
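For example, a backend parameter resolved at comptime leaves no runtime branch behind. The backend names and `Tensor` shape here are assumptions, not the zbmd API:

```zig
// Sketch: the backend is a comptime parameter, so dispatch compiles away.
const Backend = enum { cuda, cpu };

fn Tensor(comptime backend: Backend) type {
    return struct {
        data: []f32,

        fn scale(self: @This(), k: f32) void {
            switch (backend) { // resolved at compile time: dead arms are pruned
                .cpu => {
                    for (self.data) |*x| x.* *= k;
                },
                .cuda => {
                    // a real build would launch a GPU kernel here;
                    // CPU math stands in so the sketch stays runnable
                    for (self.data) |*x| x.* *= k;
                },
            }
        }
    };
}
```

`Tensor(.cpu)` and `Tensor(.cuda)` are distinct types with the dispatch baked in, so the abstraction costs nothing at runtime.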
3. Test-Driven Development
- Tests alongside implementation
- Property-based testing
- Fault injection testing (see the test after this list)
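A minimal fault-injection test using Zig's built-in test runner, with an assumed failpoint flag rather than zbmd's actual test harness:

```zig
const std = @import("std");

// Assumed failpoint: tests arm it to force exactly one launch failure.
var fail_next_launch: bool = false;

fn launchKernel() error{Injected}!void {
    if (fail_next_launch) {
        fail_next_launch = false;
        return error.Injected;
    }
}

fn launchWithRetry() error{Injected}!void {
    launchKernel() catch {
        try launchKernel(); // one bounded retry
    };
}

test "recovers from a single injected kernel failure" {
    fail_next_launch = true;
    try launchWithRetry(); // first attempt fails, retry succeeds
    try std.testing.expect(!fail_next_launch);
}
```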
4. Domain-Driven Design
- Clear bounded contexts
- Actor boundaries match domains
- Message contracts as APIs (example after this list)
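Concretely, a message contract can be a tagged union: the variants an actor accepts are its entire API. The variants below are invented for illustration:

```zig
// Illustrative contract: an optimizer actor accepts only these messages,
// so the bounded context is enforced by the type system.
const OptimizerMsg = union(enum) {
    apply_gradients: struct { step: u64 },
    set_learning_rate: f32,
    save_checkpoint: void,
};
```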
Non-Goals
zbmd is NOT:
- A Python wrapper (pure Zig)
- A general actor framework (ML-focused)
- A distributed database (though it could be)
- A web framework (compute only)
Success Metrics
Performance
- < 1% overhead vs native CUDA
- Linear scaling to 10,000 GPUs
- Sub-millisecond failure detection
Reliability
- 99.999% uptime for training
- Zero data loss on failures
- Automatic recovery < 1 second
Usability
- Single binary deployment
- Zero configuration defaults
- Drop-in PyTorch replacement