6h ago
ML Infra Engineer
San Francisco
✨ $175k-$265k / year (est.)
full-time · senior · ai-ml
💼 About This Role
You'll scale and optimize training systems and core model code, owning critical infrastructure for large-scale training. You'll work with researchers and model engineers to translate ideas into experiments and production runs, at the intersection of ML, software engineering, and scalable infrastructure.
🎯 What You'll Do
- Design and maintain scheduling and job-management systems for large-scale model training
- Scale distributed JAX training across TPU and GPU clusters
- Profile and optimize memory usage, device utilization, and throughput
- Build abstractions for launching, monitoring, and debugging experiments
📋 Requirements
- Strong software engineering fundamentals with experience in JAX or PyTorch
- Hands-on large-scale training experience
- Familiarity with distributed training, multi-host setups, and evaluation pipelines
- Experience managing training workloads with cluster orchestrators and schedulers such as Kubernetes or SLURM
✨ Nice to Have
- Deep ML systems background (training compilers, runtime optimization)
- Experience with GPU/TPU performance tuning
- Background in robotics or multimodal models