5h ago
ML Infra Engineer
San Francisco
✨ $220k-$350k / year (est.)
full-time · ai-ml
💼 About This Role
You'll help scale and optimize training systems and core model code, owning critical infrastructure for large-scale training. You'll work closely with researchers to translate ideas into experiments and production training runs. This is a hands-on, high-leverage role at the intersection of ML, software engineering, and scalable infrastructure.
🎯 What You'll Do
- Own training/inference infrastructure: scheduling, job management, checkpointing, metrics/logging.
- Scale distributed training across TPU and GPU clusters with JAX.
- Profile and optimize memory usage, device utilization, and throughput.
- Build abstractions for launching, monitoring, debugging, and reproducing experiments.
📋 Requirements
- Strong software engineering fundamentals and experience building ML training infrastructure.
- Hands-on large-scale training experience in JAX (preferred) or PyTorch.
- Experience managing training workloads on cloud infrastructure (e.g., Kubernetes, GCP TPU/GKE, AWS).
- Ability to debug and optimize performance bottlenecks across the training stack.
✨ Nice to Have
- Deep ML systems background (training compilers, runtime optimization, custom kernels).
- Experience operating close to hardware (GPU/TPU performance tuning).
- Background in robotics, multimodal models, or large-scale foundation models.
🎁 Benefits & Perks
- 🏖️ Unlimited PTO
- 💰 Competitive salary and equity
- 🏥 Health, dental, and vision insurance
- 🍱 Daily meals and snacks
- 🚀 Opportunity to work on cutting-edge AI