1d ago
Research Infrastructure Engineer, Training Systems
San Francisco
$295k-$380k / year
full-time Hybridai-ml
🛠 Tech Stack
💼 About This Role
You'll build ML training infrastructure that enables new model training approaches at scale. Your work will sit on the critical path for model releases, combining direct impact with the responsibility of building reliable systems.
🎯 What You'll Do
- Build infrastructure for large-scale model training and experimentation.
- Design APIs for complex training workflows.
- Improve reliability and performance of training and data pipelines.
- Debug issues across Python, PyTorch, distributed systems, and GPUs.
📋 Requirements
- Strong systems instincts and deep care for performance and reliability.
- Good taste in API design with empathy for users.
- Comfortable working across ML research code and production infrastructure.
- Experience debugging from evidence: profiles, traces, logs, tests.
🎁 Benefits & Perks
- 💰 $295K-$380K salary with equity
- 🏥 Comprehensive health benefits
- 🏖️ Flexible PTO
- 🏢 Hybrid work model in San Francisco
0 0 0