18h ago
Software Engineer - Training Infrastructure
San Francisco
$165k-$330k / year
full-time · senior · Hybrid · ai-ml
💼 About This Role
You'll architect and lead development of our training platform, which supports top-tier research engineers and model developers. You'll own the scheduling, storage, networking, reliability, and observability of the systems in the training stack. This role offers the chance to work on cutting-edge ML training infrastructure used by leading AI companies.
🎯 What You'll Do
- Design and architect scalable infrastructure systems for the ML training platform
- Partner with developers to translate training requirements into technical solutions
- Design and architect a global training scheduler
- Design and architect reinforcement learning and continuous learning pipelines
- Drive long-term improvements to system reliability and development velocity
📋 Requirements
- Bachelor's degree in Computer Science or related field
- Proficiency in Go
- Deep expertise with Kubernetes in production environments
- Advanced understanding of distributed systems concepts and performance tuning
✨ Nice to Have
- Experience with distributed storage systems
- Experience with workload orchestration platforms like Temporal or Airflow
- Familiarity with the open-source training stack and frameworks (NCCL, PyTorch, Megatron, etc.)
🎁 Benefits & Perks
- 💰 Competitive compensation including meaningful equity
- 🏥 100% coverage of medical, dental, and vision for employees and dependents
- 🏖️ Flexible PTO including company-wide Winter Break
- 👶 Paid parental leave
- 🌱 Fertility and family-building stipend through Carrot