18h ago

Software Engineer - Training Infrastructure

San Francisco

$165k-$330k / year

full-time senior Hybrid ai-ml

💼 About This Role

You'll architect and lead development of our training platform, supporting top-tier research engineers and model developers. You'll own scheduling, storage, networking, reliability, and observability of technical systems in the training stack. This role offers the chance to work on cutting-edge ML infrastructure for leading AI companies.

🎯 What You'll Do

  • Design and architect scalable infrastructure systems for the ML training platform
  • Partner with developers to translate training requirements into technical solutions
  • Design and architect a global training scheduler
  • Design and architect reinforcement learning and continuous learning pipelines
  • Drive long-term improvements to system reliability and development velocity

📋 Requirements

  • Bachelor's degree in Computer Science or related field
  • Proficiency in Go
  • Deep expertise with Kubernetes in production environments
  • Advanced understanding of distributed systems concepts and performance tuning

✨ Nice to Have

  • Experience with distributed storage systems
  • Experience with workload orchestration platforms like Temporal or Airflow
  • Familiarity with the open-source training stack and frameworks (NCCL, PyTorch, Megatron, etc.)

🎁 Benefits & Perks

  • 💰 Competitive compensation including meaningful equity
  • 🏥 100% coverage of medical, dental, and vision for employees and dependents
  • 🏖️ Flexible PTO including company-wide Winter Break
  • 👶 Paid parental leave
  • 🌱 Fertility and family-building stipend through Carrot