1d ago

Research Infrastructure Engineer, Training Systems

San Francisco

$295k-$380k / year

full-time Hybridai-ml

🛠 Tech Stack

💼 About This Role

You'll build ML training infrastructure that enables new model training approaches at scale. Your work will sit on the critical path for model releases, combining direct impact with the responsibility of building reliable systems.

🎯 What You'll Do

  • Build infrastructure for large-scale model training and experimentation.
  • Design APIs for complex training workflows.
  • Improve reliability and performance of training and data pipelines.
  • Debug issues across Python, PyTorch, distributed systems, and GPUs.

📋 Requirements

  • Strong systems instincts and deep care for performance and reliability.
  • Good taste in API design with empathy for users.
  • Comfortable working across ML research code and production infrastructure.
  • Experience debugging from evidence: profiles, traces, logs, tests.

🎁 Benefits & Perks

  • 💰 $295K-$380K salary with equity
  • 🏥 Comprehensive health benefits
  • 🏖️ Flexible PTO
  • 🏢 Hybrid work model in San Francisco
0 0 0