5h ago
Software Engineer, Frontier Clusters Infrastructure
San Francisco
$230k-$490k / year
full-timeai-ml
🛠 Tech Stack
💼 About This Role
You'll build and operate the next generation of compute clusters that power OpenAI's frontier research. You'll scale Kubernetes to massive size and automate bare-metal bring-up to deliver seamless supercomputing infrastructure. Your work will directly impact the speed and reliability of training the world's most advanced AI models.
🎯 What You'll Do
- Scale Kubernetes clusters with automation for provisioning and lifecycle management
- Build software abstractions unifying multiple clusters for training workloads
- Automate bare-metal node bring-up including firmware upgrades
- Improve operational metrics like cluster restart times and upgrade cycles
📋 Requirements
- Kubernetes internals and cluster scaling in high-growth environments
- Programming skills in Python or Go with Infrastructure-as-Code (Terraform)
- Experience with bare-metal Linux and large-scale networking
- Background in infrastructure or systems engineering for high-availability systems
✨ Nice to Have
- Experience with GPU workloads or HPC environments
- Familiarity with firmware management
🎁 Benefits & Perks
- 💰 Competitive equity package
- 🩺 Comprehensive health insurance
- 🏢 San Francisco office with onsite perks
- 🚀 Work on frontier AI research
0 0 0