5h ago
Site Reliability Engineer
San Francisco
$255k-$490k / year
full-time · ai-ml
💼 About This Role
You'll operate the next generation of compute clusters powering OpenAI's frontier research, blending distributed systems engineering with hands-on infrastructure work in our largest data centers. You'll grow Kubernetes clusters to massive scale, automate bare-metal bring-up, and build the software layer that hides complexity across multiple data centers. Expect to manage fast-moving operations and continuously raise the bar for automation and uptime.
🎯 What You'll Do
- Spin up and scale large Kubernetes clusters with automation for lifecycle management
- Build software abstractions that unify multiple clusters for training workloads
- Own node bring-up from bare metal through firmware upgrades at massive scale
- Improve operational metrics like cluster restart times and firmware upgrade cycles
📋 Requirements
- Experience as an infrastructure or systems engineer in large-scale environments
- Strong knowledge of Kubernetes internals and cluster scaling patterns
- Proficiency in cloud infrastructure concepts and automation
✨ Nice to Have
- Background with GPU workloads
- Experience with firmware management
- Familiarity with high-performance computing
🎁 Benefits & Perks
- 💰 Equity offered
- 🏖️ Flexible time off (implied by culture)
- 🏥 Comprehensive health insurance (likely, not stated)