5h ago

Site Reliability Engineer

San Francisco

$255k-$490k / year

full-timeai-ml

🛠 Tech Stack

💼 About This Role

You'll operate the next generation of compute clusters powering OpenAI's frontier research, blending distributed systems engineering with hands-on infrastructure work on our largest datacenters. You will scale Kubernetes clusters to massive scale, automate bare-metal bring-up, and build the software layer that hides complexity across multiple data centers. Expect to manage fast-moving operations and continuously raise the bar for automation and uptime.

🎯 What You'll Do

  • Spin up and scale large Kubernetes clusters with automation for lifecycle management
  • Build software abstractions that unify multiple clusters for training workloads
  • Own node bring-up from bare metal through firmware upgrades at massive scale
  • Improve operational metrics like cluster restart times and firmware upgrade cycles

📋 Requirements

  • Experience as an infrastructure or systems engineer in large-scale environments
  • Strong knowledge of Kubernetes internals and cluster scaling patterns
  • Proficiency in cloud infrastructure concepts and automation

✨ Nice to Have

  • Background with GPU workloads
  • Experience with firmware management
  • Familiarity with high-performance computing

🎁 Benefits & Perks

  • 💰 Equity offered
  • 🏖️ Flexible time off (implied by culture)
  • 🏥 Comprehensive health insurance (likely, not stated)
0 0 0