Software Engineer, Frontier Clusters Infrastructure

San Francisco

$230k-$490k / year

full-time · ai-ml

💼 About This Role

You'll build and operate the next generation of compute clusters that power OpenAI's frontier research. You'll scale Kubernetes to massive size and automate bare-metal bring-up to deliver seamless supercomputing infrastructure. Your work will directly impact the speed and reliability of training the world's most advanced AI models.

🎯 What You'll Do

  • Scale Kubernetes clusters with automation for provisioning and lifecycle management
  • Build software abstractions unifying multiple clusters for training workloads
  • Automate bare-metal node bring-up including firmware upgrades
  • Improve operational metrics like cluster restart times and upgrade cycles

📋 Requirements

  • Knowledge of Kubernetes internals and experience scaling clusters in high-growth environments
  • Programming skills in Python or Go, plus experience with Infrastructure-as-Code (Terraform)
  • Experience with bare-metal Linux and large-scale networking
  • Background in infrastructure or systems engineering for high-availability systems

✨ Nice to Have

  • Experience with GPU workloads or HPC environments
  • Familiarity with firmware management

🎁 Benefits & Perks

  • 💰 Competitive equity package
  • 🩺 Comprehensive health insurance
  • 🏢 San Francisco office with onsite perks
  • 🚀 Work on frontier AI research