Site Reliability Engineer at OpenAI

5h ago

Site Reliability Engineer

San Francisco

$255k-$490k / year

full-timeai-ml

🛠 Tech Stack

💼 About This Role

You'll operate the next generation of compute clusters powering OpenAI's frontier research, blending distributed systems engineering with hands-on infrastructure work on our largest datacenters. You will scale Kubernetes clusters to massive scale, automate bare-metal bring-up, and build the software layer that hides complexity across multiple data centers. Expect to manage fast-moving operations and continuously raise the bar for automation and uptime.

🎯 What You'll Do

Spin up and scale large Kubernetes clusters with automation for lifecycle management
Build software abstractions that unify multiple clusters for training workloads
Own node bring-up from bare metal through firmware upgrades at massive scale
Improve operational metrics like cluster restart times and firmware upgrade cycles

📋 Requirements

Experience as an infrastructure or systems engineer in large-scale environments
Strong knowledge of Kubernetes internals and cluster scaling patterns
Proficiency in cloud infrastructure concepts and automation

✨ Nice to Have

Background with GPU workloads
Experience with firmware management
Familiarity with high-performance computing

🎁 Benefits & Perks

💰 Equity offered
🏖️ Flexible time off (implied by culture)
🏥 Comprehensive health insurance (likely, not stated)

OpenAI

OpenAI Jobs

Other jobs at OpenAI

No other jobs found.

0 0 0