6h ago
Supercompute Infrastructure Engineer
Menlo Park, Remote
✨ $200k-$300k / year (est.)
full-time senior Remote ai-ml
💼 About This Role
You'll lead, design, and build large-scale compute clusters for AI scientific research. Your work will directly impact frontier research experiments. You'll operate clusters with thousands of GPUs using tools like Kubernetes and Slurm.
🎯 What You'll Do
- Design and build large-scale GPU/CPU compute clusters
- Write software for cluster orchestration and resource allocation
- Automate cluster lifecycle operations and monitoring
- Work on the bring-up, operation, and maintenance of clusters
📋 Requirements
- Experience with 5,000+ GPU clusters
- Expertise in Kubernetes and Slurm scheduling
- Proficiency in cloud platforms (GCP, AWS, Azure)
- Experience with observability tools like Prometheus or Datadog
✨ Nice to Have
- Experience with GitOps tools (ArgoCD, GitHub CI)
- Familiarity with IaC tools (Terraform, Ansible)
- Background in AI or scientific research infrastructure
🎁 Benefits & Perks
- 🏖️ Unlimited PTO
- 💰 Competitive equity
- 🏥 Health, dental, vision insurance
- 📚 Learning budget