6h ago

Supercompute Infrastructure Engineer

Menlo Park, Remote

$200k-$300k / year (est.)

full-time · senior · Remote · ai-ml

💼 About This Role

You'll lead, design, and build large-scale compute clusters for AI scientific research. Your work will directly impact frontier research experiments. You'll operate clusters with thousands of GPUs using tools like Kubernetes and Slurm.

🎯 What You'll Do

  • Design and build large-scale GPU/CPU compute clusters
  • Write software for cluster orchestration and resource allocation
  • Automate cluster lifecycle operations and monitoring
  • Work on bring-up, operations, and maintenance of clusters

📋 Requirements

  • Experience with 5,000+ GPU clusters
  • Expertise in Kubernetes and Slurm scheduling
  • Proficiency in cloud platforms (GCP, AWS, Azure)
  • Experience with observability tools like Prometheus or Datadog

✨ Nice to Have

  • Experience with GitOps tools (ArgoCD, GitHub CI)
  • Familiarity with IaC tools (Terraform, Ansible)
  • Background in AI or scientific research infrastructure

🎁 Benefits & Perks

  • 🏖️ Unlimited PTO
  • 💰 Competitive equity
  • 🏥 Health, dental, vision insurance
  • 📚 Learning budget