18h ago

Senior Staff Cloud Support Engineer

San Francisco, CA

$180k-$220k / year

full-time · senior · ai-ml

💼 About This Role

You'll lead technical escalations and reliability architecture for Crusoe Cloud's AI infrastructure. Your core impact will be reducing incident mean time to resolution (MTTR) and designing systemic fixes across the compute, networking, and orchestration layers. This role offers the opportunity to shape high-performance AI clusters at a vertically integrated, energy-first company.

🎯 What You'll Do

  • Serve as the highest-level escalation point for complex P0/P1 incidents.
  • Lead cross-functional root cause investigations across compute, networking, storage, and orchestration.
  • Design and improve node validation, burn-in processes, and performance baselining.
  • Act as a senior technical advisor during high-risk customer incidents.

📋 Requirements

  • 8+ years of experience in SRE, DevOps, HPC, or Cloud Infrastructure.
  • Advanced Linux systems expertise.
  • Deep Kubernetes operational experience (CKA-level or higher).
  • Networking knowledge: InfiniBand, RDMA, RoCE, SDN.

✨ Nice to Have

  • Experience with Slurm workload orchestration.
  • Experience with Terraform.
  • Experience with NCCL and GPU driver/firmware troubleshooting.

🎁 Benefits & Perks

  • 💰 Competitive compensation
  • 📈 Restricted Stock Units
  • 🏖️ Paid time off & holidays
  • 🩺 Comprehensive health, dental & vision insurance
  • 👶 Paid parental leave