8h ago

Staff Cloud Site Reliability Engineer

London

$180k-$250k / yearest.

full-timelead HybridAutonomous Vehicles / AI

🛠 Tech Stack

+2

💼 About This Role

You'll build and scale the reliability foundations of our AI cloud platform, including model development and GPU compute. You'll define SLOs, automation, and operational standards for large-scale distributed systems. This is a founding SRE role where you'll shape the function from scratch.

🎯 What You'll Do

  • Own reliability, availability, and performance of cloud platforms.
  • Define and operationalize SLOs, SLIs, and error budgets.
  • Participate in 24/7 on-call rotation for incident response.
  • Build automation for cluster operations and scaling tasks.

📋 Requirements

  • Proven experience in an SRE role supporting large-scale cloud systems.
  • Strong Kubernetes experience including production clusters.
  • Hands-on experience with AWS, GCP, or Azure.
  • Experience operating complex distributed systems with compute-heavy workloads.

✨ Nice to Have

  • Experience operating GPU-backed environments or ML infrastructure.
  • Familiarity with infrastructure-as-code (e.g., Terraform).
  • Experience defining SLOs/SLIs and building reliability programs.

🎁 Benefits & Perks

  • 🚀 Founding role with high impact on AI infrastructure.
  • 💻 Hybrid work - 2 days per week in London office.
  • 🌍 Work with cutting-edge Embodied AI technology.
0 0 0