8h ago
Staff Cloud Site Reliability Engineer
London
✨ $180k-$250k / yearest.
full-timelead HybridAutonomous Vehicles / AI
🛠 Tech Stack
+2
💼 About This Role
You'll build and scale the reliability foundations of our AI cloud platform, including model development and GPU compute. You'll define SLOs, automation, and operational standards for large-scale distributed systems. This is a founding SRE role where you'll shape the function from scratch.
🎯 What You'll Do
- Own reliability, availability, and performance of cloud platforms.
- Define and operationalize SLOs, SLIs, and error budgets.
- Participate in 24/7 on-call rotation for incident response.
- Build automation for cluster operations and scaling tasks.
📋 Requirements
- Proven experience in an SRE role supporting large-scale cloud systems.
- Strong Kubernetes experience including production clusters.
- Hands-on experience with AWS, GCP, or Azure.
- Experience operating complex distributed systems with compute-heavy workloads.
✨ Nice to Have
- Experience operating GPU-backed environments or ML infrastructure.
- Familiarity with infrastructure-as-code (e.g., Terraform).
- Experience defining SLOs/SLIs and building reliability programs.
🎁 Benefits & Perks
- 🚀 Founding role with high impact on AI infrastructure.
- 💻 Hybrid work - 2 days per week in London office.
- 🌍 Work with cutting-edge Embodied AI technology.
0 0 0