2d ago

Site Reliability Engineer

Canada

$97k-$149.2k / year

full-time · senior · Remote · software

💼 About This Role

You'll build a reliability practice from the ground up, establishing SLOs, error budgets, and observability for AI workloads. You'll drive the transition from reactive firefighting to proactive reliability, with the autonomy to set the strategy. This role offers career-defining impact and the chance to architect incident response for a scaling platform.

🎯 What You'll Do

  • Define SLIs/SLOs for critical user journeys.
  • Lead live incident response as Incident Commander.
  • Design observability dashboards and actionable alerts.
  • Conduct production-readiness reviews and mentor engineers.

📋 Requirements

  • AWS expertise, including VPC, IAM, ECS, EC2, and Lambda.
  • SRE experience with SLI/SLO definition and error budgets.
  • Incident command experience in production environments.
  • Experience leading multi-region deployments.

✨ Nice to Have

  • Experience with Terraform or service mesh.
  • Familiarity with AI/ML workloads (LLMs, RAG).
  • Proficiency in Go or Python.

📨 Hiring Process

Estimated timeline: 2-4 weeks · AI estimate

  1. Recruiter Screen · 30 min
  2. Technical Interview · 60 min
  3. System Design Interview · 60 min
  4. Hiring Manager Chat · 45 min