7h ago

Senior Site Reliability Engineer

US

$149.1k-$157.8k / year

full-time · senior · Remote · ai-ml

🛠 Tech Stack

💼 About This Role

You'll lead reliability initiatives for a cutting-edge AI-driven platform, designing resilient infrastructure for AI pipelines. You'll shape observability, automation, and developer enablement to ensure scalable, reliable services. This fully remote role offers strong technical ownership and leadership influence.

🎯 What You'll Do

  • Define SLIs, SLOs, and error budgets for production services and AI workloads
  • Design resilient infrastructure patterns for AI pipelines and observability
  • Lead incident response, disaster recovery, and post-incident reviews for long-term improvements
  • Develop and maintain observability solutions using monitoring and tracing tools
  • Manage infrastructure as code and automation strategies to improve operational efficiency

📋 Requirements

  • 6–8 years of experience in Site Reliability Engineering, Platform Engineering, or DevOps
  • Deep expertise with AWS, Kubernetes, Docker, Terraform, and GitOps
  • Strong experience with observability platforms and distributed tracing
  • Proficiency in Python and/or Bash scripting

✨ Nice to Have

  • Experience with Internal Developer Platform tools like Backstage
  • Experience supporting AI/ML infrastructure, LLM integrations, or agentic systems
  • Experience with FinOps, disaster recovery, policy-as-code, or regulated environments

🎁 Benefits & Perks

  • 💰 Competitive salary ($149,100 – $157,800)
  • 🏥 Medical, dental, and vision coverage
  • 🏖️ Flexible vacation policy
  • 📚 Company-sponsored training and professional development
  • 🧘 Annual wellness and fitness reimbursement

📨 Hiring Process

Estimated timeline: 2-4 weeks · AI estimate

  1. Recruiter screen · 30 min
  2. Technical interview · 60 min
  3. Hiring manager interview · 45 min