7h ago
Senior Site Reliability Engineer
US
$149.1k-$157.8k / year
full-time · senior · Remote · ai-ml
About This Role
You'll lead reliability initiatives for a cutting-edge AI-driven platform, designing resilient infrastructure for AI pipelines. You'll shape observability, automation, and developer enablement to ensure scalable, reliable services. This fully remote role offers strong technical ownership and leadership influence.
What You'll Do
- Define SLIs, SLOs, and error budgets for production services and AI workloads
- Design resilient infrastructure patterns for AI pipelines and observability
- Lead incident response, disaster recovery, and post-incident reviews for long-term improvements
- Develop and maintain observability solutions using monitoring and tracing tools
- Manage infrastructure as code and automation strategies to improve operational efficiency
Requirements
- 6–8 years of experience in Site Reliability Engineering, Platform Engineering, or DevOps
- Deep expertise with AWS, Kubernetes, Docker, Terraform, and GitOps
- Strong experience with observability platforms and distributed tracing
- Proficiency in Python and/or Bash scripting
Nice to Have
- Experience with Internal Developer Platform tools like Backstage
- Experience supporting AI/ML infrastructure, LLM integrations, or agentic systems
- Experience with FinOps, disaster recovery, policy-as-code, or regulated environments
Benefits & Perks
- Competitive salary ($149,100 – $157,800)
- Medical, dental, and vision coverage
- Flexible vacation policy
- Company-sponsored training and professional development
- Annual wellness and fitness reimbursement
Hiring Process
Estimated timeline: 2-4 weeks · AI estimate
- 1. Recruiter screen · 30 min
- 2. Technical interview · 60 min
- 3. Hiring manager interview · 45 min