8h ago
Staff Site Reliability Engineer
Playa Vista, CA or Remote
✨ $180k-$250k / year (est.)
full-time, lead, Remote, e-commerce
💼 About This Role
You'll define and build the reliability foundation for Thrive Market's platform, working with a first-class group of engineers to establish SRE practices from the ground up. You'll balance hands-on reliability work with strategic thinking to build self-healing systems. This is a high-impact role at an exciting inflection point, on a containerized platform running on Kubernetes.
🎯 What You'll Do
- Define and own SLOs and SLIs across critical platform services
- Build and maintain monitoring and observability systems using Datadog, Prometheus, and Grafana
- Design and implement chaos engineering practices to proactively identify failure modes
- Lead incident response and conduct blameless postmortems
📋 Requirements
- 7+ years of experience in SRE, DevOps, or Infrastructure Engineering
- Deep expertise in Kubernetes, including cluster management and Helm charts
- Advanced scripting in Bash, Python, Golang, Ruby, or similar
- Extensive experience with AWS services, including EKS, EC2, S3, VPC, and IAM
✨ Nice to Have
- Experience with e-commerce platforms like Magento or Shopify
- Experience with chaos engineering tools like Gremlin or Litmus
- Familiarity with GitOps workflows and service mesh technologies
🎁 Benefits & Perks
- 🏥 Comprehensive health benefits (medical, dental, vision, life, disability)
- 💰 Competitive salary + equity
- 🏖️ Flexible Paid Time Off
- 🏋️ Subsidized ClassPass Membership
- 🏦 401k plan