8h ago

Staff Site Reliability Engineer

Playa Vista, CA or Remote

$180k-$250k / year (est.)

full-time · lead · Remote · e-commerce

💼 About This Role

You'll define and build the reliability foundation for Thrive Market's platform, working with a first-class group of engineers to establish SRE practices from the ground up. You'll balance hands-on reliability work with strategic thinking to build self-healing systems. This is a high-impact role at an exciting inflection point, on a containerized platform running on Kubernetes.

🎯 What You'll Do

  • Define and own SLOs and SLIs across critical platform services
  • Build and maintain monitoring and observability systems using Datadog, Prometheus, and Grafana
  • Design and implement chaos engineering practices to proactively identify failure modes
  • Lead incident response and conduct blameless postmortems

📋 Requirements

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering
  • Deep expertise in Kubernetes including cluster management and Helm charts
  • Advanced scripting in Bash, Python, Golang, Ruby, or similar
  • Extensive experience with AWS services including EKS, EC2, S3, VPC, and IAM

✨ Nice to Have

  • Experience with e-commerce platforms like Magento or Shopify
  • Experience with chaos engineering tools like Gremlin or Litmus
  • Familiarity with GitOps workflows and service mesh technologies

🎁 Benefits & Perks

  • 🏥 Comprehensive health benefits (medical, dental, vision, life, disability)
  • 💰 Competitive salary + equity
  • 🏖️ Flexible Paid Time Off
  • 🏋️ Subsidized ClassPass Membership
  • 🏦 401(k) plan