Senior Site Reliability Engineer - Observability
San Francisco, CA | San Jose, CA | Bellevue, WA
$240k-$401k / year
full-time | lead | Hybrid | ai-ml
About This Role
You'll deploy and operate observability platforms for logging, metrics, and distributed tracing at a leading AI cloud company. Your work will directly improve product reliability for thousands of customers. This role requires onsite presence in San Francisco, San Jose, or Bellevue 4 days per week.
What You'll Do
- Deploy and operate observability platforms for logging, metrics, and tracing
- Automate deployment and operation of observability systems
- Set up monitoring for modern AI/HPC cluster infrastructure
- Develop platform software that makes observability easy to adopt
Requirements
- 8+ years in software engineering
- 3+ years in Go
- 5+ years in Site Reliability Engineering practices
- Experience with Kubernetes for application deployment and monitoring
Nice to Have
- Experience with Prometheus and PromQL
- Understanding of OpenTelemetry ecosystem and OTel collector
- Experience with Ansible or Terraform
Benefits & Perks
- Generous cash & equity compensation
- Health, dental, and vision coverage
- Wellness and commuter stipends
- 401k Plan with 2% company match
- Flexible paid time off
Hiring Process
Estimated timeline: 2-4 weeks · AI estimate
1. Recruiter Screen · 30 min
2. Technical Interview · 60 min
3. Onsite Interview · 4 hours