1d ago

Senior Site Reliability Engineer - Observability

San Francisco, CA | San Jose, CA | Bellevue, WA

$240k-$401k / year

full-time | lead | Hybrid | ai-ml

💼 About This Role

You'll deploy and operate observability platforms for logging, metrics, and distributed tracing at a leading AI cloud company. Your work will directly improve product reliability for thousands of customers. This role requires onsite presence in San Francisco, San Jose, or Bellevue 4 days per week.

🎯 What You'll Do

  • Deploy and operate observability platforms for logging, metrics, and tracing
  • Automate deployment and operation of observability systems
  • Set up monitoring for modern AI/HPC cluster infrastructure
  • Develop platform software that makes observability easy for teams to adopt

📋 Requirements

  • 8+ years in software engineering
  • 3+ years in Go
  • 5+ years in Site Reliability Engineering practices
  • Experience with Kubernetes for application deployment and monitoring

✨ Nice to Have

  • Experience with Prometheus and PromQL
  • Understanding of OpenTelemetry ecosystem and OTel collector
  • Experience with Ansible or Terraform

🎁 Benefits & Perks

  • 💵 Generous cash & equity compensation
  • 🏥 Health, dental, and vision coverage
  • 💪 Wellness and commuter stipends
  • 💰 401(k) plan with 2% company match
  • 🏖️ Flexible paid time off

📨 Hiring Process

Estimated timeline: 2-4 weeks · AI estimate

  1. Recruiter Screen · 30 min
  2. Technical Interview · 60 min
  3. Onsite Interview · 4 hours