Senior Site Reliability Engineer - Observability at Lambda — CareerPair

Senior Site Reliability Engineer - Observability

San Francisco, CA | San Jose, CA | Bellevue, WA

$240k-$401k / year

Full-time | Lead | Hybrid | AI/ML

💼 About This Role

You'll deploy and operate observability platforms for logging, metrics, and distributed tracing at a leading AI cloud company. Your work will directly improve product reliability for thousands of customers. This role requires onsite presence in San Francisco, San Jose, or Bellevue 4 days per week.

🎯 What You'll Do

Deploy and operate observability platforms for logging, metrics, and tracing
Automate deployment and operation of observability systems
Set up monitoring for modern AI/HPC cluster infrastructure
Develop platform software that makes observability easier to adopt

📋 Requirements

8+ years in software engineering
3+ years in Go
5+ years in Site Reliability Engineering practices
Experience with Kubernetes for application deployment and monitoring

✨ Nice to Have

Experience with Prometheus and PromQL
Understanding of OpenTelemetry ecosystem and OTel collector
Experience with Ansible or Terraform

🎁 Benefits & Perks

💵 Generous cash & equity compensation
🏥 Health, dental, and vision coverage
💪 Wellness and commuter stipends
💰 401(k) plan with 2% company match
🏖️ Flexible paid time off

📨 Hiring Process

Estimated timeline: 2-4 weeks · AI estimate

1. Recruiter Screen · 30 min
2. Technical Interview · 60 min
3. Onsite Interview · 4 hours
