1d ago
Site Reliability Engineer
San Francisco, California
$350k-$475k / year
full-time · senior · ai-ml · Visa Sponsor
About This Role
You'll drive the reliability of Tinker, our fine-tuning API, end-to-end. You'll work alongside engineers and research teams to make every layer of the system more robust and resilient. This role offers the chance to shape reliability for a rapidly growing AI platform.
What You'll Do
- Define and own end-to-end reliability, from CI/CD through production.
- Develop Service Level Objectives for distributed training systems.
- Design and implement monitoring and observability across training paths.
- Drive incident response and systematic improvements to prevent recurrence.
Requirements
- Bachelor's degree in CS or equivalent experience.
- Experience in distributed systems or cloud infrastructure.
- Proficiency in writing software for reliability automation.
- Experience with production incident response and postmortems.
Nice to Have
- Deep experience operating production cloud services at scale.
- Background in distributed training frameworks and infrastructure failure modes.
- Track record building checkpoint and recovery systems for long-running jobs.
Benefits & Perks
- Health, dental, and vision insurance
- Unlimited PTO
- Paid parental leave
- Relocation support
Hiring Process
Estimated timeline: 2-4 weeks · AI estimate
- 1. Recruiter Call · 30 min
- 2. Technical Interview · 60 min
- 3. Hiring Manager · 45 min