1d ago

Site Reliability Engineer

San Francisco, California

$350k-$475k / year

full-time · senior · ai-ml · Visa Sponsor

💼 About This Role

You'll drive the reliability of Tinker, our fine-tuning API, end-to-end. You'll work alongside engineers and research teams to make every layer of the system more robust and resilient. This role offers the chance to shape reliability for a rapidly growing AI platform.

🎯 What You'll Do

  • Define and own end-to-end reliability, from CI/CD through production.
  • Develop Service Level Objectives for distributed training systems.
  • Design and implement monitoring and observability across training paths.
  • Drive incident response and systematic improvements to prevent recurrence.

📋 Requirements

  • Bachelor's degree in CS or equivalent experience.
  • Experience in distributed systems or cloud infrastructure.
  • Proficiency in writing software for reliability automation.
  • Experience with production incident response and postmortems.

✨ Nice to Have

  • Deep experience operating production cloud services at scale.
  • Background in distributed training frameworks and infrastructure failures.
  • Track record building checkpoint and recovery systems for long-running jobs.

🎁 Benefits & Perks

  • 🏥 Health, dental, and vision insurance
  • 🏖️ Unlimited PTO
  • 👶 Paid parental leave
  • 📦 Relocation support

📨 Hiring Process

Estimated timeline: 2-4 weeks · AI estimate

  1. Recruiter Call · 30 min
  2. Technical Interview · 60 min
  3. Hiring Manager · 45 min