Research Engineer, RL Infrastructure and Reliability
San Francisco, CA
$350k-$850k / year
Full-time · Senior · AI/ML · Visa Sponsor
About This Role
You'll own the reliability, observability, and infrastructure foundation for the Knowledge Work team's training environments and evaluations. Your work will ensure stable, well-instrumented systems that proactively surface issues and let researchers focus on research. You'll be the trusted owner of evaluation integrity for model releases.
What You'll Do
- Serve as dedicated reliability owner for Knowledge Work training environments
- Own canonical evaluation tools and processes for model releases
- Build observability dashboards and operational tooling with high signal-to-noise
- Proactively harden systems via load testing, fault injection, and stress testing
Requirements
- Highly experienced Python engineer shipping production-trusted code
- Experience operating ML or distributed systems at scale with on-call
- Strong SRE or production-engineering mindset
- Foundational ML knowledge to evaluate training environments
Nice to Have
- 5+ years operating ML or distributed systems at scale
- Experience building or operating RL environments, agent harnesses, or LLM evaluation frameworks
- Familiarity with observability stacks and operational dashboard tooling
Benefits & Perks
- Competitive salary: $350k-$850k USD per year
- Hybrid work with San Francisco office
- Visa sponsorship offered
- Work on cutting-edge AI safety
- Bachelor's degree or equivalent required
Hiring Process
Estimated timeline: 2-4 weeks · AI estimate
- 1. Recruiter Screen · 30 min
- 2. Technical Interview · 60 min
- 3. Onsite Interviews · 3 hours