Research Engineer, RL Infrastructure and Reliability

San Francisco, CA

$350k-$850k / year

full-time · senior · ai-ml · Visa Sponsor

💼 About This Role

You'll own the reliability, observability, and infrastructure foundation for the Knowledge Work team's training environments and evaluations. Your work will ensure stable, well-instrumented systems that proactively surface issues and let researchers focus on research. You'll be the trusted owner of evaluation integrity for model releases.

🎯 What You'll Do

  • Serve as dedicated reliability owner for Knowledge Work training environments
  • Own canonical evaluation tools and processes for model releases
  • Build observability dashboards and operational tooling with high signal-to-noise
  • Proactively harden systems via load testing, fault injection, and stress testing

📋 Requirements

  • Highly experienced Python engineer shipping production-trusted code
  • Experience operating ML or distributed systems at scale, including on-call duty
  • Strong SRE or production-engineering mindset
  • Foundational ML knowledge to evaluate training environments

✨ Nice to Have

  • 5+ years operating ML or distributed systems at scale
  • Experience building or operating RL environments, agent harnesses, or LLM evaluation frameworks
  • Familiarity with observability stacks and operational dashboard tooling

🎁 Benefits & Perks

  • 💰 Competitive salary: $350k-$850k USD per year
  • 🏢 Hybrid work with San Francisco office
  • 🌍 Visa sponsorship offered
  • 🧠 Work on cutting-edge AI safety
  • 🎓 Bachelor's degree or equivalent required

📨 Hiring Process

Estimated timeline: 2-4 weeks · AI estimate

  1. Recruiter Screen · 30 min
  2. Technical Interview · 60 min
  3. Onsite Interviews · 3 hours