Research Engineer – Evals

San Francisco, CA

$160k-$240k / year

Full-time · Senior · Hybrid · Software

💼 About This Role

You'll build the evaluation systems that tell us whether Firecrawl actually works, designing metrics and pipelines to measure output quality across millions of websites. You'll own the feedback loop from quality measurement back to model and product decisions, working closely with RL and Search/IR engineers to turn evaluations into training signals. This role offers deep technical ownership and the chance to define what "good" means for web data extraction at scale.

🎯 What You'll Do

  • Build the evaluation stack from scratch, defining metrics and pipelines
  • Design benchmark datasets covering real-world distribution of customer data
  • Own LLM-as-judge pipelines for automated extraction quality scoring
  • Close the loop between evals and model training via RL/feedback signals
  • Run fast experiments and communicate results clearly to the team

📋 Requirements

  • 3+ years in ML engineering, applied AI, or data quality with production systems
  • Experience building eval infrastructure at scale, including pipeline and dataset curation
  • Deep understanding of LLM evaluation methodology, including LLM-as-judge pitfalls
  • Production experience with unstructured web data and quality metrics

✨ Nice to Have

  • Experience with RLHF pipelines and reward modeling
  • Background in building human review tooling for data quality
  • Familiarity with web scraping, dynamic rendering, and SPAs

๐ŸŽ Benefits & Perks

  • ๐Ÿ–๏ธ Unlimited PTO
  • ๐Ÿ’ฐ Equity up to 0.10%
  • ๐Ÿข Hybrid or remote (Americas timezones)
  • ๐Ÿš€ High velocity rapid iteration and deployment

📨 Hiring Process

Estimated timeline: 2-3 weeks · AI estimate

  1. Recruiter Screen · 30 min
  2. Technical Interview · 60 min
  3. Hiring Manager Chat · 45 min