Research Engineer – Evals
San Francisco, CA
$160k-$240k / year
Full-time · Senior · Hybrid · Software
💼 About This Role
You'll build the evaluation systems that tell us whether Firecrawl actually works, designing metrics and pipelines to measure output quality across millions of websites. You'll own the feedback loop from quality measurement back to model and product decisions, working closely with RL and Search/IR engineers to turn evaluations into training signals. This role offers deep technical ownership and the chance to define what "good" means for web data extraction at scale.
🎯 What You'll Do
- Build eval stack from scratch, defining metrics and pipelines
- Design benchmark datasets that cover the real-world distribution of customer data
- Own LLM-as-judge pipelines for automated extraction quality scoring
- Close the loop between evals and model training via RL/feedback signals
- Run fast experiments and communicate results clearly to the team
📋 Requirements
- 3+ years in ML engineering, applied AI, or data quality with production systems
- Experience building eval infrastructure at scale, including pipeline and dataset curation
- Deep understanding of LLM evaluation methodology, including LLM-as-judge pitfalls
- Production experience with unstructured web data and quality metrics
✨ Nice to Have
- Experience with RLHF pipelines and reward modeling
- Background in building human review tooling for data quality
- Familiarity with web scraping, dynamic rendering, and SPAs
🎁 Benefits & Perks
- 🏖️ Unlimited PTO
- 💰 Equity up to 0.10%
- 🏢 Hybrid or remote (Americas time zones)
- 🚀 High-velocity iteration and deployment
🚨 Hiring Process
Estimated timeline: 2-3 weeks · AI estimate
- 1. Recruiter Screen · 30 min
- 2. Technical Interview · 60 min
- 3. Hiring Manager Chat · 45 min