about 5 hours ago
Senior Staff Machine Learning Engineer, Data & Eval
United States
$244,000-$305,000 / year
Full-time, Senior, Remote, Travel
Description
In this role, you will set technical direction and lead execution for ML evaluation and the data flywheel powering CSxAI products at Airbnb. You will define how we measure quality, turn feedback into learning signals, and continuously improve models and products safely and efficiently, partnering closely with cross-functional teams.
Requirements
- PhD in Computer Science, Mathematics, Statistics, or related technical field (or equivalent practical experience).
- 10+ years building, testing, and shipping ML/AI systems end-to-end, including 2+ years with GenAI/LLM systems in production.
- 5+ years leading large, ambiguous technical initiatives as a senior IC, influencing roadmap and engineering/science direction across teams.
- Deep expertise in evaluation methodology (offline/online alignment, metric design, human-in-the-loop evaluation, A/B testing, power analysis, regression testing).
- Hands-on experience with GenAI systems (orchestration, retrieval, tool calling, memory).
- Experience building data pipelines and quality systems (labeling workflows, dataset curation, versioning, monitoring, governance).
- Solid ML fundamentals (model selection, training/serving, monitoring, reliability, model lifecycle management).
Responsibilities
- Define evaluation strategy and success metrics for GenAI systems, aligning offline evaluation with online business and customer experience outcomes.
- Build and scale evaluation frameworks (golden sets, synthetic data, automated regressions, rubric-based grading, LLM-as-judge) with strong controls for bias, drift, and reliability.
- Design the data flywheel: instrumentation, feedback collection, data quality checks, labeling strategy, dataset versioning, and governance.
- Lead cross-functional quality initiatives across product, ops, and engineering, driving clarity on what “good” looks like.
- Develop and productionize pipelines for dataset creation, model monitoring, evaluation-at-scale, and continuous testing.
- Drive technical decisions and architecture for evaluation and data infrastructure.