Senior Staff Machine Learning Engineer, Data & Eval at Positions Archive

about 5 hours ago

United States

$244,000-$305,000 / year

Full-time, senior, remote, travel

Description

In this role, you will set technical direction and lead execution for ML evaluation and the data flywheel powering CSxAI products at Airbnb. You will define how we measure quality, turn feedback into learning signals, and continuously improve models and products safely and efficiently, partnering closely with cross-functional teams.

Requirements

PhD in Computer Science, Mathematics, Statistics, or related technical field (or equivalent practical experience).
10+ years building, testing, and shipping ML/AI systems end-to-end, including 2+ years with GenAI/LLM systems in production.
5+ years leading large, ambiguous technical initiatives as a senior IC, influencing roadmap and engineering/science direction across teams.
Deep expertise in evaluation methodology (offline/online alignment, metric design, human-in-the-loop evaluation, A/B testing, power analysis, regression testing).
Hands-on experience with GenAI systems (orchestration, retrieval, tool calling, memory).
Experience building data pipelines and quality systems (labeling workflows, dataset curation, versioning, monitoring, governance).
Solid ML fundamentals (model selection, training/serving, monitoring, reliability, model lifecycle management).

Responsibilities

Define evaluation strategy and success metrics for GenAI systems, aligning offline evaluation with online business and customer experience outcomes.
Build and scale evaluation frameworks (golden sets, synthetic data, automated regressions, rubric-based grading, LLM-as-judge) with strong controls for bias, drift, and reliability.
Design the data flywheel: instrumentation, feedback collection, data quality checks, labeling strategy, dataset versioning, and governance.
Lead cross-functional quality initiatives across product, ops, and engineering, driving clarity on what “good” looks like.
Develop and productionize pipelines for dataset creation, model monitoring, evaluation-at-scale, and continuous testing.
Drive technical decisions and architecture for evaluation and data infrastructure.
