10h ago

Machine Learning Infra Engineer

San Francisco, CA

$150k-$300k / year

full-time · ai-ml

💼 About This Role

You'll build the inference and training frameworks for an AI document processing platform, directly impacting how enterprises extract data from complex PDFs. You'll scale distributed ML training across multi-node GPU clusters and create observability across the stack.

🎯 What You'll Do

  • Build and maintain the training and inference stack for fast iteration
  • Design systems for scaling training across multi-node, multi-GPU environments
  • Scale distributed training and inference across large GPU clusters
  • Develop benchmarks to identify bottlenecks in training and inference stacks

📋 Requirements

  • Strong Python skills and systems engineering background
  • Experience with Kubernetes and distributed training frameworks
  • Ability to solve complex problems from first principles

✨ Nice to Have

  • Experience at an early-stage or high-growth startup
  • Contributions to open source training/inference stacks
  • Excitement for distributed inference across hundreds to thousands of GPUs

🎁 Benefits & Perks

  • 🏖️ Unlimited PTO
  • 🍽️ Free daily lunch at office
  • 🚗 Reimbursed transportation costs
  • 🏥 Generous health insurance covering medical, dental, vision
  • 💪 $150/month health and wellness budget

📨 Hiring Process

Estimated timeline: 2-4 weeks · AI estimate

  1. Recruiter screen · 30 min
  2. Technical interview · 60 min
  3. Onsite interviews · 3 hours