10h ago
Machine Learning Infra Engineer
San Francisco, CA
$150k-$300k / year
full-time · ai-ml
About This Role
You'll build the inference and training frameworks for an AI document processing platform, directly impacting how enterprises extract data from complex PDFs. You'll scale distributed ML training across multi-node GPU clusters and create observability across the stack.
What You'll Do
- Build and maintain the training and inference stack for fast iteration
- Design systems for scaling training across multi-node, multi-GPU environments
- Scale distributed training and inference across large GPU clusters
- Develop benchmarks to identify bottlenecks in training and inference stacks
Requirements
- Strong Python skills and systems engineering background
- Experience with Kubernetes and distributed training frameworks
- Ability to solve complex problems from first principles
Nice to Have
- Experience at an early-stage or high-growth startup
- Contributions to open source training/inference stacks
- Excitement for distributed inference across hundreds to thousands of GPUs
Benefits & Perks
- Unlimited PTO
- Free daily lunch at office
- Reimbursed transportation costs
- Generous health insurance covering medical, dental, vision
- $150/month health and wellness budget
Hiring Process
Estimated timeline: 2-4 weeks · AI estimate
1. Recruiter screen · 30 min
2. Technical interview · 60 min
3. Onsite interviews · 3 hours