11h ago
Member of Technical Staff - Training Platform
San Francisco, CA
$150k-$300k / year
full-timemidai-ml Visa Sponsor
๐ Tech Stack
+3
๐ผ About This Role
You'll build the hosted training platform that lets users launch fine-tuning runs on managed GPU clusters. You'll design the developer-facing platform and the Kubernetes-based training infrastructure. This role spans full-stack product development and distributed systems at frontier AI scale.
๐ฏ What You'll Do
- Design and operate Kubernetes-based training orchestration across multi-cloud GPU fleets
- Build FastAPI backend services and REST APIs for job submission and monitoring
- Ship product UI in Next.js / React / TypeScript with live run monitoring
- Develop Python control-plane agents that watch pods and keep clusters in sync
๐ Requirements
- 3-5 years of experience in platform or infrastructure engineering
- Strong Kubernetes operations experience with Helm, CRDs, operators
- Strong Python backend development with FastAPI
- Working knowledge of distributed training and GPU hardware
โจ Nice to Have
- Experience with vLLM, SGLang, or TensorRT-LLM
- Familiarity with GCP (GKE, GCS) and infrastructure automation
- Frontend experience with React/Next.js and TypeScript
๐ Benefits & Perks
- ๐ฐ Competitive salary ($150Kโ$300K) with significant equity
- ๐ก Flexible work (remote or San Francisco office)
- ๐ Full visa sponsorship and relocation support
- ๐ Professional development budget for courses and conferences
- ๐ Regular team off-sites and conference attendance
๐จ Hiring Process
Estimated timeline: 1-3 weeks ยท AI estimate
- 1Recruiter Screenยท 30 min
- 2Technical Interviewยท 60 min
- 3Final Roundยท 60 min
0 0 0