ML Model Serving Engineer
San Francisco
$175k-$280k / year
Full-time · Senior · AI/ML
💼 About This Role
You'll turbocharge our serving layer for LLM, speech, and vision models, partnering with the ML infrastructure and training teams to build a fast, cost-effective serving system. You'll modify frameworks such as vLLM and SGLang to implement cutting-edge optimization techniques.
🎯 What You'll Do
- Optimize serving layer for LLM, speech, and vision models
- Modify and extend LLM serving frameworks such as vLLM and SGLang
- Use in-flight batching, caching, and custom kernels to speed inference
- Reduce model initialization times while maintaining quality
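To give a feel for the in-flight batching mentioned above, here is a minimal, illustrative sketch of the scheduling idea behind frameworks like vLLM and SGLang: new requests join the running batch as soon as finished sequences free a slot, rather than waiting for an entire batch to drain. All names and the token-countdown model are simplifications invented for this sketch, not the API of any real framework.

```python
from collections import deque

class InFlightBatcher:
    """Toy continuous (in-flight) batching scheduler.

    Each request is modeled only by how many decode steps it still
    needs; a real server would run a model forward pass per step.
    """

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.waiting = deque()   # requests not yet admitted
        self.running = []        # (request_id, tokens_remaining)

    def submit(self, request_id, tokens_to_generate):
        self.waiting.append((request_id, tokens_to_generate))

    def step(self):
        """Run one decode iteration; return ids that finished this step."""
        # Admit waiting requests into any free slots (the in-flight part:
        # admission happens every step, not only between batches).
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        # One decode step: every running sequence emits one token.
        self.running = [(rid, left - 1) for rid, left in self.running]
        done = [rid for rid, left in self.running if left == 0]
        self.running = [(rid, left) for rid, left in self.running if left > 0]
        return done
```

For example, with `max_batch_size=2` and three requests, the third request is admitted on the very step after the first one finishes, keeping the batch full instead of idling a slot.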
📋 Requirements
- Expert in PyTorch or similar differentiable array computing framework
- Expert in optimizing ML models for serving with high throughput and low latency
- Significant systems programming experience with high-performance server systems
- Significant performance engineering experience in bottleneck analysis
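The performance-engineering requirement above usually starts with measurement. As a hedged illustration (the harness and all names here are hypothetical, not part of any framework named in this posting), a first-pass latency/throughput probe for a serving handler might look like:

```python
import statistics
import time

def profile_serving(handler, requests, warmup=3):
    """Minimal single-threaded latency/throughput harness.

    `handler` is any callable that serves one request. Returns median
    and tail latency in seconds plus overall requests-per-second, the
    numbers a bottleneck analysis typically begins from.
    """
    for r in requests[:warmup]:
        handler(r)  # warm caches / lazy initialization before measuring

    latencies = []
    start = time.perf_counter()
    for r in requests:
        t0 = time.perf_counter()
        handler(r)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    return {
        "p50_s": statistics.median(latencies),
        "p99_s": statistics.quantiles(latencies, n=100)[98],
        "throughput_rps": len(requests) / elapsed,
    }
```

Comparing p50 against p99 is often the quickest way to tell steady-state cost from stragglers (e.g. batching stalls or cold caches) before reaching for a profiler.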
✨ Nice to Have
- Familiarity with vLLM or SGLang internals and deployment
- Experience with GCP, AWS, or Azure cloud platforms
- Experience deploying inference workloads with Kubernetes or Ray
🎁 Benefits & Perks
- 💰 401(k) match up to 3.5% of compensation
- 🏥 100% employer-paid health, vision, dental for you and dependents
- 🏖️ Unlimited PTO and sick time
- 💵 Flexible spending account with employer match up to $1,650/year
- 📈 Competitive stock options