ML Model Serving Engineer

San Francisco

$175k-$280k / year

full-time · senior · ai-ml

💼 About This Role

You'll turbocharge the serving layer for our LLM, speech, and vision models, partnering with the ML infrastructure and training teams to build a fast, cost-effective serving system. You'll modify frameworks like vLLM and SGLang to implement cutting-edge optimization techniques.

🎯 What You'll Do

  • Optimize the serving layer for LLM, speech, and vision models
  • Modify and extend LLM serving frameworks like vLLM and SGLang
  • Use in-flight batching, caching, and custom kernels to speed up inference
  • Reduce model initialization times while maintaining quality

📋 Requirements

  • Expertise in PyTorch or a similar differentiable array computing framework
  • Expertise in optimizing ML models for serving with high throughput and low latency
  • Significant systems programming experience with high-performance server systems
  • Significant performance engineering experience, including bottleneck analysis

✨ Nice to Have

  • Familiarity with vLLM or SGLang internals and deployment
  • Experience with GCP, AWS, or Azure cloud platforms
  • Experience deploying inference workloads with Kubernetes or Ray

🎁 Benefits & Perks

  • 💰 401(k) match up to 3.5% of compensation
  • 🏥 100% employer-paid health, vision, dental for you and dependents
  • 🏖️ Unlimited PTO and sick time
  • 💵 Flexible spending account with employer match up to $1,650/year
  • 📈 Competitive stock options

📨 Hiring Process

[email protected]
