Machine Learning Infrastructure Engineer

Redwood City, CA

$150k-$350k / year

full-time · senior · ai-ml

💼 About This Role

You'll design, build, and maintain training and serving infrastructure for ML research at a fast-growing AI company. You'll maximize GPU utilization and build tooling to diagnose cluster issues. Over 20 million users interact with our characters monthly.

🎯 What You'll Do

  • Provide infrastructure support to ML research and product
  • Build tooling to diagnose cluster issues and hardware failures
  • Monitor deployments, manage experiments, and support research
  • Maximize GPU allocation and utilization for serving and training

📋 Requirements

  • 4+ years of experience supporting ML infrastructure
  • Experience developing tools to diagnose ML infrastructure problems
  • Experience with cloud infrastructure such as Compute Engine, Kubernetes, and Cloud Storage
  • Experience working with GPUs

✨ Nice to Have

  • Experience with large GPU clusters and HPC/networking
  • Experience supporting large language model training
  • Experience with ML frameworks like PyTorch/TensorFlow/JAX