5h ago

ML Infra Engineer

San Francisco

$220k-$350k / year (est.)

full-time · ai-ml

💼 About This Role

You'll help scale and optimize training systems and core model code, owning critical infrastructure for large-scale training. You'll work closely with researchers to translate ideas into experiments and production training runs. This is a hands-on, high-leverage role at the intersection of ML, software engineering, and scalable infrastructure.

🎯 What You'll Do

  • Own training/inference infrastructure: scheduling, job management, checkpointing, metrics/logging.
  • Scale distributed training across TPU and GPU clusters with JAX.
  • Profile and optimize memory usage, device utilization, and throughput.
  • Build abstractions for launching, monitoring, debugging, and reproducing experiments.

📋 Requirements

  • Strong software engineering fundamentals and experience building ML training infrastructure.
  • Hands-on large-scale training experience in JAX (preferred) or PyTorch.
  • Experience managing training workloads on cloud platforms (e.g., Kubernetes, GCP TPU/GKE, AWS).
  • Ability to debug and optimize performance bottlenecks across the training stack.

✨ Nice to Have

  • Deep ML systems background (training compilers, runtime optimization, custom kernels).
  • Experience operating close to hardware (GPU/TPU performance tuning).
  • Background in robotics, multimodal models, or large-scale foundation models.

🎁 Benefits & Perks

  • 🏖️ Unlimited PTO
  • 💰 Competitive salary and equity
  • 🏥 Health, dental, and vision insurance
  • 🍱 Daily meals and snacks
  • 🚀 Opportunity to work on cutting-edge AI