6h ago

ML Infra Engineer

San Francisco

$175k-$265k / year (est.)

full-time, senior, ai-ml

💼 About This Role

You'll scale and optimize training systems and core model code, owning critical infrastructure for large-scale training. You'll work with researchers and model engineers to translate ideas into experiments and production runs. The role sits at the intersection of ML, software engineering, and scalable infrastructure.

🎯 What You'll Do

  • Design and maintain scheduling and job-management systems for large-scale model training
  • Scale distributed JAX training across TPU and GPU clusters
  • Profile and optimize memory usage, device utilization, and throughput
  • Build abstractions for launching, monitoring, and debugging experiments

📋 Requirements

  • Strong software engineering fundamentals with experience in JAX or PyTorch
  • Hands-on large-scale training experience
  • Familiarity with distributed training, multi-host setups, and evaluation pipelines
  • Experience managing training workloads with cluster orchestrators and schedulers like Kubernetes or SLURM

✨ Nice to Have

  • Deep ML systems background (training compilers, runtime optimization)
  • Experience with GPU/TPU performance tuning
  • Background in robotics or multimodal models