ML Infra Engineer at Physical Intelligence

5h ago

San Francisco

✨ $220k-$350k / year est.

full-time, ai-ml

💼 About This Role

You'll help scale and optimize training systems and core model code, owning critical infrastructure for large-scale training. You'll work closely with researchers to translate ideas into experiments and production training runs. This is a hands-on, high-leverage role at the intersection of ML, software engineering, and scalable infrastructure.

🎯 What You'll Do

Own training/inference infrastructure: scheduling, job management, checkpointing, metrics/logging.
Scale distributed training across TPU and GPU clusters with JAX.
Profile and optimize memory usage, device utilization, and throughput.
Build abstractions for launching, monitoring, debugging, and reproducing experiments.

📋 Requirements

Strong software engineering fundamentals and experience building ML training infrastructure.
Hands-on large-scale training experience in JAX (preferred) or PyTorch.
Experience managing training workloads on cloud infrastructure (e.g., Kubernetes, GCP TPU/GKE, AWS).
Ability to debug and optimize performance bottlenecks across the training stack.

✨ Nice to Have

Deep ML systems background (training compilers, runtime optimization, custom kernels).
Experience operating close to hardware (GPU/TPU performance tuning).
Background in robotics, multimodal models, or large-scale foundation models.

🎁 Benefits & Perks

🏖️ Unlimited PTO
💰 Competitive salary and equity
🏥 Health, dental, and vision insurance
🍱 Daily meals and snacks
🚀 Opportunity to work on cutting-edge AI
