2d ago

ML Systems Engineer

Menlo Park

$300k-$400k / year

full-time · lead · ai-ml · Visa Sponsor

💼 About This Role

You'll build the systems layer that makes frontier model training and inference fast and tightly coupled to the RL feedback loop for scientific discovery. You'll go deep into scheduling, kernels, RDMA, and weight synchronization while working with researchers to co-design algorithms and infrastructure. The speed of the RL loop directly multiplies the pace of discovery.

🎯 What You'll Do

  • Build rack- and topology-aware scheduling for GPUs across Ray, Slurm, and Kubernetes
  • Implement direct S3 checkpoint streaming to eliminate I/O bottlenecks
  • Write and optimize communication and GPU kernels for maximum throughput
  • Design zero-copy RDMA weight synchronization between training and inference

📋 Requirements

  • Experience running large-scale inference infrastructure in production
  • Low-level systems programming with RDMA, NVLink, and kernel-level work
  • GPU cluster scheduling across Ray, Slurm, or Kubernetes
  • Writing and optimizing CUDA kernels for distributed training

✨ Nice to Have

  • Contributions to open-source ML infrastructure projects such as SGLang, Megatron-LM, or vLLM
  • Experience working directly with ML researchers on algorithm-infrastructure co-design

🎁 Benefits & Perks

  • 💰 Competitive compensation: $300k-$400k / year
  • 🏥 Health benefits (implied by startup environment)
  • 🗽 Visa sponsorship available
  • 🚀 Work at a cutting-edge AI company backed by top investors

📨 Hiring Process

Estimated timeline: 2-4 weeks · AI estimate

  1. Recruiter Phone Screen · 30 min
  2. Technical Screen · 45 min
  3. On-site Interviews · 4-5 hours