2d ago
ML Systems Engineer
Menlo Park
$300k-$400k / year
Full-time · Lead · AI/ML · Visa Sponsor
About This Role
You'll build the systems layer that makes frontier model training and inference fast and tightly coupled to the RL feedback loop for scientific discovery. You'll go deep into scheduling, kernels, RDMA, and weight synchronization while working with researchers to co-design algorithms and infrastructure. The speed of the RL loop directly multiplies the pace of discovery.
What You'll Do
- Build rack- and topology-aware scheduling for GPUs across Ray, Slurm, and Kubernetes
- Implement direct S3 checkpoint streaming to eliminate I/O bottlenecks
- Write and optimize communication and GPU kernels for maximum throughput
- Design zero-copy RDMA weight synchronization between training and inference
Requirements
- Experience with large-scale inference infrastructure at production scale
- Low-level systems programming with RDMA, NVLink, and kernel-level work
- GPU cluster scheduling across Ray, Slurm, or Kubernetes
- Writing and optimizing CUDA kernels for distributed training
Nice to Have
- Contributions to open-source ML infrastructure projects such as SGLang, Megatron-LM, and vLLM
- Experience working directly with ML researchers on algorithm-infrastructure co-design
Benefits & Perks
- Competitive compensation: $300k-$400k range
- Health benefits (implied by startup environment)
- Visa sponsorship available
- Work at a cutting-edge AI company backed by top investors
Hiring Process
Estimated timeline: 2-4 weeks · AI estimate
- 1. Recruiter Phone Screen · 30 min
- 2. Technical Screen · 45 min
- 3. On-site Interviews · 4-5 hours