11h ago

Lead Systems HPC Engineer

Remote - United States

$170k-$300k / year

full-timelead Remoteai-ml

๐Ÿ›  Tech Stack

๐Ÿ’ผ About This Role

You'll build and optimize hyperscaler GPU clusters at Nebius, focusing on performance across the full stack. Your core impact will be identifying bottlenecks and driving improvements for large-scale AI workloads. This role offers a chance to work with cutting-edge hardware and collaborate with leading vendors like NVIDIA.

๐ŸŽฏ What You'll Do

  • Analyze system behavior and identify performance bottlenecks across layers.
  • Investigate and troubleshoot GPU cluster performance under real workloads.
  • Evaluate and integrate new hardware and system configurations.
  • Contribute to hardware and cluster qualification and acceptance testing.

๐Ÿ“‹ Requirements

  • 5+ years in system-level software development focused on performance optimization.
  • 3+ years hands-on experience with Linux administration and performance tuning.
  • In-depth understanding of server architecture, PCIe, NICs, and HPC systems.
  • Strong proficiency in C/C++, Go, or Python with coding interviews.

โœจ Nice to Have

  • Experience with InfiniBand/RoCE networking.
  • Familiarity with distributed communication layers like MPI and NCCL.
  • Previous work with large-scale GPU clusters.

๐ŸŽ Benefits & Perks

  • ๐Ÿฅ 100% company-paid medical, dental, and vision for employees and families.
  • ๐Ÿ’ฐ 401(k) plan with up to 4% company match and immediate vesting.
  • ๐Ÿ‘ถ 20 weeks paid parental leave for primary caregivers, 12 for secondary.
  • ๐Ÿ’ป $85/month remote work reimbursement for mobile and internet.
  • ๐Ÿ›ก๏ธ Company-paid short-term, long-term, and life insurance.

๐Ÿ“จ Hiring Process

Estimated timeline: 2-4 weeks ยท AI estimate

  1. 1Recruiter Callยท 30 min
  2. 2Technical Interviewยท 60 min
  3. 3Coding Interviewยท 60 min
0 0 0