11h ago
Lead Systems HPC Engineer
Remote - United States
$170k-$300k / year
full-timelead Remoteai-ml
๐ Tech Stack
๐ผ About This Role
You'll build and optimize hyperscaler GPU clusters at Nebius, focusing on performance across the full stack. Your core impact will be identifying bottlenecks and driving improvements for large-scale AI workloads. This role offers a chance to work with cutting-edge hardware and collaborate with leading vendors like NVIDIA.
๐ฏ What You'll Do
- Analyze system behavior and identify performance bottlenecks across layers.
- Investigate and troubleshoot GPU cluster performance under real workloads.
- Evaluate and integrate new hardware and system configurations.
- Contribute to hardware and cluster qualification and acceptance testing.
๐ Requirements
- 5+ years in system-level software development focused on performance optimization.
- 3+ years hands-on experience with Linux administration and performance tuning.
- In-depth understanding of server architecture, PCIe, NICs, and HPC systems.
- Strong proficiency in C/C++, Go, or Python with coding interviews.
โจ Nice to Have
- Experience with InfiniBand/RoCE networking.
- Familiarity with distributed communication layers like MPI and NCCL.
- Previous work with large-scale GPU clusters.
๐ Benefits & Perks
- ๐ฅ 100% company-paid medical, dental, and vision for employees and families.
- ๐ฐ 401(k) plan with up to 4% company match and immediate vesting.
- ๐ถ 20 weeks paid parental leave for primary caregivers, 12 for secondary.
- ๐ป $85/month remote work reimbursement for mobile and internet.
- ๐ก๏ธ Company-paid short-term, long-term, and life insurance.
๐จ Hiring Process
Estimated timeline: 2-4 weeks ยท AI estimate
- 1Recruiter Callยท 30 min
- 2Technical Interviewยท 60 min
- 3Coding Interviewยท 60 min
0 0 0