2h ago

Network Engineer - AI/HPC

Memphis, TN

$175k-$250k / yearest.

full-timeleadArtificial Intelligence

🛠 Tech Stack

💼 About This Role

You'll design and optimize the backend and front-end networks for xAI's hyper-scale GPU clusters, focusing on RoCEv2 as part of a small, fast-paced team. Your work will directly impact training and inference performance, ensuring no performance is left on the table. This role offers the chance to work on AI training and inference workloads and to develop at hyper scale with the latest hardware.

🎯 What You'll Do

  • Deep dive into NCCL, building metric dashboards and tweaking configurations.
  • Design next iteration of backend and front-end networks for GPU infrastructure.
  • Travel to Memphis for building capacity and participate in on-call rotation.
  • Contribute to deployment and operations frameworks to remove repetitive tasks.

📋 Requirements

  • 10+ years designing and operating large scale networks.
  • 5 years in the ethernet AI/HPC space.
  • Deep understanding of congestion control on ethernet.
  • Experience with Python to automate tasks and analyze large data sets.

✨ Nice to Have

  • Infiniband experience is a bonus.
  • Ability to use and debug NCCL.
  • Expertise in creating metrics for performance and operations.
0 0 0