1d ago

Staff Software Engineer, Node Infra

San Francisco, CA

$320k-$405k / year

full-timelead Hybridai-ml Visa Sponsor

๐Ÿ›  Tech Stack

๐Ÿ’ผ About This Role

You'll own the technical strategy for node lifecycle management at a leading AI company, ensuring reliable infrastructure for frontier AI research. Your work will directly impact the scalability and uptime of clusters powering millions of users. This role offers deep technical ownership and intersection of systems and ML.

๐ŸŽฏ What You'll Do

  • Own technical strategy for node lifecycle management
  • Design and operate automated hardware health detection and remediation
  • Drive cross-team initiatives to scale AI clusters across cloud providers
  • Set infrastructure architecture and establish operational excellence practices

๐Ÿ“‹ Requirements

  • Deep expertise in distributed systems and cloud platforms (Kubernetes, AWS/GCP/Azure)
  • Strong proficiency in a systems language like Rust, Go, or Python
  • Hands-on experience with ML accelerators (GPUs, TPUs, Trainium)
  • Track record leading complex, multi-quarter technical initiatives

โœจ Nice to Have

  • 8+ years of software engineering experience as a technical lead
  • Experience managing 10K+ node infrastructure
  • Low-level systems experience: kernel, virtualization, device drivers

๐ŸŽ Benefits & Perks

  • ๐Ÿ–๏ธ Flexible PTO
  • ๐Ÿฅ Comprehensive health insurance
  • ๐Ÿ’ฐ Competitive equity package
  • ๐Ÿ“š Learning and development budget
  • ๐Ÿฝ๏ธ Daily meals and snacks

๐Ÿ“จ Hiring Process

Estimated timeline: 2-4 weeks ยท AI estimate

  1. 1Recruiter Screenยท 30 min
  2. 2Technical Phone Screenยท 45 min
  3. 3On-site Interviewsยท 4 hours
0 0 0