9h ago

Platform Support Engineer

Remote

โœจ $90k-$130k / yearest.

full-timesenior Remoteai-ml

๐Ÿ›  Tech Stack

๐Ÿ’ผ About This Role

You'll work directly with ML engineers debugging complex distributed systems and GPU infrastructure issues. Your core impact is improving reliability for large-scale AI workloads across Kubernetes and cloud platforms. This role offers exposure to cutting-edge AI infrastructure and high-impact incident response.

๐ŸŽฏ What You'll Do

  • Diagnose distributed training and inference failures
  • Troubleshoot Kubernetes, GPU orchestration, and networking issues
  • Analyze logs, metrics, and traces to isolate root causes
  • Build internal tooling, automation, and runbooks

๐Ÿ“‹ Requirements

  • Kubernetes and containerized environments experience
  • PyTorch, CUDA, or NCCL hands-on experience
  • Strong Linux systems and networking knowledge
  • Experience with observability tools like Prometheus/Grafana

โœจ Nice to Have

  • Experience with Ray, Kubeflow, or Slurm
  • Familiarity with InfiniBand or high-performance networking
  • Background in AI infrastructure or MLOps companies

๐ŸŽ Benefits & Perks

  • ๐Ÿ–๏ธ Unlimited PTO
  • ๐Ÿ  Remote-first culture
  • ๐Ÿ“ˆ Equity options
  • ๐Ÿ’ป Home office stipend
  • ๐Ÿง˜ Wellness benefits

๐Ÿ“จ Hiring Process

Estimated timeline: 2-3 weeks ยท AI estimate

  1. 1Recruiter callยท 30 min
  2. 2Technical interviewยท 60 min
  3. 3Hiring manager interviewยท 45 min
0 0 0