9h ago
Platform Support Engineer
Remote
โจ $90k-$130k / yearest.
full-timesenior Remoteai-ml
๐ Tech Stack
๐ผ About This Role
You'll work directly with ML engineers debugging complex distributed systems and GPU infrastructure issues. Your core impact is improving reliability for large-scale AI workloads across Kubernetes and cloud platforms. This role offers exposure to cutting-edge AI infrastructure and high-impact incident response.
๐ฏ What You'll Do
- Diagnose distributed training and inference failures
- Troubleshoot Kubernetes, GPU orchestration, and networking issues
- Analyze logs, metrics, and traces to isolate root causes
- Build internal tooling, automation, and runbooks
๐ Requirements
- Kubernetes and containerized environments experience
- PyTorch, CUDA, or NCCL hands-on experience
- Strong Linux systems and networking knowledge
- Experience with observability tools like Prometheus/Grafana
โจ Nice to Have
- Experience with Ray, Kubeflow, or Slurm
- Familiarity with InfiniBand or high-performance networking
- Background in AI infrastructure or MLOps companies
๐ Benefits & Perks
- ๐๏ธ Unlimited PTO
- ๐ Remote-first culture
- ๐ Equity options
- ๐ป Home office stipend
- ๐ง Wellness benefits
๐จ Hiring Process
Estimated timeline: 2-3 weeks ยท AI estimate
- 1Recruiter callยท 30 min
- 2Technical interviewยท 60 min
- 3Hiring manager interviewยท 45 min
0 0 0