Platform Support Engineer at Lightning AI

9h ago

Platform Support Engineer

Remote

✨ $90k-$130k / year (est.)

full-time · senior · Remote · ai-ml

💼 About This Role

You'll work directly with ML engineers debugging complex distributed systems and GPU infrastructure issues. Your core impact is improving reliability for large-scale AI workloads across Kubernetes and cloud platforms. This role offers exposure to cutting-edge AI infrastructure and high-impact incident response.

🎯 What You'll Do

Diagnose distributed training and inference failures
Troubleshoot Kubernetes, GPU orchestration, and networking issues
Analyze logs, metrics, and traces to isolate root causes
Build internal tooling, automation, and runbooks

📋 Requirements

Kubernetes and containerized environments experience
PyTorch, CUDA, or NCCL hands-on experience
Strong Linux systems and networking knowledge
Experience with observability tools like Prometheus/Grafana

✨ Nice to Have

Experience with Ray, Kubeflow, or Slurm
Familiarity with InfiniBand or high-performance networking
Background in AI infrastructure or MLOps companies

🎁 Benefits & Perks

🏖️ Unlimited PTO
🏠 Remote-first culture
📈 Equity options
💻 Home office stipend
🧘 Wellness benefits

📨 Hiring Process

Estimated timeline: 2-3 weeks · AI estimate

1. Recruiter call · 30 min
2. Technical interview · 60 min
3. Hiring manager interview · 45 min

Jobs at Lightning AI

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems—designed to take ideas from research to production with less friction. We recently merged with Voltage Park to create the first cloud built for AI! We operate globally in New York City, San Francisco, Seattle, and London, and…
