10h ago
Site Reliability Engineer
New York, NY
✨ $160k-$220k / year (est.)
full-time · senior · Remote · ai-ml
💼 About This Role
You'll design and maintain scalable, highly available infrastructure for our AI platform and customer-facing applications, working closely with software engineers and research teams. Your core impact is ensuring system reliability, performance, and seamless operations across distributed environments. This role offers exposure to cutting-edge AI/ML workloads and HPC clusters.
🎯 What You'll Do
- Design and maintain scalable, fault-tolerant infrastructure
- Operate production systems and troubleshoot issues
- Implement monitoring, alerting, and incident response systems
- Drive automation in deployment and orchestration
📋 Requirements
- Master's degree in Computer Science or related field
- 7+ years of experience in a DevOps/SRE role
- Strong experience with cloud computing and distributed systems
- Hands-on experience with CI/CD, Docker, Kubernetes
✨ Nice to Have
- Experience in an AI/ML environment
- Experience with HPC systems and Slurm
- Experience working with AI-oriented cloud solutions
📨 Hiring Process
Estimated timeline: 2-4 weeks · AI estimate
- 1. Recruiter Screen · 30 min
- 2. Technical Interview · 60 min
- 3. Onsite Interview · 2 hours