10h ago

Site Reliability Engineer

New York, NY

$160k-$220k / year (est.)

full-time · senior · Remote · ai-ml

💼 About This Role

You'll design and maintain scalable, highly available infrastructure for our AI platform and customer-facing applications, working closely with software engineers and research teams. Your core impact is ensuring system reliability, performance, and seamless operations across distributed environments. This role offers exposure to cutting-edge AI/ML workloads and HPC clusters.

🎯 What You'll Do

  • Design and maintain scalable, fault-tolerant infrastructure
  • Operate production systems and troubleshoot issues
  • Implement monitoring, alerting, and incident response systems
  • Drive automation in deployment and orchestration

📋 Requirements

  • Master's degree in Computer Science or related field
  • 7+ years of experience in a DevOps/SRE role
  • Strong experience with cloud computing and distributed systems
  • Hands-on experience with CI/CD, Docker, Kubernetes

✨ Nice to Have

  • Experience in an AI/ML environment
  • Experience with HPC systems and Slurm
  • Experience with AI-oriented cloud solutions

📨 Hiring Process

Estimated timeline: 2-4 weeks · AI estimate

  1. Recruiter Screen · 30 min
  2. Technical Interview · 60 min
  3. Onsite Interview · 2 hours