2d ago
DevOps / Site Reliability Engineer
Remote
$140k-$180k / year (est.)
full-time · mid · Remote · ai-ml
About This Role
You'll own and scale cloud infrastructure on AWS, managing Kubernetes clusters and CI/CD pipelines. You'll ensure reliability and performance of production systems supporting AI data pipelines. This role offers direct impact on frontier AI model training infrastructure.
What You'll Do
- Own cloud infrastructure on AWS (EC2, EKS, RDS, S3, IAM, VPC)
- Manage Kubernetes clusters and container orchestration
- Build and maintain CI/CD pipelines using GitHub Actions
- Implement monitoring, alerting, and observability stacks
- Automate infrastructure with Terraform or similar IaC tools
Requirements
- 3-5 years in DevOps, SRE, or infrastructure engineering
- Strong AWS experience: EKS, EC2, RDS, S3, IAM
- Kubernetes: deployment, scaling, and troubleshooting in production
- CI/CD pipelines: GitHub Actions, ArgoCD, or similar
Nice to Have
- Experience supporting ML training workloads or GPU clusters
- Familiarity with distributed computing or large-scale data pipelines
- Open-source contributions or published technical writing
Benefits & Perks
- Competitive compensation and meaningful equity
- Remote-friendly environment with low bureaucracy
- Direct impact on frontier AI model training and evaluation
- Small, high-caliber team with deep AI research expertise
- Health, wellness, and learning & development benefits
Hiring Process
Estimated timeline: 2-4 weeks · AI estimate
- 1. Recruiter Screen · 30 min
- 2. Technical Interview · 60 min
- 3. Hiring Manager · 45 min