2d ago

DevOps / Site Reliability Engineer

Remote

$140k-$180k / year (est.)

Full-time · Mid-level · Remote · AI/ML

About This Role

You'll own and scale cloud infrastructure on AWS, managing Kubernetes clusters and CI/CD pipelines. You'll ensure reliability and performance of production systems supporting AI data pipelines. This role offers direct impact on frontier AI model training infrastructure.

What You'll Do

  • Own cloud infrastructure on AWS (EC2, EKS, RDS, S3, IAM, VPC)
  • Manage Kubernetes clusters and container orchestration
  • Build and maintain CI/CD pipelines using GitHub Actions
  • Implement monitoring, alerting, and observability stacks
  • Automate infrastructure with Terraform or similar IaC tools

Requirements

  • 3–5 years in DevOps, SRE, or infrastructure engineering
  • Strong AWS experience: EKS, EC2, RDS, S3, IAM
  • Kubernetes: deployment, scaling, and troubleshooting in production
  • CI/CD pipelines: GitHub Actions, ArgoCD, or similar

Nice to Have

  • Experience supporting ML training workloads or GPU clusters
  • Familiarity with distributed computing or large-scale data pipelines
  • Open-source contributions or published technical writing

Benefits & Perks

  • Competitive compensation and meaningful equity
  • Remote-friendly environment with low bureaucracy
  • Direct impact on frontier AI model training and evaluation
  • Small, high-caliber team with deep AI research expertise
  • Health, wellness, and learning & development benefits

Hiring Process

Estimated timeline: 2-4 weeks · AI estimate

  1. Recruiter Screen · 30 min
  2. Technical Interview · 60 min
  3. Hiring Manager · 45 min