10h ago

Senior Site Reliability Engineer

San Francisco, CA

$170k-$230k / year (est.)

full-time · senior · ai-ml

💼 About This Role

You'll ensure the reliability and performance of Hyperbolic's GPU marketplace and AI infrastructure. You'll define SLOs, build incident response systems, and manage capacity across a distributed GPU network. This is a high-impact role that directly shapes how affordable and accessible AI compute is at scale.

🎯 What You'll Do

  • Define and maintain SLOs for job success rates
  • Design monitoring and alerting systems for infrastructure visibility
  • Build automation for capacity management and resource allocation
  • Lead incident response and post-mortem processes

📋 Requirements

  • Expert in SRE with proven experience defining and maintaining SLOs
  • Strong background in capacity planning for distributed systems
  • Deep knowledge of deployment systems, including canary releases and rollbacks
  • Proficient in observability tools such as Prometheus, Grafana, and ELK

✨ Nice to Have

  • Experience operating GPU infrastructure or AI platforms
  • Background in distributed systems or decentralized infrastructure
  • Experience with chaos engineering and resilience testing

🎁 Benefits & Perks

  • 💻 Remote-friendly with flexible hours
  • 🏥 Health insurance coverage
  • 📈 Equity in a growing startup