Senior Site Reliability Engineer
San Francisco, CA
✨ $170k-$230k / year (est.)
full-time · senior · ai-ml
💼 About This Role
You'll ensure the reliability and performance of Hyperbolic's GPU marketplace and AI infrastructure. You'll define SLOs, build incident response systems, and manage capacity across a distributed GPU network. This is a high-impact role that directly shapes how affordable and accessible AI compute is at scale.
🎯 What You'll Do
- Define and maintain SLOs for job success rates
- Design monitoring and alerting systems for infrastructure visibility
- Build automation for capacity management and resource allocation
- Lead incident response and post-mortem processes
📋 Requirements
- Expert in SRE with proven experience defining and maintaining SLOs
- Strong background in capacity planning for distributed systems
- Deep knowledge of deployment systems, including canary releases and rollbacks
- Proficient in observability tools such as Prometheus, Grafana, and the ELK stack
✨ Nice to Have
- Experience operating GPU infrastructure or AI platforms
- Background in distributed systems or decentralized infrastructure
- Experience with chaos engineering and resilience testing
🎁 Benefits & Perks
- 💻 Remote-friendly with flexible hours
- 🏥 Health insurance coverage
- 📈 Equity in a growing startup