Senior Site Reliability Engineer at Hyperbolic Labs

10h ago

San Francisco, CA

✨ $170k-$230k / year (est.)

full-time, senior, ai-ml

💼 About This Role

You'll ensure the reliability and performance of Hyperbolic's GPU marketplace and AI infrastructure. You'll define SLOs, build incident response systems, and manage capacity across a distributed GPU network. This is a high-impact role that directly influences how affordable and accessible AI compute is at scale.

🎯 What You'll Do

Define and maintain SLOs for job success rates
Design monitoring and alerting systems for infrastructure visibility
Build automation for capacity management and resource allocation
Lead incident response and post-mortem processes

📋 Requirements

Expert in SRE with proven experience defining and maintaining SLOs
Strong background in capacity planning for distributed systems
Deep knowledge of deployment systems including canary and rollback
Proficient in observability tools like Prometheus, Grafana, ELK

✨ Nice to Have

Experience operating GPU infrastructure or AI platforms
Background in distributed systems or decentralized infrastructure
Experience with chaos engineering and resilience testing

🎁 Benefits & Perks

💻 Remote-friendly with flexible hours
🏥 Health insurance coverage
📈 Equity in a growing startup
