4h ago
Staff Site Reliability Engineer
Remote Argentina
✨ $200k-$250k / year (est.)
full-time · senior · Remote · software
💼 About This Role
You'll lead the development of AI-assisted reliability tooling and own incident response for a leading AI platform. Your work will directly shape how SRE is practiced at Domino, reducing toil and improving system reliability for enterprise customers like Johnson & Johnson and NVIDIA. You'll sit at the center of operations and engineering, building tools that make outages shorter and rarer.
🎯 What You'll Do
- Lead development of AI-assisted reliability tooling that analyzes logs, traces, and tickets.
- Improve observability coverage and signal quality for critical systems.
- Own incident response end-to-end, from detection to remediation.
- Define and mature SLO/SLI frameworks for priority services.
- Scale cloud operations for single-tenant SaaS deployments.
📋 Requirements
- Deep experience in Site Reliability Engineering or platform engineering with operational ownership.
- Fluency with Kubernetes, Linux, cloud platforms, and observability tooling.
- Strong Python or Go skills with a track record of building internal tools.
- Comfort leading technically ambiguous work and influencing across teams.
✨ Nice to Have
- Experience with LLM-based systems and retrieval workflows.
- Background in SaaS platform operations or building tooling for support teams.
🎁 Benefits & Perks
- 🏖️ Remote-first culture
- 💰 Competitive equity with leading investors like Sequoia and NVIDIA
- 📚 Growth mindset environment with emphasis on teaching and learning
📨 Hiring Process
Estimated timeline: 2-4 weeks · AI estimate
- 1. Recruiter Screen · 30 min
- 2. Technical Interview · 60 min
- 3. Hiring Manager · 45 min