4h ago

Staff Site Reliability Engineer

Remote Argentina

$200k-$250k / year (est.)

full-time · senior · Remote · software

💼 About This Role

You'll lead the development of AI-assisted reliability tooling and own incident response for a leading AI platform. Your work will directly shape how SRE is practiced at Domino, reducing toil and improving system reliability for enterprise customers like Johnson & Johnson and NVIDIA. You'll sit at the center of operations and engineering, building tools that make outages shorter and rarer.

🎯 What You'll Do

  • Lead development of AI-assisted reliability tooling that analyzes logs, traces, and tickets.
  • Improve observability coverage and signal quality for critical systems.
  • Own incident response end-to-end, from detection to remediation.
  • Define and mature SLO/SLI frameworks for priority services.
  • Scale cloud operations for single-tenant SaaS deployments.

📋 Requirements

  • Deep experience in Site Reliability Engineering or platform engineering with operational ownership.
  • Fluency with Kubernetes, Linux, cloud platforms, and observability tooling.
  • Strong Python or Go skills with a track record of building internal tools.
  • Comfort leading technically ambiguous work and influencing across teams.

✨ Nice to Have

  • Experience with LLM-based systems and retrieval workflows.
  • Background in SaaS platform operations or building tooling for support teams.

🎁 Benefits & Perks

  • 🏖️ Remote-first culture
  • 💰 Competitive equity with leading investors like Sequoia and NVIDIA
  • 📚 Growth mindset environment with emphasis on teaching and learning

📨 Hiring Process

Estimated timeline: 2-4 weeks · AI estimate

  1. Recruiter Screen · 30 min
  2. Technical Interview · 60 min
  3. Hiring Manager · 45 min