8h ago
Site Reliability Engineer
Arlington, VA
$100.2k-$203.4k / year
full-timegovernment
๐ Tech Stack
๐ผ About This Role
You'll ensure reliability, scalability, and monitoring of enterprise AI systems within a Hub-and-Spoke architecture at Accenture Federal Services. You'll lead incident response, performance optimization, and capacity planning for mission-critical applications and AI governance.
๐ฏ What You'll Do
- Ensure reliability, scalability, and performance of enterprise AI systems
- Lead incident response efforts to minimize downtime
- Implement SLOs/SLAs, capacity planning, and performance optimization
- Operate observability platforms using OpenTelemetry, Prometheus, Grafana, Loki, Tempo
- Drive FinOps practices to optimize operational costs
๐ Requirements
- Experience with OpenTelemetry, Prometheus, Grafana, Loki, and Tempo
- Hands-on experience with SLO/SLA management, FinOps practices, and advanced monitoring
- Experience with reliability engineering, incident response, and FinOps
- Must be a U.S. citizen with an active TS/SCI clearance
โจ Nice to Have
- Exposure to complex integration efforts and continuous delivery pipelines
- Experience in mission-focused operational environments
๐ Benefits & Perks
- ๐๏ธ Unlimited PTO
- ๐ฅ Health insurance
- ๐ Certification reimbursement
- ๐ฐ 401(k) matching
๐จ Hiring Process
Estimated timeline: 2-4 weeks ยท AI estimate
- 1Recruiter Callยท 30 min
- 2Technical Interviewยท 60 min
- 3Hiring Manager Interviewยท 45 min
This description was AI-summarized. View original
0 0 0