8h ago

Site Reliability Engineer

Arlington, VA

$100.2k-$203.4k / year

full-timegovernment

๐Ÿ›  Tech Stack

๐Ÿ’ผ About This Role

You'll ensure reliability, scalability, and monitoring of enterprise AI systems within a Hub-and-Spoke architecture at Accenture Federal Services. You'll lead incident response, performance optimization, and capacity planning for mission-critical applications and AI governance.

๐ŸŽฏ What You'll Do

  • Ensure reliability, scalability, and performance of enterprise AI systems
  • Lead incident response efforts to minimize downtime
  • Implement SLOs/SLAs, capacity planning, and performance optimization
  • Operate observability platforms using OpenTelemetry, Prometheus, Grafana, Loki, Tempo
  • Drive FinOps practices to optimize operational costs

๐Ÿ“‹ Requirements

  • Experience with OpenTelemetry, Prometheus, Grafana, Loki, and Tempo
  • Hands-on experience with SLO/SLA management, FinOps practices, and advanced monitoring
  • Experience with reliability engineering, incident response, and FinOps
  • Must be a U.S. citizen with an active TS/SCI clearance

โœจ Nice to Have

  • Exposure to complex integration efforts and continuous delivery pipelines
  • Experience in mission-focused operational environments

๐ŸŽ Benefits & Perks

  • ๐Ÿ–๏ธ Unlimited PTO
  • ๐Ÿฅ Health insurance
  • ๐Ÿ“š Certification reimbursement
  • ๐Ÿ’ฐ 401(k) matching

๐Ÿ“จ Hiring Process

Estimated timeline: 2-4 weeks ยท AI estimate

  1. 1Recruiter Callยท 30 min
  2. 2Technical Interviewยท 60 min
  3. 3Hiring Manager Interviewยท 45 min

This description was AI-summarized. View original

0 0 0