4h ago

Staff Software Engineer, Platform

Remote
full-time, senior, Remote, nonprofit political technology

Description

You will serve as a technical leader on the SRE team, owning reliability initiatives, driving observability strategy, and building incident response practices to make ActBlue's systems resilient at scale. You'll collaborate across teams to reduce systemic risk and advance SLO-based reliability.

Requirements

  • 8+ years of experience in SRE, DevOps, or systems/infrastructure engineering
  • Deep expertise in observability tooling (Datadog: APM, RUM, DBM, dashboards, SLOs, alerting)
  • Strong command of Kubernetes and cloud-native infrastructure (EKS via Flux on AWS)
  • Experience defining and operating SLIs and SLOs in production environments
  • Demonstrated ability to lead cross-functional reliability initiatives and build organizational buy-in

Responsibilities

  • Own and drive SRE technical strategy in observability, incident management, reliability engineering, and platform operations
  • Lead architecture decisions for monitoring, alerting, and SLO frameworks; contribute to org-wide RFCs
  • Provide L2 on-call support for complex incidents; build incident response capability across teams
  • Lead multi-quarter SRE initiatives with cross-team dependencies (e.g., observability buildout, on-call training)
  • Define and maintain SLIs/SLOs for tier-1 business flows; contribute to the multi-year reliability roadmap