Senior Site Reliability Engineer, Production Engineering

Costa Mesa, California, United States
full-time, senior, defense technology

Description

You will design and implement monitoring and observability systems, drive incident response, build infrastructure automation, establish SLOs, and partner with software teams to improve reliability for the Lattice platform at Anduril Industries, a defense technology company. Your work will directly support mission-critical systems for U.S. and allied military capabilities.

Requirements

  • 7+ years of engineering experience, including at least 3 years in SRE, production operations, or infrastructure engineering
  • Deep expertise with Kubernetes in production environments (100+ nodes)
  • Strong programming skills in Go, Python, Rust, or Java
  • Proven experience with observability stacks (Prometheus, Grafana, ELK/EFK)
  • Hands-on experience with cloud platforms (AWS, Azure, or GCP) and infrastructure as code

Responsibilities

  • Design and implement monitoring, observability, and alerting systems
  • Drive incident response and conduct blameless postmortems
  • Build and maintain infrastructure automation using Terraform, Kubernetes operators, and custom tooling
  • Establish and track Service Level Objectives (SLOs) and error budgets
  • Develop capacity planning models and performance testing frameworks