Senior Site Reliability Engineer

Portugal
Full-time · Senior · Remote · Document workflow automation

Description

You will own the incident management process, maintain the observability tooling, keep production applications running as part of the on-call rotation, develop automation for platform reliability, and contribute to production services with a focus on performance and resiliency. You will also collaborate with product engineers and mentor team members to promote SRE principles.

Requirements

  • Solid programming experience in Python (Django, AsyncIO) and/or Java (Spring Boot)
  • Experience maintaining an observability tool suite (Loki, Grafana, Tempo, Mimir)
  • Experience developing and maintaining Python services in production
  • Strong experience with AWS and Kubernetes
  • Proficiency in relational databases (PostgreSQL) and messaging systems (RabbitMQ, NATS, Kafka)

Responsibilities

  • Own and influence the incident management process end-to-end
  • Maintain and evolve the on-prem observability stack (LGTM: Loki, Grafana, Tempo, Mimir)
  • Participate in on-call rotation to keep production applications running
  • Develop automations and tools to support platform reliability
  • Contribute to production services with performance and resiliency in mind