2h ago
Senior Site Reliability Engineer
Portugal
Full-time, Senior, Remote, Document workflow automation
Description
You will own the incident management process, maintain observability tools, keep production applications running via on-call rotation, develop automation for platform reliability, and contribute to production services with a focus on performance and resiliency. You will also collaborate with product engineers and mentor team members to foster SRE principles.
Requirements
- Solid programming experience in Python (Django, AsyncIO) and/or Java (Spring Boot)
- Experience maintaining an observability tool suite (Loki, Grafana, Tempo, Mimir)
- Experience developing and maintaining Python services in production
- Strong experience with AWS and Kubernetes
- Proficiency in relational databases (PostgreSQL) and messaging systems (RabbitMQ, NATS, Kafka)
Responsibilities
- Own and influence the incident management process end-to-end
- Maintain and evolve the on-prem observability stack (LGTM)
- Participate in on-call rotation to keep production applications running
- Develop automations and tools to support platform reliability
- Contribute to production services with performance and resiliency in mind