Senior Software Engineer, Reliability Engineering
São Paulo, Brazil
Full-time, Senior, Hybrid, Travel
Description
You will develop and maintain tools and systems that enable engineering teams to operate services reliably at scale. You will also serve as an Incident Commander during high-severity incidents, guiding cross-functional teams to minimize impact on customers and the business.
Requirements
- Bachelor's degree in Computer Science or related field
- 5+ years of experience in software engineering or SRE with large-scale distributed systems
- Strong coding skills in Java, Python, or Go
- Experience with distributed systems and service-oriented architectures
- Experience with cloud platforms like AWS or Google Cloud Platform
- Experience with containerization technologies like Docker and Kubernetes
- Excellent problem-solving and analytical skills
- Strong communication and interpersonal skills
- Fluent in English (Professional Level)
Responsibilities
- Design, implement, and maintain tools for service reliability, monitoring, and alerting
- Collaborate with engineering teams to ensure services are designed with reliability in mind
- Identify and drive improvements to reliability, scalability, and efficiency
- Develop tools to help infrastructure engineers manage operational challenges
- Participate in incident response and post-mortems to address systemic issues
- Evaluate new technologies and industry best practices for SRE tooling and incident response
- Lead high-urgency incidents and mentor less-experienced engineers