about 4 hours ago
Site Reliability Engineer
Nepal
full-time, mid-level, healthcare technology
Description
You will support production operations, incident response, and post-launch reliability for a healthcare data platform. You'll work on complex issues involving AWS, Databricks, and data pipelines, debug live systems, and write production code to improve reliability.
Requirements
- 3-5 years of experience in software engineering, SRE, sustaining engineering, or production operations.
- Hands-on experience supporting production systems in AWS.
- Experience troubleshooting large-scale data platforms or Databricks.
- Proficiency in Python and experience building or supporting production services or tooling.
- Working knowledge of distributed systems fundamentals; incident response and root cause analysis (RCA) practices; monitoring, alerting, and observability; CI/CD pipelines; and Infrastructure as Code.
Responsibilities
- Act as a technical escalation point during production incidents, leading triage, mitigation, and recovery efforts.
- Drive root cause analysis (RCA) and ensure follow-up remediation and reliability improvements.
- Support post-launch reliability by investigating and resolving production defects and field-reported issues.
- Partner with Engineering, Data, and Customer Success teams to support customers during incidents.
- Improve operational readiness through runbooks, monitoring, alerting, and incident response practices.
- Work directly with customers on complex deployments, integrations, and production troubleshooting.
- Write production-quality code (primarily Python) to automate operational workflows and improve observability.
- Support and troubleshoot AWS- and Databricks-based production systems.