about 4 hours ago

Site Reliability Engineer

Nepal
full-time, mid, healthcare technology

Description

You will support production operations, incident response, and post-launch reliability for a healthcare data platform. You will debug live systems, work on complex issues involving AWS, Databricks, and data pipelines, and write production code to improve reliability.

Requirements

  • 3-5 years of experience in software engineering, SRE, sustaining engineering, or production operations.
  • Hands-on experience supporting production systems in AWS.
  • Experience troubleshooting large-scale data platforms or Databricks.
  • Proficiency in Python and experience building or supporting production services or tooling.
  • Working knowledge of distributed systems fundamentals; incident response and root cause analysis (RCA) practices; monitoring, alerting, and observability; CI/CD pipelines; and Infrastructure as Code.

Responsibilities

  • Act as a technical escalation point during production incidents, leading triage, mitigation, and recovery efforts.
  • Drive root cause analysis (RCA) and ensure that follow-up remediation and reliability improvements are completed.
  • Support post-launch reliability by investigating and resolving production defects and field-reported issues.
  • Partner with Engineering, Data, and Customer Success teams to support customers during incidents.
  • Improve operational readiness through runbooks, monitoring, alerting, and incident response practices.
  • Work directly with customers on complex deployments, integrations, and production troubleshooting.
  • Write production-quality code (primarily Python) to automate operational workflows and improve observability.
  • Support and troubleshoot AWS- and Databricks-based production systems.