5h ago

Senior Production Engineer

Warsaw, Poland

$250,000-$428,000 / year

Full-time · Senior · Cloud Computing / AI Infrastructure

Description

As a Production Engineer at CoreWeave, you will be responsible for maintaining the reliability and stability of our cloud infrastructure. You will monitor system performance, respond to incidents, collaborate with teams to improve platform resilience, and implement automation to reduce manual intervention. This role offers the opportunity to work with cutting-edge AI cloud technology and grow your expertise in incident management and system reliability.

Requirements

  • 5+ years of experience in cloud operations, site reliability engineering (SRE), or related technical roles.
  • Strong understanding of cloud platforms and container orchestration (e.g., AWS, GCP, Kubernetes) and cloud infrastructure.
  • Expertise in scripting or automation tools such as Python, Bash, Terraform, or Ansible.
  • Familiarity with incident management frameworks and practices, such as ITIL or SRE best practices.
  • Experience with monitoring and alerting tools including Prometheus and Grafana.

Responsibilities

  • Assist in incident response: identify and resolve service disruptions, and document root cause analyses (RCAs) and post-incident reviews (PIRs).
  • Monitor system performance and health using Prometheus and Grafana to identify potential incidents.
  • Implement automation to reduce manual intervention.
  • Collaborate across teams to improve platform reliability and resilience.
  • Refine incident response playbooks.