Operations Engineer, Fleet Reliability

Poland
full-time, mid-level, cloud computing

Description

You will take server nodes through provisioning and validation, troubleshooting hardware and software issues to maximize the uptime of high-performance supercomputing clusters. The role involves configuring and maintaining large-scale GPU clusters, working shifts from 7 am to 9 pm, and participating in on-call rotations. Onboarding training at the US headquarters is required within the first month.

Requirements

  • 2+ years of experience in data center or on-prem infrastructure
  • Strong Linux system administration and networking knowledge
  • Ability to troubleshoot hardware and software issues
  • Bachelor's degree or equivalent experience
  • Ability to travel to the US on short notice (ESTA or B-1 visa)

Responsibilities

  • Provision and validate batches of server nodes
  • Troubleshoot node and cluster issues efficiently
  • Configure and maintain large-scale GPU clusters
  • Perform system maintenance tasks reliably
  • Participate in on-call rotations including after-hours and weekends