Network Site Reliability Engineer (NetSRE)

Amsterdam, Netherlands; Remote - Europe
Full-time, Senior, Remote, Cloud computing

Description

You will build and operate the network infrastructure that powers Nebius's AI cloud, setting reliability targets, automating operations, and improving observability to ensure the network scales safely and efficiently.

Requirements

  • Strong production Linux fundamentals and structured debugging
  • Solid networking basics (control vs data plane, latency, failure domains)
  • Hands-on experience operating high-availability systems and improving them iteratively
  • Ability to write software and automation in Go or Python
  • Experience with modern infrastructure tooling (IaC, CI/CD, container platforms)

Responsibilities

  • Define and own reliability goals for network services (SLIs/SLOs, error budgets)
  • Drive reliability improvements across network, site readiness, and inter-site connectivity
  • Own incident response, lead postmortems, and implement durable fixes
  • Build and evolve observability (metrics, logs, traces, alerting)
  • Design safer change workflows (automation, CI/CD, canarying, rollbacks)