Senior Site Reliability Engineer — Token Factory (Inference Platform)

Amsterdam, Netherlands; Berlin, Germany; London, United Kingdom; Prague, Czech Republic; Remote - Europe; Remote - United States; United States
Full-time; Senior; Remote; Cloud Computing / AI Infrastructure

Description

You will own the reliability, performance, and observability of the entire inference stack at Nebius Cloud: designing telemetry pipelines, tuning Kubernetes autoscalers, and hardening request routing so the platform stays dependable under extreme load. Your work directly underpins a platform that deploys text, vision, audio, and multimodal models at massive scale.

Requirements

  • Deep fluency with Kubernetes, Prometheus, Grafana, and Terraform
  • Proficiency in Python or Bash scripting
  • Experience with alert design and SLOs for high-throughput APIs
  • Experience with GPU-heavy workloads (vLLM, Triton, Ray, or similar)
  • Background in MLOps or model-hosting platforms

Responsibilities

  • Design and refine telemetry pipelines for metrics, logs, and traces
  • Tune Kubernetes autoscalers for GPU efficiency
  • Craft Terraform modules for resilient cluster deployments
  • Harden request-routing and retry logic for transparent failure handling
  • Drive incident detection, isolation, remediation, and post-mortem culture