Senior Site Reliability Engineer — Token Factory (Inference Platform)

Amsterdam, Netherlands; Berlin, Germany; London, United Kingdom; Prague, Czech Republic; Remote - Europe; Remote - United States; United States

Full-time; Senior; Remote; Cloud Computing / AI Infrastructure

Description

You will own the reliability, performance, and observability of the entire inference stack at Nebius Cloud: designing telemetry pipelines, tuning Kubernetes autoscalers, and hardening request routing so the platform behaves flawlessly under extreme load. Your work directly supports a platform that serves text, vision, audio, and multimodal models at massive scale.

Requirements

Deep fluency with Kubernetes, Prometheus, Grafana, and Terraform
Proficiency in Python or Bash scripting
Experience with alert design and SLOs for high-throughput APIs
Experience with GPU-heavy workloads (vLLM, Triton, Ray, or similar)
Background in MLOps or model-hosting platforms

Responsibilities

Design and refine telemetry pipelines for metrics, logs, and traces
Tune Kubernetes autoscalers for GPU efficiency
Craft Terraform modules for resilient cluster deployments
Harden request-routing and retry logic for transparent failure handling
Drive incident detection, isolation, remediation, and post-mortem culture
