Senior Site Reliability Engineer — Token Factory (Inference Platform)
Amsterdam, Netherlands; Berlin, Germany; London, United Kingdom; Prague, Czech Republic; Remote - Europe; Remote - United States; United States
Full-time; Senior; Remote; Cloud Computing / AI Infrastructure
Description
You will own the reliability, performance, and observability of the entire inference stack at Nebius Cloud: designing telemetry pipelines, tuning Kubernetes autoscalers, and hardening request routing so the platform stays dependable under extreme load. Your work directly supports a platform that serves text, vision, audio, and multimodal models at massive scale.
Requirements
- Deep fluency with Kubernetes, Prometheus, Grafana, and Terraform
- Proficiency in Python or Bash scripting
- Experience with alert design and SLOs for high-throughput APIs
- Experience with GPU-heavy workloads (vLLM, Triton, Ray, or similar)
- Background in MLOps or model-hosting platforms
Responsibilities
- Design and refine telemetry pipelines for metrics, logs, and traces
- Tune Kubernetes autoscalers for GPU efficiency
- Craft Terraform modules for resilient cluster deployments
- Harden request-routing and retry logic for transparent failure handling
- Drive incident detection, isolation, remediation, and post-mortem culture