Senior Software Engineer, ML Platform (Stability & Infrastructure)

London
full-time, senior, biotechnology

Description

You will own the end-to-end strategy for platform reliability, with a focus on GPU/TPU infrastructure and workload orchestration. You will design robust test harnesses, optimize inference services, and overhaul monitoring systems so that research workflows run reliably.

Requirements

  • Proven experience in architecting and managing large-scale AI/ML workloads in production
  • Expertise in designing compute infrastructure on Google Cloud Platform (GCP)
  • Significant experience deploying and managing complex workloads on Kubernetes (GKE)
  • Professional familiarity with NVIDIA GPU generations and high-performance compute
  • Strong programming skills with a reliability-first approach

Responsibilities

  • Own the end-to-end strategy for platform reliability, focused on accelerator infrastructure and workload orchestration
  • Design and implement a robust test harness to validate infrastructure upgrades
  • Architect and optimize next-generation inference services for high-throughput performance
  • Overhaul logging and monitoring systems for proactive alerting and telemetry
  • Improve internal CI/CD stability and reduce failure rates