Senior Software Engineer, ML Platform (Stability & Infrastructure)
London
full-time, senior, biotechnology
Description
You will own the end-to-end strategy for platform reliability, with a focus on GPU/TPU infrastructure and workload orchestration. You will design robust test harnesses, optimize inference services, and overhaul monitoring systems so that research workflows run smoothly.
Requirements
- Proven experience architecting and managing large-scale AI/ML workloads in production
- Expertise in compute design on Google Cloud Platform (GCP)
- Significant experience deploying and managing complex workloads on Kubernetes (GKE)
- Professional familiarity with NVIDIA GPU generations and high-performance compute
- Strong programming skills with a reliability-first approach
Responsibilities
- Own the end-to-end strategy for platform reliability, focused on accelerator infrastructure and workload orchestration
- Design and implement a robust test harness to validate infrastructure upgrades
- Architect and optimize next-generation inference services for high-throughput performance
- Overhaul logging and monitoring systems for proactive alerting and telemetry
- Improve internal CI/CD stability and reduce failure rates