15h ago
Machine Learning & Cloud Infra Engineer
London
$150k-$200k / year (est.)
full-time · mid · ai-ml
About This Role
You'll build and own the infrastructure for training massive generative 3D models at SpAItial. You'll design GPU clusters, distributed training systems, and storage pipelines that enable researchers to train world-scale models efficiently. This role combines deep systems engineering with direct impact on cutting-edge AI research.
What You'll Do
- Own and evolve ML + cloud infrastructure for training massive foundation models
- Design and operate GPU clusters with scheduling and capacity planning
- Support distributed training stacks (PyTorch DDP/FSDP) for performance and stability
- Build and optimize storage systems for petabyte-scale datasets
- Package and deploy workloads with Docker, Kubernetes, and Terraform
Requirements
- 3+ years of professional experience in infrastructure, platform, or cloud engineering
- Hands-on experience with GPU compute and performance debugging (CUDA/NCCL)
- Strong experience operating cloud environments (AWS, GCP, or Azure)
- Proficiency with containers and orchestration (Docker, Kubernetes) and infrastructure-as-code (Terraform)
Nice to Have
- ML infrastructure experience
- Experience with monitoring and observability tooling (Prometheus/Grafana, ELK)
- Experience building CI/CD for infra and ML workflows (CircleCI, GitHub Actions)
Benefits & Perks
- Flexible PTO
- Equity
- Health Insurance
- Learning Budget
Hiring Process
Estimated timeline: 2-4 weeks · AI estimate
- 1. Recruiter Screen · 30 min
- 2. Technical Interview · 60 min
- 3. Team Interview · 45 min