Machine Learning Systems & Infrastructure Engineer
London
$125k-$175k / year (est.)
full-time · mid · ai-ml
About This Role
You'll build and own the systems that turn raw real-world data into trained world models and reliable production endpoints for a generative 3D AI company. You will design, implement, and operate scalable training stacks, data ingestion pipelines, and model serving, working closely with the research team in a hands-on, code-heavy role. This is a unique opportunity to shape the infrastructure for next-generation world models.
What You'll Do
- Own and evolve ML systems for training, evaluation, and serving of large foundation models.
- Improve distributed training stacks (PyTorch DDP/FSDP) for performance and stability.
- Build end-to-end data pipelines for ingestion, preprocessing, and storage at petabyte scale.
- Operate ML workflow orchestration and model serving platforms (Kubeflow, Airflow, Modal).
- Manage containerization, IaC (Terraform), and CI/CD for GPU workloads.
Requirements
- 3+ years writing production-quality Python in a large codebase.
- Hands-on with modern ML training stacks (PyTorch, DDP/FSDP) and debugging distributed jobs.
- Shipped end-to-end data pipelines at scale with real-world sources.
- Proficient with containers (Docker, Kubernetes) and IaC (Terraform).
Nice to Have
- Experience with ML workflow orchestration (Kubeflow Pipelines, Airflow) and experiment tracking (MLflow).
- Knowledge of observability tooling (Prometheus/Grafana, OpenTelemetry).
Benefits & Perks
- Flexible time off
- Equity package
- Learning budget
- Central London office
- Daily lunch provided
Hiring Process
Estimated timeline: 2-3 weeks · AI estimate
- 1. Recruiter Call · 30 min
- 2. Technical Interview · 60 min
- 3. Final Round · 45 min