20h ago
Software Engineer, Internal Infrastructure
United Kingdom
✨ $150k-$250k / yearest.
full-timemid Remoteai-ml
🛠 Tech Stack
💼 About This Role
You'll build and operate Kubernetes GPU superclusters across multiple clouds, working closely with AI researchers to support cutting-edge model training. Your work will directly accelerate the development of industry-leading AI models that power Cohere's platform. You'll also partner with cloud providers to optimize infrastructure costs and performance for AI workloads.
🎯 What You'll Do
- Build and operate Kubernetes compute superclusters across multiple clouds
- Partner with cloud providers to optimize infrastructure costs and performance
- Work with research teams to improve stability and efficiency of model training
- Design and build resilient systems for training AI models
📋 Requirements
- Deep experience running Kubernetes clusters at scale
- Strong programming skills in Go or Python
- Experience with Infrastructure as Code and Cloud Native infrastructure
- Self-directed and adaptable problem-solving skills
✨ Nice to Have
- ML training infrastructure and GPU workload experience
- RDMA networking familiarity
- Low-level Linux expertise
🎁 Benefits & Perks
- 🤝 Inclusive culture and work environment
- 🧑💻 Cutting-edge AI research team
- 🦷 Full health and dental benefits with mental health budget
- 🐣 Parental Leave top-up up to 6 months
- ✈️ 6 weeks vacation (30 working days)
0 0 0