1d ago

Software Engineer, Internal Infrastructure

United Kingdom

$150k-$250k / yearest.

full-timemid Remoteai-ml

🛠 Tech Stack

💼 About This Role

You'll build and operate Kubernetes GPU superclusters across multiple clouds, working closely with AI researchers to support cutting-edge model training. Your work will directly accelerate the development of industry-leading AI models that power Cohere's platform. You'll also partner with cloud providers to optimize infrastructure costs and performance for AI workloads.

🎯 What You'll Do

  • Build and operate Kubernetes compute superclusters across multiple clouds
  • Partner with cloud providers to optimize infrastructure costs and performance
  • Work with research teams to improve stability and efficiency of model training
  • Design and build resilient systems for training AI models

📋 Requirements

  • Deep experience running Kubernetes clusters at scale
  • Strong programming skills in Go or Python
  • Experience with Infrastructure as Code and Cloud Native infrastructure
  • Self-directed and adaptable problem-solving skills

✨ Nice to Have

  • ML training infrastructure and GPU workload experience
  • RDMA networking familiarity
  • Low-level Linux expertise

🎁 Benefits & Perks

  • 🤝 Inclusive culture and work environment
  • 🧑‍💻 Cutting-edge AI research team
  • 🦷 Full health and dental benefits with mental health budget
  • 🐣 Parental Leave top-up up to 6 months
  • ✈️ 6 weeks vacation (30 working days)
0 0 0