
Member of Engineering (Scalability)

Remote (EMEA/East Coast)

$200k-$350k / year (est.)

full-time · senior · Remote · ai-ml

💼 About This Role

You'll join our pre-training team, which builds distributed training and inference systems for Large Language Models in pursuit of AGI. Your focus will be software reliability and fault tolerance at scale: minimizing GPU idle time and improving checkpointing. You'll have access to thousands of GPUs to test your changes.

🎯 What You'll Do

  • Troubleshoot hardware problems during training at scale
  • Minimize GPU idle time during faults, both operationally and strategically
  • Design tools to accelerate training recovery
  • Improve performance and reliability of checkpointing

📋 Requirements

  • Understanding of Large Language Models
  • Strong engineering background with Python, PyTorch, or JAX
  • Distributed systems and fault-tolerance knowledge
  • Linux kernel and NCCL experience

✨ Nice to Have

  • Experience with C/C++ and the CUDA API
  • Knowledge of the Kubernetes stack

🎁 Benefits & Perks

  • 🌍 Fully remote work & flexible hours
  • 🏖️ 37 days/year of vacation & holidays
  • 💊 Health insurance allowance for you & dependents
  • 💻 Company-provided equipment
  • 🏆 Well-being, always-be-learning & home office allowances

📨 Hiring Process

Intro call with Founding Engineer, technical interview(s), team fit call, final interview with Founding Engineer.
