Engineer, Supercomputing & Distributed Systems

San Francisco

$170k-$240k / year (est.)

full-time · senior · ai-ml

💼 About This Role

You'll design and operate the infrastructure behind Krea's research and inference: distributed training, GPU clusters, and petabyte-scale data pipelines. Your work will directly power next-generation AI creative tools. You'll build custom distributed datastores and streaming pipelines from scratch and solve the orchestration and scaling problems that come with them.

🎯 What You'll Do

  • Design and build distributed data pipelines for petabyte-scale datasets
  • Manage and scale Kubernetes clusters of 1,000+ GPUs for training and inference
  • Profile and optimize distributed training infrastructure (NCCL, InfiniBand)
  • Develop custom orchestration and fault tolerance systems for large-scale ML

📋 Requirements

  • Strong fundamentals in distributed systems design and debugging
  • Proficiency in Python and experience with data tools (DuckDB, PyArrow, Pandas)
  • Experience with Kubernetes for container orchestration at scale
  • Deep understanding of OS, networking, and file-system internals

✨ Nice to Have

  • Experience with PyTorch internals and custom dataloaders
  • Knowledge of NCCL, InfiniBand, or RDMA
  • Experience with streaming systems like Kafka or Pulsar

🎁 Benefits & Perks

  • 🏖️ Flexible PTO
  • 🏥 Health insurance
  • 📈 Equity