5h ago
Engineer, Supercomputing & Distributed Systems
San Francisco
✨ $170k-$240k / year (est.)
full-time · senior · ai-ml
💼 About This Role
You'll design and operate the infrastructure behind Krea's research and inference: distributed training, GPU clusters, and petabyte-scale data pipelines. Your work will directly power next-generation AI creative tools. You'll build custom distributed datastores and streaming pipelines from scratch and solve hard orchestration and scaling problems along the way.
🎯 What You'll Do
- Design and build distributed data pipelines for petabyte-scale datasets
- Manage and scale Kubernetes clusters of 1,000+ GPUs for training and inference
- Profile and optimize distributed training infrastructure (NCCL, InfiniBand)
- Develop custom orchestration and fault tolerance systems for large-scale ML
📋 Requirements
- Strong fundamentals in distributed systems design and debugging
- Proficiency in Python and experience with data tools (DuckDB, PyArrow, Pandas)
- Experience with Kubernetes for container orchestration at scale
- Deep understanding of OS, networking, and file-system internals
✨ Nice to Have
- Experience with PyTorch internals and custom dataloaders
- Knowledge of NCCL, InfiniBand, or RDMA
- Experience with streaming systems like Kafka or Pulsar
🎁 Benefits & Perks
- 🏖️ Flexible PTO
- 🏥 Health insurance
- 📈 Equity