Senior Distributed ML Engineer
Montreal
Full-time, Senior, Artificial Intelligence
Description
You will work closely with ML research scientists to solve difficult training and inference problems involving very large models, develop tools for distributed computing, and establish best practices for large-scale ML workflows.
Requirements
- Degree in computer science or related field
- 3+ years of experience designing distributed ML training frameworks
- Experience with Megatron, DeepSpeed, HuggingFace Accelerate, FSDP, vLLM, or verl
- Experience with cloud platforms (AWS, GCP, Azure) and workload managers (Ray, SLURM)
- Familiarity with GPU profiling tools and containerization (Docker, Kubernetes)
Responsibilities
- Collaborate with researchers to speed up research, model training, and inference in distributed environments
- Investigate performance bottlenecks and optimize computing resource utilization
- Develop tools and libraries for orchestrating distributed computing resources
- Establish and document best practices for large-scale distributed ML workflows