Senior Distributed ML Engineer

Montreal
Full-time, Senior, Artificial Intelligence

Description

You will work closely with ML research scientists to solve difficult training and inference problems involving very large models, develop tools for distributed computing, and establish best practices for large-scale ML workflows.

Requirements

  • Degree in computer science or related field
  • 3+ years of experience designing distributed ML training frameworks
  • Experience with Megatron, DeepSpeed, HuggingFace Accelerate, FSDP, vLLM, or verl
  • Experience with cloud platforms (AWS, GCP, Azure) and workload managers (Ray, SLURM)
  • Familiarity with GPU profiling tools and containerization (Docker, Kubernetes)

Responsibilities

  • Collaborate with researchers to accelerate research and model training/inference in distributed environments
  • Investigate performance bottlenecks and optimize computing resource utilization
  • Develop tools and libraries for orchestrating distributed computing resources
  • Establish and document best practices for large-scale distributed ML workflows