Research Engineer (LLM Training and Performance)

Amsterdam, Netherlands; Belgrade, Serbia; Berlin, Germany; Limassol, Cyprus; London, United Kingdom; Madrid, Spain; Munich, Germany; Paphos, Cyprus; Prague, Czech Republic; Warsaw, Poland; Yerevan, Armenia
Full-time · Senior · Software Development

Description

You will own the training stack and model architecture for our Mellum LLM family, making training faster, cheaper, and more stable at scale. You'll profile, design, and implement changes to the training pipeline, from architecture to custom GPU kernels.

Requirements

  • Strong PyTorch and PyTorch Distributed experience running multi-node jobs with tens to hundreds of GPUs
  • Hands-on experience with Megatron-LM/Megatron-Core/NeMo, DeepSpeed, or FSDP/ZeRO
  • Real profiling expertise (Nsight Systems/Compute, nvprof) and experience with NVTX-instrumented workflows; see the sketch after this list
  • GPU programming skills in Triton and/or CUDA, with the ability to write, test, and debug kernels
  • Solid understanding of NCCL collectives, topology, and fabric effects (IB/RoCE)
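
As a rough illustration of the NVTX-instrumented workflow the profiling bullet refers to, here is a minimal sketch of a PyTorch training step annotated with NVTX ranges for capture under Nsight Systems (e.g. `nsys profile python train.py`). The linear model and optimizer are hypothetical placeholders, not part of the posting.

```python
# Minimal sketch: NVTX-annotated training step for Nsight Systems capture.
# Model and optimizer are hypothetical stand-ins.
import torch
import torch.cuda.nvtx as nvtx

model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = torch.randn(8, 4096, device="cuda")

for step in range(10):
    with nvtx.range(f"step_{step}"):
        with nvtx.range("forward"):
            loss = model(data).square().mean()
        with nvtx.range("backward"):
            loss.backward()
        with nvtx.range("optimizer"):
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```

Ranges like these appear as labeled spans on the Nsight Systems timeline, which makes gaps between compute and communication easy to spot.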

Responsibilities

  • Improve end-to-end performance for multi-node LLM pre-training and post-training pipelines
  • Profile and fix hotspots using compute/communication overlap, kernel fusion, and scheduling changes
  • Design and evaluate architecture choices (depth/width, attention variants, MoE routing)
  • Implement custom ops (Triton/CUDA C++) and integrate them via PyTorch extensions; a minimal Triton sketch follows this list
  • Push memory and performance levers: FSDP/ZeRO, activation checkpointing, FP8 via Transformer Engine, parallelism strategies, and NCCL tuning
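
To make the custom-op bullet concrete, below is a minimal Triton kernel that fuses an elementwise add and ReLU into a single pass over memory, with a thin PyTorch wrapper. This is an illustrative sketch only; the fused operation and all names are hypothetical, not the team's actual kernels.

```python
# Illustrative sketch: a fused add+ReLU Triton kernel with a PyTorch wrapper.
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fuse the add and the ReLU so the data is read and written only once.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
assert torch.allclose(fused_add_relu(x, y), torch.relu(x + y))
```

In practice, a kernel like this would be exposed through a PyTorch extension and benchmarked against the unfused baseline before being adopted.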