Senior Distributed ML Engineer

LawZero

Montreal, Canada
Not specified; + compensation totaling 8% of salar...
On-site
3+ years distributed ML training experience
Proficiency with Megatron, DeepSpeed, FSDP, and vLLM
GPU profiling using PyTorch Profiler and Nsight
The role focuses on solving difficult training and inference problems using very large models within a novel AI safety agenda

Job Summary

  • The role focuses on solving difficult training and inference problems using very large models within a novel AI safety agenda.
  • Candidates will develop tools to simplify and orchestrate distributed computing resources while establishing best practices for large-scale workflows.
  • The company offers comprehensive health benefits, 20 days of vacation, and an employer contribution of 4% to retirement savings.

Salary

Base salary not specified. Additional compensation totaling 8% of salary for retirement or bonuses; employer contributes 4% to retirement savings.

Skills & Requirements

Must-have

  • 3+ years distributed ML training experience
  • Proficiency with Megatron, DeepSpeed, FSDP, and vLLM
  • GPU profiling using PyTorch Profiler and Nsight
  • Cloud platform experience (AWS, GCP, Azure)
  • Containerization with Docker and Kubernetes; gRPC

Nice-to-have

  • Advanced degree in machine learning or CS
  • Experience with vector databases
  • Track record in high-quality research projects
  • Collaboration with cross-functional teams
  • Familiarity with the Ray and SLURM workload managers

Key Requirements

  • Degree in computer science or related field
  • Master's or PhD in ML systems preferred
  • 3+ years designing distributed ML frameworks

Work Rights

Not specified
