Meta is seeking a Software Engineer for its AI Networking team, focusing on developing software for multi-GPU and multi-node data communication to enhance distributed machine learning workloads. Ideal candidates will have strong programming skills in C/C++ and Python, along with experience in machine learning and high-performance computing.
Job Summary
The team owns the critical software stack around NCCL that enables multi-GPU and multi-node data communication for nearly every distributed GPU-based ML workload at Meta.
Engineers will lead the development of collective communication libraries with a specific focus on improving reliability and performance for large-scale GenAI and LLM training.
This role requires deep expertise in machine learning frameworks like PyTorch and specialized experience in distributed training paradigms such as Data Parallel and Model Parallel.
Matching Summary
Match Score: 85
Skills & Requirements
Must-have
C/C++ and Python programming skills
Distributed ML Training experience
GPU architecture knowledge
ML systems and AI infrastructure expertise
High-performance computing background
Nice-to-have
Experience with the NCCL library
PhD in Computer Science or related field
CUDA programming proficiency
RoCE/InfiniBand performance analysis
FSDP and Tensor Parallel implementation
Key Requirements
Bachelor's degree in Computer Science or equivalent practical experience
Proven track record of leading successful projects