Expertise in Ray, Spark, JupyterHub, SLURM, or Kubernetes
Job Summary
CrowdStrike is a global leader in cybersecurity protecting modern organizations with an advanced AI-native platform processing trillions of events daily.
The role involves diagnosing complex distributed systems issues and ensuring platform reliability for ML infrastructure processing billions of events daily.
CrowdStrike offers market-leading compensation, comprehensive wellness programs, professional development opportunities, and a vibrant office culture with world-class amenities.
Skills & Requirements
Must-have
Distributed systems engineering
Debugging ML platforms in production
Expertise in at least three of: Ray, Spark, JupyterHub, SLURM, Kubernetes
Performance profiling and optimization
Python debugging and multi-language proficiency
Cloud infrastructure experience (AWS, GCP, Azure, or OCI)
Nice-to-have
Open-source ML infrastructure contributions
Experience with high-throughput inference systems
Published debugging guides or tools
Chaos engineering and GPU/CUDA debugging
On-call and incident management experience
Collaborative and mentoring skills
Key Requirements
12+ years in distributed systems engineering
5+ years debugging ML platforms in production
Expertise in at least three of: Ray, Spark, JupyterHub, SLURM, Kubernetes