Job Summary
The role involves diagnosing complex distributed systems issues to maintain CrowdStrike's mission-critical ML infrastructure, which processes billions of events daily.
Candidates will partner with ML engineers to resolve workflow issues, conduct post-mortems, and mentor others on debugging techniques.
CrowdStrike offers market-leading compensation, comprehensive wellness programs, and a culture that gives employees the flexibility and autonomy to own their careers.
Salary
Not specified. The posting describes market-leading compensation and equity awards, plus comprehensive physical and mental wellness programs.
Skills & Requirements
Must-have
12+ years in distributed systems engineering
5+ years debugging ML platforms in production
Deep expertise in Ray, Spark, JupyterHub, SLURM, or K8s
Performance profiling and optimization skills
Expert Python debugging and Linux/Unix proficiency
Nice-to-have
Open-source ML infrastructure contributions
Experience with high-throughput inference systems
Published debugging guides or tools
Chaos engineering and GPU/CUDA debugging experience
On-call and incident management experience
Key Requirements
12+ years in distributed systems engineering
5+ years debugging ML platforms in production
Expertise in at least three of: Ray, Spark, JupyterHub, SLURM, K8s