Fal.ai is seeking a Senior/Staff Site Reliability Engineer to manage and enhance their production infrastructure, focusing on systems reliability, automation, and incident response. The ideal candidate will have extensive experience with Kubernetes, CI/CD pipelines, and production systems, and be driven by a culture of continuous improvement.
Job Summary
Own and operate our Kubernetes infrastructure, including cluster lifecycle, upgrades, networking, and multi-tenant isolation for customer workloads.
Use AI extensively to automate the analysis and resolution of production issues and to improve software development speed, reliability, and maintainability.
Define and enforce SLOs and build out incident response processes, while managing and improving networking, load balancing, and service mesh configurations.
Matching Summary
Match Score: 85