Ensure the reliability, availability, and performance of Unix/Linux platforms by proactively monitoring, optimizing, and improving system stability using SRE methodologies
Job Summary
Ensure the reliability, availability, and performance of Unix/Linux platforms by proactively monitoring, optimizing, and improving system stability using SRE methodologies.
Lead incident response for high-severity issues, perform root-cause analysis (RCA), and implement permanent fixes to prevent recurrence.
Contribute to the design, enhancement, and lifecycle management of Unix/Linux platform services, ensuring they meet reliability, security, and performance standards.
Skills & Requirements
Must-have
Unix/Linux platform reliability
SRE methodologies
Incident response and RCA
Infrastructure automation
Observability and monitoring (Datadog, Prometheus)
Capacity and performance engineering
System hardening and patching
Nice-to-have
Continuous improvement mindset
Data-driven decisions
Collaborative problem-solving
Willingness to participate in a 24/7 on-call rotation
Key Requirements
Bachelor’s degree in Computer Science, IT, Engineering, or a related field