Job Summary
Ensure the reliability, availability, and performance of Unix/Linux platforms by proactively monitoring, optimizing, and improving system stability using SRE methodologies.
Lead incident response for high-severity issues, perform root-cause analysis (RCA), and implement permanent fixes to prevent recurrence.
Contribute to the design, enhancement, and lifecycle management of Unix/Linux platform services, ensuring they meet reliability, security, and performance standards.
Skills & Requirements
Must-have
Unix/Linux platform reliability
SRE methodologies
Incident response and RCA
Infrastructure automation
Observability and monitoring
Capacity and performance engineering
System hardening and patching
Nice-to-have
Continuous improvement mindset
Collaborative problem-solving
Data-driven decisions
Willingness to participate in on-call rotation
Key Requirements
Bachelor’s degree in Computer Science, IT, Engineering, or a related field
Strong expertise in Unix/Linux administration (RHEL, AIX, Solaris)
Strong shell scripting skills (Bash, ksh)
Strong programming/scripting skills in Python, Perl
Hands-on experience with automation tools (Ansible, Terraform)
Practical experience with monitoring platforms (Datadog, Nagios)
Experience with virtualisation technologies, SAN infrastructure, and backup and restore
Strong experience in on-call operations and incident response