Description:
A vacancy exists for an Infrastructure SRE Team Lead within the Micro Merchant Division – Kazang, in Cape Town, Century City.
This role is ideal for a seasoned Infrastructure SRE professional looking to take on a leadership position and drive innovation within a dynamic team.
We are seeking an experienced Infrastructure Site Reliability Engineer (SRE) Team Lead with deep expertise in Linux-based, open-source environments to lead a team ensuring the reliability, scalability, and performance of our critical systems. This role involves technical leadership, strategic planning, and hands-on implementation of automated solutions for system monitoring, optimization, and infrastructure management. You will collaborate with the DevOps and engineering teams, guiding best practices in CI/CD, observability, and infrastructure automation, while mentoring a team to enhance system resilience and operational efficiency.
Key Responsibilities include, but are not limited to:
- Lead and mentor a team, fostering a culture of reliability, automation, and continuous improvement.
- Provide technical guidance and career development support for team members.
- Design, implement, and maintain reliable systems in a Linux and open-source environment to meet uptime and performance objectives.
- Support the DevOps team with CI/CD pipelines, ensuring seamless and reliable deployments.
- Manage and optimize AWS-based infrastructure for scalability, cost efficiency, and performance.
- Develop and maintain monitoring and alerting systems to ensure observability and proactively address system issues.
- Build and maintain robust solutions for metric collection, dashboarding, and alerting to provide actionable insights and real-time system visibility.
- Conduct root cause analysis for incidents, implementing preventive measures to improve system resilience.
- Perform regular system maintenance, including updates, patches, and optimizations.
- Prepare and deliver comprehensive reporting on system performance, incidents, and reliability metrics.
- Identify and mitigate risks to system reliability, scalability, and security.
- Ensure compliance with organizational and regulatory standards in system design and operations.
- Manage on-call rotations and incident response protocols.
In order to be considered for this position, the following requirements must be met:
- Bachelor of Science or any related tertiary qualification.
- A minimum of 5 years of professional experience in Site Reliability Engineering, DevOps, or a related field, with demonstrated expertise in Linux-based, open-source environments and cloud infrastructure (AWS), and a desire to progress into a leadership capacity.
- Proven ability to mentor and develop team members.
- Excellent leadership and communication skills.
- Strategic thinker with a proactive and results-oriented approach.
- Ability to build and maintain strong cross-functional relationships.
- High attention to detail and ability to enforce best practices.
- Passion for technology and continuous learning.
- Strong problem-solving and analytical skills.
- Expertise in diagnosing and resolving complex system issues, including performance bottlenecks, service outages, and application errors, using debugging tools, logs, and monitoring data.
- Proficiency in at least one programming or scripting language (e.g., Python, Bash, Go), with the ability to write automation scripts, develop tools, and optimize system performance.
- Hands-on experience with AWS services (e.g., EC2, S3, RDS, VPC), with the ability to design, manage, and optimize cloud-based infrastructure for scalability, reliability, and cost-efficiency.
- Skilled in implementing monitoring solutions and designing systems for metrics collection, dashboarding, and alerting to ensure system health and performance.
- Proficiency with tools like Ansible, Terraform, or similar frameworks to automate system management, deployments, and configurations, reducing manual effort and ensuring consistency.
- Demonstrates a proactive and analytical approach to identifying issues, diagnosing root causes, and implementing effective solutions in complex technical environments.
- Works effectively with cross-functional teams, including DevOps, development, and operations, fostering a culture of shared ownership and open communication to achieve reliability goals.
- Embraces change, learns new technologies quickly, and adjusts strategies to meet evolving system and organizational needs, particularly in fast-paced, dynamic environments.