Infrastructure SRE Team Lead

Lesaka Technologies

Cape Town Full-day Full-time

Description:

A vacancy exists for an Infrastructure SRE Team Lead within the Micro Merchant Division – Kazang, in Century City, Cape Town.

This role is ideal for a seasoned Infrastructure SRE professional looking to take on a leadership position and drive innovation within a dynamic team.

We are seeking an experienced Infrastructure Site Reliability Engineer (SRE) Team Lead with deep expertise in Linux-based, open-source environments to lead a team ensuring the reliability, scalability, and performance of our critical systems. This role involves technical leadership, strategic planning, and hands-on implementation of automated solutions for system monitoring, optimization, and infrastructure management. You will collaborate with the DevOps and engineering teams, guiding best practices in CI/CD, observability, and infrastructure automation, while mentoring a team to enhance system resilience and operational efficiency.

Key Responsibilities include, but are not limited to:

Lead and mentor a team, fostering a culture of reliability, automation, and continuous improvement.
Provide technical guidance and career development support for team members.
Design, implement, and maintain reliable systems in a Linux and open-source environment to meet uptime and performance objectives.
Support the DevOps team with CI/CD pipelines, ensuring seamless and reliable deployments.
Manage and optimize AWS-based infrastructure for scalability, cost efficiency, and performance.
Develop and maintain monitoring and alerting systems to ensure observability and proactively address system issues.
Build and maintain robust solutions for metric collection, dashboarding, and alerting to provide actionable insights and real-time system visibility.
Conduct root cause analysis for incidents, implementing preventive measures to improve system resilience.
Perform regular system maintenance, including updates, patches, and optimizations.
Prepare and deliver comprehensive reporting on system performance, incidents, and reliability metrics.
Identify and mitigate risks to system reliability, scalability, and security.
Ensure compliance with organizational and regulatory standards in system design and operations.
Manage on-call rotations and incident response protocols.

In order to be considered for this position, the following requirements must be met:

Bachelor of Science or any related tertiary qualification.

A minimum of 5 years of professional experience in Site Reliability Engineering, DevOps, or a related field, with demonstrated expertise in Linux-based, open-source environments and cloud infrastructure (AWS), and a desire to progress into a leadership capacity.
Proven ability to mentor and develop team members.

Competencies required:

Excellent leadership and communication skills.
Strategic thinker with a proactive and results-oriented approach.
Ability to build and maintain strong cross-functional relationships.
High attention to detail and ability to enforce best practices.
Passion for technology and continuous learning.
Strong problem-solving and analytical skills.
Expertise in diagnosing and resolving complex system issues, including performance bottlenecks, service outages, and application errors, using debugging tools, logs, and monitoring data.
Proficiency in at least one programming or scripting language (e.g., Python, Bash, Go), with the ability to write automation scripts, develop tools, and optimize system performance.
Hands-on experience with AWS services (e.g., EC2, S3, RDS, VPC), with the ability to design, manage, and optimize cloud-based infrastructure for scalability, reliability, and cost-efficiency.
Skilled in implementing monitoring solutions and designing systems for metrics collection, dashboarding, and alerting to ensure system health and performance.
Proficiency with tools like Ansible, Terraform, or similar frameworks to automate system management, deployments, and configurations, reducing manual effort and ensuring consistency.
Demonstrates a proactive and analytical approach to identifying issues, diagnosing root causes, and implementing effective solutions in complex technical environments.
Works effectively with cross-functional teams, including DevOps, development, and operations, fostering a culture of shared ownership and open communication to achieve reliability goals.
Embraces change, learns new technologies quickly, and adjusts strategies to meet evolving system and organizational needs, particularly in fast-paced, dynamic environments.

16 Apr 2025; from: careers24.com