Infrastructure SRE Team Lead

Lesaka Technologies
Cape Town Full-day Full-time

Description:

A vacancy exists for an Infrastructure SRE Team Lead within the Micro Merchant Division – Kazang, in Cape Town, Century City.


This role is ideal for a seasoned Infrastructure SRE professional looking to take on a leadership position and drive innovation within a dynamic team.

We are seeking an experienced Infrastructure Site Reliability Engineer (SRE) Team Lead with deep expertise in Linux-based, open-source environments to lead a team ensuring the reliability, scalability, and performance of our critical systems. This role involves technical leadership, strategic planning, and hands-on implementation of automated solutions for system monitoring, optimization, and infrastructure management. You will collaborate with the DevOps and engineering teams, guiding best practices in CI/CD, observability, and infrastructure automation, while mentoring a team to enhance system resilience and operational efficiency.

Key Responsibilities include, but are not limited to:

  • Lead and mentor a team, fostering a culture of reliability, automation, and continuous improvement.
  • Provide technical guidance and career development support for team members.
  • Design, implement, and maintain reliable systems in a Linux and open-source environment to meet uptime and performance objectives.
  • Support the DevOps team with CI/CD pipelines, ensuring seamless and reliable deployments.
  • Manage and optimize AWS-based infrastructure for scalability, cost efficiency, and performance.
  • Develop and maintain monitoring and alerting systems to ensure observability and proactively address system issues.
  • Build and maintain robust solutions for metric collection, dashboarding, and alerting to provide actionable insights and real-time system visibility.
  • Conduct root cause analysis for incidents, implementing preventive measures to improve system resilience.
  • Perform regular system maintenance, including updates, patches, and optimizations.
  • Prepare and deliver comprehensive reporting on system performance, incidents, and reliability metrics.
  • Identify and mitigate risks to system reliability, scalability, and security.
  • Ensure compliance with organizational and regulatory standards in system design and operations.
  • Manage on-call rotations and incident response protocols.

In order to be considered for this position, the following requirements must be met:

  • Bachelor of Science or any related tertiary qualification.
  • A minimum of 5 years of professional experience in Site Reliability Engineering, DevOps, or a related field, with demonstrated expertise in Linux-based, open-source environments and cloud infrastructure (AWS), and a desire to progress into a leadership capacity.
  • Proven ability to mentor and develop team members.

Competencies required:

  • Excellent leadership and communication skills.
  • Strategic thinker with a proactive and results-oriented approach.
  • Ability to build and maintain strong cross-functional relationships.
  • High attention to detail and ability to enforce best practices.
  • Passion for technology and continuous learning.
  • Strong problem-solving and analytical skills.
  • Expertise in diagnosing and resolving complex system issues, including performance bottlenecks, service outages, and application errors, using debugging tools, logs, and monitoring data.
  • Proficiency in at least one programming or scripting language (e.g., Python, Bash, Go), with the ability to write automation scripts, develop tools, and optimize system performance.
  • Hands-on experience with AWS services (e.g., EC2, S3, RDS, VPC), with the ability to design, manage, and optimize cloud-based infrastructure for scalability, reliability, and cost-efficiency.
  • Skilled in implementing monitoring solutions and designing systems for metrics collection, dashboarding, and alerting to ensure system health and performance.
  • Proficiency with tools like Ansible, Terraform, or similar frameworks to automate system management, deployments, and configurations, reducing manual effort and ensuring consistency.
  • Demonstrates a proactive and analytical approach to identifying issues, diagnosing root causes, and implementing effective solutions in complex technical environments.
  • Works effectively with cross-functional teams, including DevOps, development, and operations, fostering a culture of shared ownership and open communication to achieve reliability goals.
  • Embraces change, learns new technologies quickly, and adjusts strategies to meet evolving system and organizational needs, particularly in fast-paced, dynamic environments.

16 Apr 2025;   from: careers24.com
