Description:
Essential Skills Required:
Strong understanding of IT service management principles and practices.
Proficiency in monitoring and management tools (e.g., dashboards, alerting systems).
Strong analytical and problem-solving abilities, particularly in IT service management.
Experience in conducting root cause analysis (RCA) and managing known issues.
Experience in performing regular and sporadic operational tasks to ensure optimal performance of IT services.
Ability to manage IT service continuity, availability, and capacity effectively.
Experience with change management processes, including creating and syncing changes with teams.
Ability to plan and execute capacity extensions and backup/restore processes.
Any additional responsibilities assigned in the Agile Working Model (AWM) Charter.
Advantageous Skills:
Experience with IT service management frameworks (e.g., ITIL, SRE practices).
Familiarity with cloud platforms (e.g., Azure) and their operational management.
Experience with automation tools (e.g., Ansible, Puppet, Terraform) and scripting languages (e.g., Python, Bash) to streamline operational tasks.
Understanding of DevOps methodologies and practices, including CI/CD processes.
Knowledge of network protocols, configurations, and troubleshooting to support IT infrastructure.
Understanding of IT security best practices and compliance requirements to ensure secure operations.
Skills in data analysis and visualization tools (e.g., Splunk, Grafana) to interpret operational metrics and trends.
Willing and able to travel internationally (twice a year).
Above-board work ethics.
Qualifications/Experience:
Minimum of 6 years of experience in IT operations or a similar role.
Role and Responsibilities:
Monitor and Operate IT Products: Perform regular and sporadic operational tasks to ensure optimal performance of IT services. Own and maintain the Regular OPS Tasks list, refining sporadic tasks based on input from the Operations Experts (OE) network.
Manage IT Service Continuity: Prepare for and attend emergency exercises (EE), reviewing outcomes and deriving follow-up tasks. Communicate findings and improvements to the OE network.
Manage Availability: Participate in "Gamedays" and backup/restore test sessions, practicing and executing backup and restore processes. Own the recovery and backup plan, reviewing success and identifying follow-up tasks.
Manage Capacity: Monitor cluster capacity using prepared dashboards and coordinate with the DevOps team for any issues. Plan and execute capacity extensions a
15 Apr 2025;
from:
gumtree.co.za