Identify and investigate problems. Conduct root cause analysis and identify suitable solutions to ensure availability and reliability of applications, services, infrastructure and platforms.
Areas of responsibility may include, but are not limited to:
Conduct trend analysis of data, both systematically and manually, to identify common occurrences and recurring issues that feed into the Problem Management process.
Perform impact assessments to determine priority of a problem relative to other problems and business activities.
Determine the root cause of the problem by various means, including, but not limited to, recreating the issue in a test environment, reviewing system design, and asking relevant questions of stakeholders.
Provide recommendations for improving services or processes, with the aim of increasing availability, improving service levels, reducing costs, and improving customer satisfaction by reducing the number of operational problems.
Identify interim and long-term solutions, considering cost effectiveness and ease of implementation.
Review business and technical specifications prior to development and identify any potential problems that the proposed solution could create.
Document and present the Problem Reports related to identified problems. Provide assessments of risks, impacts, severity, possible alternative solutions, status of investigation and recommendations.
Produce monthly project management reports per team, giving an area-specific analysis of all problems.
Review records produced by junior team members, provide recommendations for improvement, and enhance processes where applicable.
Participate in various cross-functional forums and lead work streams to contribute to the improvement and implementation of policies, frameworks and standards.
Client and Executive reporting.
Clear, concise and timely communication with emphasis on expressing technical issues in a non-technical manner to clients and executives.
Facilitate discussions (including technical discussions) to establish root causes and solutions to any infrastructure, application or process related issues.
Conduct research to establish more efficient ways of performing day-to-day activities using new technologies or frameworks, and identify opportunities for automation.
Drive infrastructure and application performance and availability initiatives.
Produce and present regular reports on availability, capacity and service performance.
Ensure that remedial actions are taken to rectify identified issues and that escalated incidents are attended to.
Participate in the ongoing enhancement of Problem Management policies and processes.
Maintain the Known Error Database.
Participate in the evolution and maintenance of the CMDB whilst leveraging insights to drive Problem Management objectives.
Participate in and facilitate incident post-mortems.
Personal Attributes and Skills
Statistical analysis and reporting
Root Cause Analysis
Business writing (reports) and presentation
Relevant tertiary qualification (IT or Engineering)
Knowledge and Experience
Kepner & Fourie Root Cause Analysis Framework
Site Reliability Engineering
3 or more years’ experience in problem management
ITSM Tools (ServiceNow experience preferred)
APM and Infrastructure Monitoring Tools (Dynatrace experience preferred)
AWS CloudWatch (beneficial)