Problem Management
Definition:- Problem management is the process of identifying and managing the causes of incidents on an IT service. It is a core component of ITSM frameworks.
Scope:- Problem management has a very limited scope and includes the following activities:
- Problem detection
- Problem logging
- Problem categorization
- Problem prioritization
- Problem investigation and diagnosis
- Creating a known error record
- Problem resolution and closure
- Major problem review
Process:- The ITIL problem management process has many steps, and each is vitally important to the success of the process and the quality of service delivered.
- The first step is to detect the problem. A problem is raised either through escalation from the service desk, or through proactive evaluation of incident patterns and alerts from event management or continual service improvement processes. Signs of a problem include incidents that occur across the organization with similar conditions, incidents that repeat despite otherwise successful troubleshooting, and incidents that are unresolvable at the service desk.
- The second step is to log the problem. In an ITIL framework, problems are logged in a problem record. A problem record is a compilation of every problem in an organization. This can be accomplished via a ticketing system that allows for problem ticket types. Pertinent problem data, such as the time and date of occurrence, the related incident(s), the symptoms, previous troubleshooting steps, and the problem category all help the problem management team research the root cause.
- The third step is to categorize the problem. Problem categorization should match incident categorization. Incident [and problem] categorization involves assigning a main and secondary category to the issue. This step is beneficial in several ways. One benefit is that it allows the service desk to sort and model incidents that occur regularly. The modeling allows for automatic assignment of prioritization. The third and most important benefit is the ability to gather and report on service desk data. This data allows the organization to not only track problem trends, but also to assess its effect on service demand and service provider capacity.
- The fourth step is to prioritize the problem. A problem’s priority is determined by its impact on users and on the business and its urgency. Urgency is how quickly the organization requires a resolution to the problem. The impact is a measure of the extent of potential damage the problem can cause the organization. Prioritizing the problem allows an organization to utilize investigative resources most effectively. It also allows organizations to mitigate damage to the service level agreement (SLA) by reallocating resources as soon as the issue is known.
- The fifth step is a two-part process, which involves investigating and diagnosing the problem. The speed at which a problem is investigated and diagnosed depends on its assigned priority. High-priority issues should always be addressed first, as their impact on services is the greatest. Correct categorization helps here, since identifying trends is easier when problem categories correlate to incident categories. Diagnosis usually involves analyzing the incidents that lead to the problem report as well as further testing that may not be possible at the service desk level, such as advanced log analysis.
- The sixth step is to identify a workaround for the problem. A workaround should always be indicated, because problems are not resolved at the incident level. A workaround enables the service desk to restore services to users while the problem is being resolved. A problem can take anywhere from an hour to months to resolve, therefore a workaround is vital. A problem is considered open until resolved, so a workaround should only be considered a temporary measure.
- Step seven is to raise a known error record. Once the workaround has been identified, it should be communicated to staff within the organization as a known error. It’s good practice to record a known error in both an incident knowledge base and what ITIL calls a known error database (KEDB). Documenting the workaround allows the service desk to resolve incidents quickly and avoid further problems being raised on the same issue.
- Step eight is to resolve the problem. Problems should be resolved whenever possible. Resolution resolves the underlying cause of a set of incidents and prevents those incidents from recurring. Some resolutions may require the change management board, as they may affect service levels. For example, a database switchover may cause slowness during the switchover period. All risks should be evaluated and accounted for before implementing the resolution. Document the steps taken to resolve the problem in the organization’s knowledge base.
- The ninth step is to close the problem. This step should only occur after the problem has been raised, categorized, prioritized, identified, diagnosed, and resolved. While many organizations stop at this step, it isn’t the last according to ITIL.
- The final step is to review the problem. This is also known as a major problem review. The major problem review is an organizational activity that prevents future problems. During the review, the problem management team evaluates the problem documentation and identifies what happened and why. Lessons learned, such as process bottlenecks, what went wrong, and what helped should be discussed. This is where having a complete problem log will help. A completed log will work much better than trying to pull the details from memory. This problem review should result in improved processes, staff training, or more complete documentation.