Its all about getting into it: ITIL in context with Production Support.

Reference: https://www.slideshare.net/linpeizhang/deal-with-production-issues-the-itil-way

1. Deal with Production Issues: Suggestions from ITIL

2. Problems to solve

Long resolution time

Neglected issues

Issues we lose track of until our users remind us

Recurring issues

Inconsistency in response time

Developers are constantly distracted by having to resolve issues

3. Goal

Manage issues in a consistent manner

Fast resolution

Reduce client impact

Proactively resolve issues before they impact clients

4. Basic Concepts

Incidents

Any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service

Problems

A problem is a condition often identified as the cause of multiple incidents that exhibit common symptoms.

Known Errors

A known error is a condition identified by successful diagnosis of the root cause of a problem, and the subsequent development of a workaround

5. Relationship of the three

A Problem is the root cause of its Incidents

An Incident is the manifestation of an underlying Problem

One Problem can cause many Incidents

A Known Error is a Problem with a known root cause and a known workaround
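
The relationship above can be sketched as a small data model. This is only an illustration in Python: the class name, field names and IDs are hypothetical, and the Known Error test follows the definition given here (known root cause plus known workaround).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Problem:
    """One Problem can be linked to many Incidents (hypothetical model)."""
    problem_id: str
    symptoms: str
    root_cause: Optional[str] = None   # unknown until diagnosis succeeds
    workaround: Optional[str] = None   # unknown until one is developed
    incident_ids: list[str] = field(default_factory=list)

    @property
    def is_known_error(self) -> bool:
        # Known Error = Problem with a known root cause AND a known workaround
        return self.root_cause is not None and self.workaround is not None


problem = Problem("PRB-1", "Nightly batch job fails")
problem.incident_ids += ["INC-7", "INC-9"]      # two incidents, one problem
problem.root_cause = "Disk fills up during export"
problem.workaround = "Purge temp files before the job starts"
print(problem.is_known_error)
```

Until both the root cause and the workaround are filled in, the record stays an ordinary Problem.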

6. Manage Incident vs. Manage Problem

Different goals

Incident Management focuses on restoring service operation as quickly as possible

Problem Management focuses on finding and eliminating the root cause

Different actions

Incident management applies workarounds or temporary fixes to quickly restore the services

Problem management issues a change to fundamentally eliminate the root cause

Incident management is reactive and problem management is proactive

Incident management emphasizes speed and problem management emphasizes quality

7. Common mistakes

Spending tremendous time and effort on finding the root cause before the service level is restored

Stopping the investigation once an incident has been fixed by a workaround

Letting the same incident recur without ever understanding the root cause

8. Solutions from ITIL

Separate out Incident Management and Problem Management into two independent but related processes

Handle incidents (restore service) as quickly as possible

Proactively and independently work on resolving problems

Wisely manage Known Errors

9. Incident Management

Always remember the goal is to “Restore service level as quickly as possible”

How to go fast?

Classification

Match known errors and known workarounds

Appropriate escalation

Go fast, but don’t go crazy. Don’t skip:

Record

Prioritize

Follow up

10. Incident Management Process

11. Acceptance and Record

Benefits of recording

Helps diagnose new incidents based on known incidents

Helps Problem Management find the root cause

Makes it easy to determine the impact

Makes it possible to track and control issue resolution

Incident Reporting Channels

User

System Monitor/Alert

IT person

12. Incident Record

Unique ID

Basic diagnosis info

Timestamp

Symptoms

User info (name, contact info)

Who’s responsible

Additional information

Screenshots

Logs

Status

New, Accepted, Scheduled, Assigned, Active, Suspended, Resolved, Terminated
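
A minimal Python sketch of such a record, using the fields and status values listed above. The class and field names, the in-memory ID counter and the sample values are assumptions for illustration; a real tracking tool would assign IDs and store the data itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from itertools import count


class Status(Enum):
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


_next_id = count(1)  # hypothetical ID source, stands in for the tracker


@dataclass
class IncidentRecord:
    symptoms: str
    user_name: str
    user_contact: str
    owner: str  # who is responsible
    incident_id: int = field(default_factory=lambda: next(_next_id))
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    screenshots: list[str] = field(default_factory=list)  # file names
    logs: list[str] = field(default_factory=list)         # file names
    status: Status = Status.NEW


incident = IncidentRecord(
    symptoms="Login page returns HTTP 500",
    user_name="Jane Doe",
    user_contact="jane.doe@example.com",
    owner="Service Desk",
)
incident.logs.append("app-server.log")
incident.status = Status.ACCEPTED
```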

13. Classification

Classification

Possible reasons (application, network, database, business logic, etc.)

Supporting group (application group, database group, infrastructure group, network group, etc.)

Prioritize

Priority = Impact × Urgency

Determine resolution timeline (resolve within X hours) based on Service Level Agreement
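
As a rough Python sketch of this calculation: the 1 to 3 scales and the score thresholds below are assumptions chosen for illustration, not ITIL rules. The 2, 4, 6 and 8 hour timelines are the ones given for priorities 1 to 4 in the escalation table of slide 17; a real Service Level Agreement would define its own.

```python
# Resolution timeline in hours per priority (values from the slide 17 table).
RESOLUTION_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}


def priority(impact: int, urgency: int) -> int:
    """Map impact x urgency to a priority, 1 being the most urgent.

    impact and urgency use an assumed scale of 1 (low) to 3 (high).
    The thresholds below are illustrative only.
    """
    if impact not in (1, 2, 3) or urgency not in (1, 2, 3):
        raise ValueError("impact and urgency must be 1, 2 or 3")
    score = impact * urgency
    if score >= 9:
        return 1
    if score >= 6:
        return 2
    if score >= 3:
        return 3
    return 4


def resolve_within_hours(impact: int, urgency: int) -> int:
    """Look up the resolution timeline for the computed priority."""
    return RESOLUTION_HOURS[priority(impact, urgency)]


print(priority(3, 2), resolve_within_hours(3, 2))
```

Whatever scale is chosen, the point is that the same inputs always produce the same priority and deadline, which addresses the inconsistency in response time listed among the problems to solve.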

14. Preliminary Support

Preliminary Response

Acknowledge acceptance

Collect basic info

Provide basic help to the user

Service Requests

A Service Request is a standard service such as checking a status or resetting a password

Follow the standard procedure to handle service requests

15. Match

Match known errors

Known solution

Known workaround

Known resolution procedure

Match existing incidents

Link the new incident with the existing incidents

Increase the impact level of the existing incident

If the existing incident is already being worked on, inform the responsible person or group
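
One simple way to match a new incident against known errors is keyword overlap on the symptoms. The Python sketch below assumes that approach; the class, the keyword sets, the IDs and the workarounds are all hypothetical, and a real Problem Database would likely offer its own search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    error_id: str
    symptom_keywords: frozenset[str]
    workaround: str


def match_known_errors(
    symptoms: str, known_errors: list[KnownError], min_overlap: int = 2
) -> list[KnownError]:
    """Return known errors sharing at least min_overlap keywords, best first.

    Words are split on whitespace only, so punctuation is not stripped.
    """
    words = set(symptoms.lower().split())
    scored = [(len(ke.symptom_keywords & words), ke) for ke in known_errors]
    hits = [(score, ke) for score, ke in scored if score >= min_overlap]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [ke for _, ke in hits]


catalog = [
    KnownError("KE-1", frozenset({"login", "timeout", "500"}),
               "Restart the session service"),
    KnownError("KE-2", frozenset({"report", "export", "blank"}),
               "Re-run the export with the cache disabled"),
]

for match in match_known_errors("Login page shows 500 after timeout", catalog):
    print(match.error_id, "->", match.workaround)
```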

16. Investigate and Diagnose

Escalation

Functional escalation (technical escalation): involve more technical experts, teams in other functional groups, or external suppliers

Hierarchical escalation (management escalation): escalate to a higher level of management

17. Escalation by Priorities

A (Service Desk)

B (Second Line)

C (Third Line, Supplier)

D (Incident Manager)

E (Division Management)

F (Corporate Management)

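Escalation checkpoints: 0 Minute, 10 Minute, 30% timeline, 60% timeline, 100% timeline

Priority | Resolution timeline | Levels involved, in escalation order
1 | 2 hr | A, B, C and D, E and F
2 | 4 hr | A, B, C, D, E and F
3 | 6 hr | A, B, C, D
4 | 8 hr | A, B, C

18. Investigation Activities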

Assign dedicated support person

Collect basic info

Query historical data

Recent releases

Recent changes

Workload trend

Analyze

Again, don’t spend too much time on finding the root cause. Find a workaround as soon as possible!

19. Resolve and recover

Resolution (workarounds or permanent fix)

Create a Request For Change (RFC)

Approve RFC

Implement Change.

Record the analysis, the root cause, the workaround and the solution

Leave the incident in Open status when resolution hasn’t been found

20. Termination

Contact the user to confirm the incident is resolved

Change the Incident status to “Closed”

Update the Incident record to reflect the final priority, impact, user and root cause

21. Track and Monitor

Assign an owner to each incident. Usually it’s the Service Desk person.

Provide feedback to the users after a change

Enforce the escalation based on the priority
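
Time-bound escalation can be enforced mechanically. The Python sketch below uses the priority 2 row of the escalation table (4 hour resolution timeline), assuming its five levels map in order onto the five checkpoints: A at 0 minutes, B at 10 minutes, C at 30%, D at 60%, and E and F at 100% of the timeline. The function names are hypothetical, and the other priorities would need their own schedules.

```python
from datetime import timedelta


def escalation_schedule(resolution: timedelta):
    """Checkpoints for a priority 2 incident (assumed mapping, see above)."""
    return [
        (timedelta(minutes=0), ["A"]),    # Service Desk
        (timedelta(minutes=10), ["B"]),   # Second Line
        (resolution * 0.3, ["C"]),        # Third Line, Supplier
        (resolution * 0.6, ["D"]),        # Incident Manager
        (resolution, ["E", "F"]),         # Division and Corporate Management
    ]


def levels_involved(elapsed: timedelta, resolution: timedelta) -> list[str]:
    """Return every level that should be involved after `elapsed` time."""
    involved = []
    for threshold, levels in escalation_schedule(resolution):
        if elapsed >= threshold:
            involved.extend(levels)
    return involved


timeline = timedelta(hours=4)
print(levels_involved(timedelta(minutes=90), timeline))
```

A tracker could run a check like this periodically and notify any level that should be involved but has not yet been informed.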

22. Problem Management

Problem Control

Find the root cause of a problem

Turn a problem into a Known Error

Error Control

Control and Monitor the Known Errors until they are appropriately handled

Proactive Problem Management

Resolve problems before they cause any incidents

23. Problem Control

24. Identify Problems

Analyze the trends of incidents

Likely to reoccur

More are likely to occur

Likely to have larger impact

Analyze the weaknesses of the infrastructure

Availability

Capability

A significant incident (outage)
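
Trend analysis can start very simply, for example by counting recent incidents per classification and flagging the ones that keep coming back. The Python sketch below assumes incidents are available as (timestamp, classification) pairs; the 30 day window, the threshold of three and the sample data are arbitrary illustrations.

```python
from collections import Counter
from datetime import datetime, timedelta


def recurring_candidates(incidents, now, window=timedelta(days=30), threshold=3):
    """Return (classification, count) pairs seen at least `threshold` times
    within `window` before `now`, most frequent first.

    incidents: iterable of (timestamp, classification) tuples.
    """
    cutoff = now - window
    counts = Counter(cls for ts, cls in incidents if ts >= cutoff)
    return [(cls, n) for cls, n in counts.most_common() if n >= threshold]


now = datetime(2018, 7, 4)
history = [
    (now - timedelta(days=1), "database"),
    (now - timedelta(days=5), "database"),
    (now - timedelta(days=10), "database"),
    (now - timedelta(days=2), "network"),
    (now - timedelta(days=40), "database"),  # outside the window
]
print(recurring_candidates(history, now))
```

Each flagged classification is a candidate for a new Problem record, to be investigated independently of the individual incidents.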

25. Diagnosis

Recreate incident in testing environment

Link the modules with incidents

Review the latest changes

After the root cause of a problem is found, this problem becomes a Known Error

26. Temporary Fixes

It’s important to find a temporary fix if the problem causes significant incidents

If a temporary fix involves changes to the infrastructure, a Request For Change must be submitted. (Later, another RFC may be submitted to fix the root cause)

For urgent problems, the Emergency Change Request process should be initiated.

27. Error Control

28. Identify and Record Known Errors

Identify

Find the root cause of a problem

Link a problem with a known error

Record

Assign an ID

Symptoms

Root cause

Status

Notification

Notify the Incident Management team so that they can associate new incidents with known errors

29. Determine the solution

Evaluate based on

Service Level Agreement

Impact and Urgency

Cost and benefit

Possible solutions

Temporary fixes

Permanent fixes

No fix (cost is greater than benefits)

Record the decision in the Problem Database

30. Known Errors from other environments

Known errors from the development environment

We may choose to release with some minor known issues

Known errors from suppliers

Usually reported in the release notes

Record, Monitor and Track those known errors

Relate problems with those known errors

31. PIR (Post Implementation Review)

Normal problems

Confirm all the related incidents are closed

Verify if the problem record is complete (symptoms, root cause and solutions)

Change the problem status to Resolved

Significant problems

What went well?

What went wrong?

How to do better next time?

How to prevent similar issues from happening again?

32. Track and Monitor

Track the full lifecycle of each known error

Reevaluate impact and urgency. Adjust the priorities accordingly.

Monitor the progress of the diagnosis and implementation of the solution. Monitor the implementation of the RFC.

33. Proactive Problem Management

Focus on the quality of the service and the infrastructure

Analyze operational trends

Detect the potential incidents and prevent them from happening

Find out the weak points of the infrastructure or the overloaded components

34. Ideas to improve our Production Support process

Idea 1: Create an independent Problem Management Team.

Idea 2: Create a Problem Database

Idea 3: Define the Production Support Procedure

Idea 4: Review and revise the procedures of using TeamTrack

Idea 5: Enforce Post Implementation Review

Idea 6: Proactively manage problems

Idea 7 (optional): Acquire Service Desk software to facilitate the process

35. Create an independent Problem Management Team.

Can be a full time team or a part time team

Appoint a Problem Management Manager. This must be a different person from the Production Support Manager, because their goals, schedules and requirements are different.

Responsible for managing all the production problems (not incidents) for multiple applications

Identify problems

Record problems

Find and evaluate solutions

Track the progress till closure

Work closely with the existing Production Support team.

36. Create a Problem Database

An easy-to-search knowledge database

Include problems and known errors

Track symptoms, root causes, temporary fixes, workarounds, and permanent solutions

Include all the known errors in DEV and unresolved or deferred defects in QA/RATE environments

Maintained by the Problem Management Team

Used by the Production Support team to match incidents and resolve them quickly

37. Define the Production Support Procedure (Work Instructions)

Create a formal and detailed document. Train the Production Support team to follow the new procedure

Start with ITIL Incident Management Process. Adjust it to our own situation and tools

Clearly define how to calculate priorities

Clearly define the time-bound escalation procedure

Clearly define the monitoring and tracking steps

38. Review and define the procedure of using TeamTrack

TeamTrack is our existing Incident Tracking system

Review the functions of TeamTrack

Redefine the incident escalation process according to ITIL suggestions

Define the interface between PC Support and IT Production Support Team

Communication channel

Roles and responsibilities

Escalation

Track and Control

Knowledge sharing

39. Enforce PIR

Contact each user to confirm all the incidents are closed

Make sure the Problem record is complete and useful

Identify issues in the Incident and Problem Management process. Add those to Problem database.

40. Proactively Manage Problems

Responsibility of the Problem Management Team.

Perform the following activities:

Analyze incidents to find the trend

Analyze infrastructure to identify possible bottlenecks

Run fail-over and stress tests

Apply a problem solution across multiple related applications

Establish and maintain the Production Monitor System to proactively detect system anomalies

Evaluate how many problems are proactively identified and resolved

41. Service Desk Software

Evaluate the existing TeamTrack software and see if it covers our needs

Other popular options

HP OpenView Service Desk

Remedy Strategic Service Suite

CA Unicenter Service Desk


Wednesday, July 4, 2018
