Wednesday, July 4, 2018

ITIL in context with Production Support.


Reference: https://www.slideshare.net/linpeizhang/deal-with-production-issues-the-itil-way

1. Deal with Production Issues: Suggestions from ITIL
2. Problems to solve
  • Long resolution time
  • Neglected issues
    • Issues we lose track of until our users remind us
  • Recurring issues
  • Inconsistency in response time
  • Developers are constantly distracted by having to resolve issues
3. Goal
  • Manage issues in a consistent manner
  • Fast resolution
  • Reduce client impact
  • Proactively resolve issues before they impact clients
4. Basic Concepts
  • Incidents
    • Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service
  • Problems
    • A problem is a condition often identified as the cause of multiple incidents that exhibit common symptoms.
  • Known Errors
    • A known error is a condition identified by successful diagnosis of the root cause of a problem, and the subsequent development of a workaround
5. Relationship of the three
  • A Problem is the root cause of one or more Incidents
  • An Incident is the manifestation of an underlying Problem
  • One Problem can cause many Incidents
  • A Known Error is a Problem with a known root cause and a known workaround
6. Manage Incident vs. Manage Problem
  • Different goals
    • Incident Management focuses on restoring the service operation as quickly as possible
    • Problem Management focuses on finding and eliminating the root cause
  • Different actions
    • Incident Management applies workarounds or temporary fixes to quickly restore the services
    • Problem Management issues a change to fundamentally eliminate the root cause
  • Incident Management is reactive and Problem Management is proactive
  • Incident Management emphasizes speed and Problem Management emphasizes quality
7. Common mistakes
  • Spending tremendous time and effort on finding the root cause before the service level is recovered
  • Stopping the investigation after an incident is fixed by a workaround
  • Letting the same incident occur repeatedly without understanding the root cause
8. Solutions from ITIL
  • Separate out Incident Management and Problem Management into two independent but related processes
  • Handle incidents (restore service) as quickly as possible
  • Proactively and independently work on resolving problems
  • Wisely manage Known Errors
9. Incident Management
  • Always remember the goal is to “Restore service level as quickly as possible”
  • How to go fast?
    • Classification
    • Match known errors and known workarounds
    • Appropriate escalation
  • Go fast, but don't go crazy. Don't skip:
    • Record
    • Prioritize
    • Follow up
10. Incident Management Process
11. Acceptance And Record
  • Benefits of recording
    • Helps diagnose new incidents based on known incidents
    • Helps Problem Management find the root cause
    • Makes it easy to determine the impact
    • Makes it possible to track and control the issue resolution
  • Incident Reporting Channels
    • User
    • System Monitor/Alert
    • IT person
12. Incident Record
  • Unique ID
  • Basic diagnosis info
    • Timestamp
    • Symptoms
    • User info (name, contact info)
    • Who’s responsible
  • Additional information
    • Screenshots
    • Logs
  • Status
    • New, Accepted, Scheduled, Assigned, Active, Suspended, Resolved, Terminated
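The record described on this slide can be sketched as a small data structure. This is only an illustration: the class and field names are assumptions (the slide lists the kinds of data, not a schema), while the status values are the ones listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List
import uuid


class Status(Enum):
    # Status values taken from the slide above.
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


@dataclass
class IncidentRecord:
    # Basic diagnosis info (field names are hypothetical).
    symptoms: str
    user_name: str
    user_contact: str
    responsible: str
    # Unique ID and timestamp are generated when the record is created.
    incident_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: datetime = field(default_factory=datetime.now)
    # Additional information, stored here as file paths.
    screenshots: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)
    status: Status = Status.NEW


# Example values are made up for illustration.
incident = IncidentRecord(
    symptoms="Login page returns HTTP 500",
    user_name="Jane Doe",
    user_contact="jane.doe@example.com",
    responsible="Service Desk",
)
incident.logs.append("app-server.log")
incident.status = Status.ACCEPTED
```

Generating the ID and timestamp automatically keeps recording fast, which matters because the record is created while the service is still degraded.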
13. Classification
  • Classification
    • Possible reasons (application, network, database, business logic, etc.)
    • Supporting group (application group, database group, infrastructure group, network group, etc.)
  • Prioritize
    • Priority = Impact X Urgency
    • Determine resolution timeline (resolve within X hours) based on Service Level Agreement
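A minimal sketch of the priority calculation follows. The three-point scales and the bands that turn the product into a priority are assumptions made for illustration; in practice they would come from the Service Level Agreement. The resolution times (2, 4, 6 and 8 hours for priorities 1 to 4) are the ones used in the escalation slide of this deck.

```python
# Hypothetical 1-3 scales where 1 is the highest impact or urgency.
IMPACT = {"high": 1, "medium": 2, "low": 3}
URGENCY = {"high": 1, "medium": 2, "low": 3}

# Resolution timeline in hours per priority (priority 1 is the most urgent).
RESOLUTION_HOURS = {1: 2, 2: 4, 3: 6, 4: 8}


def priority(impact: str, urgency: str) -> int:
    """Map impact x urgency onto priority 1 (highest) to 4 (lowest)."""
    score = IMPACT[impact] * URGENCY[urgency]  # possible values: 1, 2, 3, 4, 6, 9
    # The bands below are an assumption, not part of ITIL.
    if score == 1:
        return 1
    if score <= 3:
        return 2
    if score <= 4:
        return 3
    return 4


def resolution_hours(impact: str, urgency: str) -> int:
    """Look up the resolution timeline for the computed priority."""
    return RESOLUTION_HOURS[priority(impact, urgency)]
```

For example, a high-impact, medium-urgency incident gets a score of 2, which this sketch maps to priority 2 and a 4 hour timeline.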
14. Preliminary Support
  • Preliminary Response
    • Acknowledge acceptance
    • Collect basic info
    • Provide basic help to the user
  • Service Requests
    • A Service Request is a standard service such as checking a status or resetting a password
    • Follow the standard procedure to handle service requests
15. Match
  • Match known errors
    • Known solution
    • Known workaround
    • Known resolution procedure
  • Match existing incidents
    • Link the new incident with the existing incidents
    • Increase the impact level of the existing incident
    • If the existing one is already being worked on, inform the responsible person/group
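Matching a new incident against known errors can be sketched as a simple symptom comparison. Everything here is an assumption for illustration: the `KnownError` fields, the word-overlap (Jaccard) score and the 0.5 threshold. A real Service Desk tool would likely use its own search.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class KnownError:
    error_id: str
    symptoms: str
    workaround: str


def _words(text: str) -> Set[str]:
    """Lower-case the text and split it into a set of words."""
    return set(text.lower().split())


def match_known_error(symptoms: str,
                      known_errors: List[KnownError],
                      min_overlap: float = 0.5) -> Optional[KnownError]:
    """Return the known error whose symptoms overlap most, or None."""
    query = _words(symptoms)
    best, best_score = None, 0.0
    for err in known_errors:
        candidate = _words(err.symptoms)
        union = query | candidate
        # Jaccard similarity: shared words divided by all distinct words.
        score = len(query & candidate) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = err, score
    return best if best_score >= min_overlap else None


# Made-up entries standing in for a known error database.
known = [
    KnownError("KE-1", "login page returns error 500", "restart auth service"),
    KnownError("KE-2", "report export times out", "export in smaller batches"),
]
hit = match_known_error("login page returns error 500 again", known)
```

When a match is found, the Service Desk can apply the recorded workaround immediately instead of starting a new investigation.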
16. Investigation and Diagnosis
  • Escalation
    • Functional escalation (Technical escalation): Involve more technical experts, involve teams in other functional groups, or involve external suppliers
    • Hierarchical escalation (Management escalation): Escalate to higher level management team
17. Escalation by Priorities
  • A (Service Desk)
  • B (Second Line)
  • C (Third Line, Supplier)
  • D (Incident Manager)
  • E (Division Management)
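  • F (Corporate Management)
  • Escalation checkpoints: 0 Minute, 10 Minute, 30% timeline, 60% timeline, 100% timeline
  • Escalation sequence by priority (levels listed in order of escalation)
    • Priority 1 (resolution timeline 2 hr): A, B, C and D, E and F
    • Priority 2 (resolution timeline 4 hr): A, B, C, D, E and F
    • Priority 3 (resolution timeline 6 hr): A, B, C, D
    • Priority 4 (resolution timeline 8 hr): A, B, C
18. Investigation Activities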
  • Assign dedicated support person
  • Collect basic info
  • Query historical data
    • Recent releases
    • Recent changes
    • Workload trend
  • Analyze
  • Again, don’t spend too much time on finding the root cause. Find a workaround as soon as possible!
19. Resolve and recover
  • Resolution (workarounds or permanent fix)
    • Create a Request For Change (RFC)
    • Approve RFC
    • Implement Change.
  • Record the analysis, the root cause, the workaround and the solution
  • Leave the incident in Open status when a resolution hasn’t been found
20. Termination
  • Contact the user to confirm incident is resolved
  • Change the Incident status to “Closed”
  • Update the Incident record to reflect the final priority, impact, user and root cause
21. Track and Monitor
  • Assign an owner to each incident. Usually it’s the Service Desk person.
  • Provide feedback to the users after a change
  • Enforce the escalation based on the priority
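Time-bound escalation can be sketched as a lookup against a schedule. The schedule below is for a priority 2 incident (4 hour timeline) and assumes the five levels listed for that priority on the escalation slide line up one-to-one with its five checkpoints (0 minutes, 10 minutes, 30%, 60% and 100% of the timeline). The function and variable names are hypothetical.

```python
from typing import List, Tuple

# Levels: A = Service Desk, B = Second Line, C = Third Line/Supplier,
# D = Incident Manager, E = Division Management, F = Corporate Management.
TIMELINE_MIN = 4 * 60  # priority 2 resolution timeline in minutes

# (minutes elapsed, level involved from that point on); assumed mapping.
PRIORITY_2_SCHEDULE: List[Tuple[int, str]] = [
    (0, "A"),
    (10, "B"),
    (TIMELINE_MIN * 30 // 100, "C"),
    (TIMELINE_MIN * 60 // 100, "D"),
    (TIMELINE_MIN, "E,F"),
]


def escalation_level(elapsed_min: float,
                     schedule: List[Tuple[int, str]]) -> str:
    """Return the highest escalation level reached after elapsed_min minutes."""
    current = schedule[0][1]
    for threshold, level in schedule:
        if elapsed_min >= threshold:
            current = level
    return current
```

A tracking job could call `escalation_level` periodically for every open incident and notify the returned level when it changes.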
22. Problem Management
  • Problem Control
    • Find the root cause of a problem
    • Turn a problem into a Known Error
  • Error Control
    • Control and Monitor the Known Errors until they are appropriately handled
  • Proactive Problem Management
    • Resolve problems before they cause any incidents
23. Problem Control
24. Identify Problems
  • Analyze the trends of incidents
    • Likely to reoccur
    • More are likely to occur
    • Likely to have larger impact
  • Analyze the weakness of the infrastructure
    • Availability
    • Capability
  • A significant incident (outage)
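A simple form of incident trend analysis is counting incidents per category and flagging the categories that keep coming back. This is a sketch under assumptions: incidents are plain dictionaries with a `category` key, and three occurrences is an arbitrary threshold.

```python
from collections import Counter
from typing import Dict, Iterable, List


def recurring_categories(incidents: Iterable[Dict[str, str]],
                         min_count: int = 3) -> List[str]:
    """Return categories with at least min_count incidents, most frequent first."""
    counts = Counter(inc["category"] for inc in incidents)
    return [cat for cat, n in counts.most_common() if n >= min_count]


# Made-up incident history for illustration.
history = [
    {"category": "database"},
    {"category": "network"},
    {"category": "database"},
    {"category": "database"},
]
candidates = recurring_categories(history)
```

Each category returned is a candidate Problem record for the Problem Management team to investigate.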
25. Diagnosis
  • Recreate incident in testing environment
  • Link the modules with incidents
  • Review the latest changes
  • After the root cause of a problem is found, this problem becomes a Known Error
26. Temporary Fixes
  • It’s important to find a temporary fix if the problem causes a significant incident
  • If the temporary fix involves changes in the infrastructure, a Request For Change must be submitted. (Later, another RFC may be submitted to fix the root cause)
  • For urgent problems, the Emergency Change Request process should be initiated.
27. Error Control
28. Identify and Record Known Error
  • Identify
    • Find the root cause of a problem
    • Link a problem with a known error
  • Record
    • Assign an ID
    • Symptoms
    • Root cause
    • Status
  • Notification
    • Notify the Incident Management team so that they can associate new incidents with known errors
29. Determine the solution
  • Evaluate based on
    • Service Level Agreement
    • Impact and Urgency
    • Cost and benefit
  • Possible solutions
    • Temporary fixes
    • Permanent fixes
    • No fix (cost is greater than benefits)
  • Record the decision in Problem Database
30. Known Errors from other environments
  • Known errors from development environment
    • We may choose to release with some minor known issues
  • Known errors from suppliers
    • Usually reported in the release notes
  • Record, Monitor and Track those known errors
  • Relate problems with those known errors
31. PIR (Post Implementation Review)
  • Normal problems
    • Confirm all the related incidents are closed
    • Verify if the problem record is complete (symptoms, root cause and solutions)
    • Change the problem status to Resolved
  • Significant problems
    • What went well?
    • What went wrong?
    • How to do better next time?
    • How to prevent similar issues from happening again?
32. Track and Monitor
  • Track the full lifecycle of each known error
    • Reevaluate impact and urgency. Adjust the priorities accordingly.
    • Monitor the progress of the diagnosis and implementation of the solution. Monitor the implementation of the RFC.
33. Proactive Problem Management
  • Focus on the quality of the service and the infrastructure
  • Analyze operational trends
  • Detect the potential incidents and prevent them from happening
  • Find out the weak points of the infrastructure or the overloaded components
34. Ideas to improve our Production Support process
  • Idea 1: Create an independent Problem Management Team.
  • Idea 2: Create a Problem Database
  • Idea 3: Define the Production Support Procedure
  • Idea 4: Review and revise the procedures of using TeamTrack
  • Idea 5: Enforce Post Implementation Review
  • Idea 6: Proactively manage problems
  • Idea 7 (optional): Acquire Service Desk software to facilitate the process
35. Create an independent Problem Management Team.
  • Can be a full time team or a part time team
  • Appoint a Problem Management Manager. This must be a different person from the Production Support Manager, because their goals, schedules and requirements are different.
  • Responsible for managing all the production problems (not incidents) for multiple applications
    • Identify problems
    • Record problems
    • Find and evaluate solutions
    • Track the progress till closure
  • Work closely with the existing Production Support team.
36. Create a Problem Database
  • An easy-to-search knowledge database
  • Include problems and known errors
  • Track symptoms, root causes, temporary fixes, workarounds, and permanent solutions
  • Include all the known errors in DEV and unresolved or deferred defects in QA/RATE environments
  • Maintained by the Problem Management Team
  • Will be used by the Production Support team to match and quickly resolve incidents
37. Define the Production Support Procedure (Work Instructions)
  • Create a formal and detailed document. Train the Production Support Team to follow the new procedure
  • Start with ITIL Incident Management Process. Adjust it to our own situation and tools
  • Clearly define how to calculate priorities
  • Clearly define the time-bound escalation procedure
  • Clearly define the monitoring and tracking steps
38. Review and define the procedure of using TeamTrack
  • TeamTrack is our existing Incident Tracking system
    • Review the functions of TeamTrack
    • Redefine the incident escalation process according to ITIL suggestions
  • Define the interface between PC Support and IT Production Support Team
    • Communication channel
    • Roles and responsibilities
    • Escalation
    • Track and Control
    • Knowledge sharing
39. Enforce PIR
  • Contact each user to confirm all the incidents are closed
  • Make sure the Problem record is complete and useful
  • Identify issues in the Incident and Problem Management process. Add those to Problem database.
40. Proactively Manage Problems
  • Responsibility of the Problem Management Team.
  • Perform the following activities:
    • Analyze incidents to find the trend
    • Analyze infrastructure to identify possible bottlenecks
    • Run fail-over and stress tests
    • Apply a problem solution across multiple related applications
    • Establish and maintain the Production Monitor System to proactively detect system anomalies
  • Evaluate how many problems are proactively identified and resolved
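The evaluation in the last bullet can be expressed as a ratio. The sketch below assumes each problem record carries two boolean flags, `resolved` and `proactive` (found before it caused any incident); both names are hypothetical.

```python
from typing import Dict, List


def proactive_ratio(problems: List[Dict[str, bool]]) -> float:
    """Share of resolved problems that were identified proactively."""
    resolved = [p for p in problems if p["resolved"]]
    if not resolved:
        return 0.0  # nothing resolved yet, so there is nothing to measure
    proactive = sum(1 for p in resolved if p["proactive"])
    return proactive / len(resolved)


# Made-up records for illustration.
problems = [
    {"resolved": True, "proactive": True},
    {"resolved": True, "proactive": False},
    {"resolved": False, "proactive": True},
]
ratio = proactive_ratio(problems)
```

Tracking this figure over time shows whether Problem Management is shifting from reacting to incidents toward preventing them.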
41. Service Desk Software
  • Evaluate the existing TeamTrack software and see if it covers our needs
  • Other popular options
    • HP Openview Service Desk
    • Remedy Strategic Service Suite
    • CA Unicenter Service Desk
