Reference: https://www.slideshare.net/linpeizhang/deal-with-production-issues-the-itil-way
Deal with Production Issues: Suggestions from ITIL
Problems to solve
- Long resolution time
- Neglected issues
- Issues we lose track of until our users remind us
- Recurring issues
- Inconsistency in response time
- Developers are constantly distracted by having to resolve issues
- Manage issues in a consistent manner
- Fast resolution
- Reduce client impact
- Proactively resolve issues before they impact clients
- Incidents
- Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service
- Problems
- A problem is a condition often identified as the cause of multiple incidents that exhibit common symptoms.
- Known Errors
- A known error is a condition identified by successful diagnosis of the root cause of a problem, and subsequent development of a workaround
- A Problem is the root cause of its Incidents
- An Incident is the manifestation of an underlying Problem
- One Problem can cause many Incidents
- A Known Error is a Problem with a known root cause and a known workaround
- Different goals
- Incident Management focuses on restoring the service operation as quickly as possible
- Problem Management focuses on finding and eliminating the root cause
- Different actions
- Incident management applies workarounds or temporary fixes to quickly restore the services
- Problem management issues a change to fundamentally eliminate the root cause
- Incident management is reactive and problem management is proactive
- Incident management emphasizes speed and problem management emphasizes quality
- Spending tremendous time and effort to find the root cause before the service level is recovered
- Stopping the investigation after an incident is fixed by a workaround
- The same incident occurring repeatedly without an understanding of the root cause
- Separate out Incident Management and Problem Management into two independent but related processes
- Handle incidents (restore service) as quickly as possible
- Proactively and independently work on resolving problems
- Wisely manage Known Errors
- Always remember the goal is to “Restore service level as quickly as possible”
- How to go fast?
- Classification
- Match known errors and known workarounds
- Appropriate escalation
- Go fast, but don’t go crazy. Don’t skip:
- Record
- Prioritize
- Follow up
- Benefits of recording
- Helps diagnose new incidents based on known incidents
- Helps Problem Management find the root cause
- Makes it easy to determine the impact
- Makes it possible to track and control the issue resolution
- Incident Reporting Channels
- User
- System Monitor/Alert
- IT person
- Unique ID
- Basic diagnosis info
- Timestamp
- Symptoms
- User info (name, contact info)
- Who’s responsible
- Additional information
- Screenshots
- Logs
- Status
- New, Accepted, Scheduled, Assigned, Active, Suspended, Resolved, Terminated
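The record fields listed above map naturally onto a small data structure. The following is a minimal sketch, not the format of any particular tracking tool: the class name `IncidentRecord`, the field names, and the counter-based ID scheme are assumptions made for illustration. Only the field list and the status values come from the list above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from itertools import count


class Status(Enum):
    """Status values taken from the list above."""
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


# Hypothetical ID scheme: a simple in-process running counter.
_next_id = count(1)


@dataclass
class IncidentRecord:
    symptoms: str
    user_name: str
    user_contact: str
    owner: str  # who is responsible for the incident
    attachments: list[str] = field(default_factory=list)  # screenshots, logs
    status: Status = Status.NEW
    incident_id: int = field(default_factory=lambda: next(_next_id))
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# Example usage with made-up values.
incident = IncidentRecord(
    symptoms="Login page returns an error",
    user_name="A. User",
    user_contact="a.user@example.com",
    owner="service-desk",
)
incident.status = Status.ACCEPTED
```

A real system would persist these records and generate IDs centrally; the counter here only keeps the sketch self-contained.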
- Classification
- Possible reasons (application, network, database, business logic, etc.)
- Supporting group (application group, database group, infrastructure group, network group, etc.)
- Prioritize
- Priority = Impact × Urgency
- Determine resolution timeline (resolve within X hours) based on Service Level Agreement
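As a rough illustration of the priority rule and the SLA-based timeline, here is a minimal sketch. The 1 to 3 scales for impact and urgency and the hour values in `SLA_HOURS` are hypothetical; real values would come from the organization's own Service Level Agreement.

```python
# Hypothetical scales: 1 = low, 2 = medium, 3 = high.
VALID_LEVELS = (1, 2, 3)

# Hypothetical SLA table: (minimum priority score, resolve within N hours),
# ordered from the highest score band to the lowest.
SLA_HOURS = [(6, 4), (3, 24), (1, 72)]


def priority(impact: int, urgency: int) -> int:
    """Priority = Impact x Urgency, as stated above."""
    if impact not in VALID_LEVELS or urgency not in VALID_LEVELS:
        raise ValueError("impact and urgency must be 1, 2, or 3")
    return impact * urgency


def resolution_hours(score: int) -> int:
    """Look up the resolution deadline for a priority score."""
    for minimum, hours in SLA_HOURS:
        if score >= minimum:
            return hours
    raise ValueError("score is below the lowest SLA band")


# A high-impact, high-urgency incident falls in the tightest band.
outage_deadline = resolution_hours(priority(impact=3, urgency=3))
```

Keeping the table as data rather than as branching logic makes it easier to adjust when the SLA changes.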
- Preliminary Response
- Acknowledgement of acceptance
- Collect basic info
- Provide basic help to the user
- Service Requests
- A Service Request is a standard service such as checking a status or resetting a password.
- Go through standard procedure to handle service requests
- Match known errors
- Known solution
- Known workaround
- Known resolution procedure
- Match existing incidents
- Link the new incident with the existing incidents
- Increase the impact level of the existing incident
- If the existing one is already being worked on, inform the responsible person/group
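Matching a new incident against known errors could be sketched as below. This assumes a deliberately naive keyword match on the symptom text. The `KnownError` class, its fields, and the sample entries are hypothetical, and a real knowledge base would likely need a proper search index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    error_id: str
    keywords: frozenset[str]  # lowercase words that characterize the symptoms
    workaround: str


def match_known_errors(symptoms: str, known_errors: list[KnownError]) -> list[KnownError]:
    """Return the known errors whose keywords all appear in the symptom text."""
    words = set(symptoms.lower().split())
    return [ke for ke in known_errors if ke.keywords <= words]


# Made-up known errors for illustration only.
known = [
    KnownError("KE-1", frozenset({"login", "timeout"}), "Restart the session service"),
    KnownError("KE-2", frozenset({"report", "blank"}), "Clear the report cache"),
]

matches = match_known_errors("Login page shows timeout after submit", known)
```

When a match is found, the service desk can apply the recorded workaround immediately instead of starting a new diagnosis.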
- Escalation
- Functional escalation (technical escalation): Involve more technical experts, teams in other functional groups, or external suppliers
- Hierarchical escalation (management escalation): Escalate to a higher-level management team
- A (Service Desk)
- B (Second Line)
- C (Third Line, Supplier)
- D (Incident Manager)
- E (Division Management)
- F (Corporate Management)
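The two escalation paths above can be sketched as two ladders. The level names come from the list above; the time thresholds (fractions of the SLA resolution time already used) are hypothetical and would need to be set by each organization.

```python
from typing import Optional

# Functional (technical) ladder: levels A to C above.
FUNCTIONAL = ["Service Desk", "Second Line", "Third Line / Supplier"]

# Hierarchical (management) ladder: levels D to F above.
HIERARCHICAL = ["Incident Manager", "Division Management", "Corporate Management"]

# Hypothetical thresholds: the fraction of SLA time used at which each
# management level is notified.
THRESHOLDS = [0.5, 0.75, 1.0]


def next_functional_line(current: str) -> Optional[str]:
    """Return the next technical support line, or None at the end of the ladder."""
    index = FUNCTIONAL.index(current)
    return FUNCTIONAL[index + 1] if index + 1 < len(FUNCTIONAL) else None


def management_to_notify(elapsed_hours: float, sla_hours: float) -> list[str]:
    """Return every management level whose time threshold has been reached."""
    if sla_hours <= 0:
        raise ValueError("sla_hours must be positive")
    used = elapsed_hours / sla_hours
    return [level for level, limit in zip(HIERARCHICAL, THRESHOLDS) if used >= limit]


# Three hours into a four-hour SLA: two management levels are involved.
notified = management_to_notify(elapsed_hours=3, sla_hours=4)
```

Keeping the ladders separate reflects that functional escalation adds expertise while hierarchical escalation adds authority, and the two can proceed independently.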
- Assign dedicated support person
- Collect basic info
- Query historical data
- Recent releases
- Recent changes
- Workload trend
- Analyze
- Again, don’t spend too much time on finding the root cause. Find a workaround as soon as possible!
- Resolution (workarounds or permanent fix)
- Create a Request For Change (RFC)
- Approve RFC
- Implement Change.
- Record the analysis, the root cause, the workaround and the solution
- Leave the incident in Open status when a resolution hasn’t been found
- Contact the user to confirm incident is resolved
- Change the Incident status to “Closed”
- Update the Incident record to reflect the final priority, impact, user and root cause
- Assign an owner to each incident. Usually it’s the Service Desk person.
- Provide feedback to the users after a change
- Enforce the escalation based on the priority
- Problem Control
- Find the root cause of a problem
- Turn a problem into a Known Error
- Error Control
- Control and Monitor the Known Errors until they are appropriately handled
- Proactive Problem Management
- Resolve problems before they cause any incidents
- Analyze the trends of incidents
- Likely to reoccur
- Likely more will occur
- Likely to have larger impact
- Analyze the weakness of the infrastructure
- Availability
- Capability
- A significant incident (outage)
- Recreate incident in testing environment
- Link the modules with incidents
- Review the latest changes
- After the root cause of a problem is found, this problem becomes a Known Error
- It’s important to find a temporary fix if the problem causes a significant incident
- If the temporary fix involves changes in the infrastructure, a Request For Change must be submitted. (Later, another RFC may be submitted to fix the root cause)
- For urgent problems, the Emergency Change Request process should be initiated.
- Identify
- Find the root cause of a problem
- Link a problem with a known error
- Record
- Assign an ID
- Symptoms
- Root cause
- Status
- Notification
- Notify the incident management team so they can associate new incidents with known errors
- Evaluate based on
- Service Level Agreement
- Impact and Urgency
- Cost and benefit
- Possible solutions
- Temporary fixes
- Permanent fixes
- No fix (cost is greater than benefits)
- Record the decision in the Problem Database
- Known errors from development environment
- We may choose to release with some minor known issues
- Known errors from suppliers
- Usually reported in the release notes
- Record, Monitor and Track those known errors
- Relate problems with those known errors
- Normal problems
- Confirm all the related incidents are closed
- Verify that the problem record is complete (symptoms, root cause and solutions)
- Change the problem status to Resolved
- Significant problems
- What went well?
- What went wrong?
- How to do better next time?
- How to prevent similar issues from happening again?
- Track the full lifecycle of each known error
- Reevaluate impact and urgency. Adjust the priorities accordingly.
- Monitor the progress of the diagnosis and implementation of the solution. Monitor the implementation of the RFC.
- Focus on the quality of the service and the infrastructure
- Analyze operational trends
- Detect the potential incidents and prevent them from happening
- Find out the weak points of the infrastructure or the overloaded components
- Idea 1: Create an independent Problem Management Team.
- Idea 2: Create a Problem Database
- Idea 3: Define the Production Support Procedure
- Idea 4: Review and revise the procedures of using TeamTrack
- Idea 5: Enforce Post Implementation Review
- Idea 6: Proactively manage problems
- Idea 7 (optional): Acquire Service Desk software to facilitate the process
- Can be a full time team or a part time team
- Appoint a Problem Management Manager. This must be a different person from the Production Support Manager, because their goals, schedules and requirements are different.
- Responsible for managing all the production problems (not incidents) for multiple applications
- Identify problems
- Record problems
- Find and evaluate solutions
- Track the progress till closure
- Work closely with the existing Production Support team.
- An easy-to-search knowledge database
- Include problems and known errors
- Track symptoms, root causes, temporary fixes, workarounds, and permanent solutions
- Include all the known errors in DEV and unresolved or deferred defects in QA/RATE environments
- Maintained by the Problem Management Team
- Used by the Production Support team to match incidents against known errors and resolve them quickly
- Create a formal and detailed document. Train the Production Support Team to follow the new procedure
- Start with the ITIL Incident Management process. Adjust it to our own situation and tools
- Clearly define how to calculate priorities
- Clearly define the time-bound escalation procedure
- Clearly define the monitoring and tracking steps
- TeamTrack is our existing Incident Tracking system
- Review the functions of TeamTrack
- Redefine the incident escalation process according to ITIL suggestions
- Define the interface between PC Support and IT Production Support Team
- Communication channel
- Roles and responsibilities
- Escalation
- Track and Control
- Knowledge sharing
- Contact each user to confirm all the incidents are closed
- Make sure the Problem record is complete and useful
- Identify issues in the Incident and Problem Management process. Add those to the Problem Database.
- Responsibility of the Problem Management Team.
- Perform the following activities:
- Analyze incidents to find the trend
- Analyze infrastructure to identify possible bottlenecks
- Run fail-over and stress tests
- Apply a problem solution across multiple related applications
- Establish and maintain the Production Monitor System to proactively detect system anomalies
- Evaluate how many problems are proactively identified and resolved
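The incident trend analysis listed above could start as simply as counting incidents per classification and flagging categories that recur. The sketch below assumes each incident carries a single category label; the function name and the threshold of three occurrences are hypothetical choices.

```python
from collections import Counter


def recurring_categories(incident_categories: list[str], threshold: int = 3) -> list[str]:
    """Return, in alphabetical order, the categories seen at least `threshold` times."""
    counts = Counter(incident_categories)
    return sorted(name for name, seen in counts.items() if seen >= threshold)


# Made-up classification labels for one review period.
recent = ["database", "network", "database", "application", "database"]
candidates = recurring_categories(recent)
```

Categories returned here are candidates for a problem record; the count alone does not identify a root cause.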
- Evaluate the existing TeamTrack software and see if it covers our needs
- Other popular options
- HP Openview Service Desk
- Remedy Strategic Service Suite
- CA Unicenter Service Desk