Major Incident Problem Communication

Posted: Tue Apr 09, 2019 11:39 am
by Operations-Automated
Tool: Service Now
So new company, new problems.

Current Scenario:
An MI has been raised and resolved; however, instead of the incident being closed, it sits in a sub-state of "awaiting RFO/RCA". A Problem ticket is also raised.
This means that until the RCA has been completed, the ticket remains open regardless of whether all the child incidents have been resolved.
This way, once the MI RCA has been completed, the RCA communication can filter through to the children.
This also catches any chasers (as the customer will email their INC), so no new tickets or extra admin are required.

For me this is not the ITIL way. INCs should be resolved (with a view to closing) as soon as the customer is back to BAU. Leaving tickets open for an RCA will increase MTTR and other such metrics.

Proposed Scenario:
The MI is raised, resolved, and then closed after a 3-day waiting period. A Problem ticket is raised to investigate the root cause.
Once the root cause has been found in the PRB, the PRB can send the communication to the customer, or another mechanism can be found to send the report.

The biggest arguments I am getting against this proposal are that it will create admin, as customers will email in and create new tickets, and that we currently don't have a mechanism outside of an MI (on INC) to communicate to a whole user set.

My questions are:
If this were a Major Service Outage (or a multi-tenanted issue) affecting, let's say, 500 customers, how would you catch the chasers coming into the Service Desk without creating a huge admin headache of requests?

What do you think is the best mechanism within Service Now to communicate an RCA/RFO report to customers regarding the MI?

Re: Major Incident Problem Communication

Posted: Fri Apr 12, 2019 7:28 am
by tedderpd
I'm not an S-N expert, but from a process perspective, I would agree that the incident record should not remain open with a status of "pending RCA". When the consumer/customer is returned to a state of acceptable service, the incident is resolved. To record the need to do an RCA, a problem record should be opened, with any associated "child" incidents attached to that problem record.

To catch the "stragglers", one approach could be to record a known error record containing the circumstances of the incident, the resolution that was taken, and the identifier of the problem record. The new "straggler" incident record would then be attached to the problem record.

Once the root cause is determined and the change resolving it is successfully implemented, the problem record would be updated with that information. Most ITSM tools can be set up so that when a problem record's status is set to "closed", the status of all associated "child" incident records is updated from "resolved" to "closed".
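As a rough illustration of that cascade, here is a minimal, tool-agnostic sketch in Python. It is not ServiceNow code: the `Incident` and `Problem` classes, the state names, and the `close_problem` function are all hypothetical, and a real tool would do this with its own workflow or business rules.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    number: str
    state: str = "resolved"  # hypothetical state names
    notes: list = field(default_factory=list)


@dataclass
class Problem:
    number: str
    state: str = "open"
    root_cause: str = ""
    incidents: list = field(default_factory=list)  # attached child incidents


def close_problem(problem, root_cause):
    """Record the root cause, close the problem, and cascade to child incidents.

    Every attached incident gets the RCA note; only incidents that are
    already "resolved" are moved to "closed". Returns the closed numbers.
    """
    problem.root_cause = root_cause
    problem.state = "closed"
    closed = []
    for inc in problem.incidents:
        inc.notes.append(f"RCA for {problem.number}: {root_cause}")
        if inc.state == "resolved":
            inc.state = "closed"
            closed.append(inc.number)
    return closed
```

Because the note is written to every attached incident, a "straggler" attached later still receives the RCA text without a separate mailing list being maintained.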

To be clear - MTTR is the measure of the elapsed time from logging of the incident to resolution of the incident. It is not a measure of the time from logged to closed.
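A small worked example of that distinction, as a sketch only. The field names (`logged`, `resolved`, `closed`) and the timestamps are made up for illustration; the point is that the closure date never enters the calculation.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average of (resolved - logged); the 'closed' timestamp is ignored."""
    durations = [inc["resolved"] - inc["logged"] for inc in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample data: both incidents stay open for days after resolution.
incidents = [
    {"logged": datetime(2019, 4, 9, 9, 0),
     "resolved": datetime(2019, 4, 9, 11, 0),
     "closed": datetime(2019, 4, 12, 11, 0)},
    {"logged": datetime(2019, 4, 9, 9, 30),
     "resolved": datetime(2019, 4, 9, 10, 30),
     "closed": datetime(2019, 4, 20, 9, 0)},
]

mttr = mean_time_to_resolve(incidents)
print(mttr)  # 1:30:00
```

However long the RCA keeps a record from being closed, MTTR is unaffected as long as the incident is set to "resolved" when service is back.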

Re: Major Incident Problem Communication

Posted: Wed Apr 17, 2019 11:40 pm
by Corde Wagner
Spot on @tedderpd, I agree with you and also with the approach of not allowing the Incident record to remain open indefinitely.

As an add-on approach to this situation, I borrowed from ITIL v2 the concept of "restore": the point where the impacted users are able to resume their work, which marks the incident as over. The time from "detect" to "restored" becomes my time to restore, averaged as mean-time-to-restore-service (MTTRS). "Resolved" is then declared after "restore" and once all "recovery" activities are complete, which gives an additional measure of how long it takes to get from restore, through recovery, to resolved.
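To make the two measures concrete, here is a minimal sketch that splits each record into a detect-to-restored interval and a restored-to-resolved interval and averages each. The field names and sample timestamps are assumptions for illustration, not taken from any particular tool.

```python
from datetime import datetime, timedelta
from statistics import mean


def restore_and_recovery_times(records):
    """Return (mean time to restore service, mean restored-to-resolved time)."""
    restore = [(r["restored"] - r["detected"]).total_seconds() for r in records]
    recovery = [(r["resolved"] - r["restored"]).total_seconds() for r in records]
    return timedelta(seconds=mean(restore)), timedelta(seconds=mean(recovery))


# Hypothetical sample data.
records = [
    {"detected": datetime(2019, 4, 9, 9, 0),
     "restored": datetime(2019, 4, 9, 10, 0),
     "resolved": datetime(2019, 4, 9, 12, 0)},
    {"detected": datetime(2019, 4, 9, 9, 0),
     "restored": datetime(2019, 4, 9, 12, 0),
     "resolved": datetime(2019, 4, 10, 12, 0)},
]

mttrs, mean_recovery = restore_and_recovery_times(records)
print(mttrs)          # 2:00:00
print(mean_recovery)  # 13:00:00
```

Reporting the two figures separately keeps a long recovery tail (for example, waiting on replacement hardware) from distorting the time-to-restore number.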

The time to recover and get to "resolved" is often left hanging because undisciplined technical teams are off to the next fire and the key recovery steps are put off or even forgotten. Pushing to resolution may keep the incident ticket open, but as "resolved" is really what matters most in this scenario, it's more of a win/win to keep it open as a measurement of recovery (which may take almost no time, or may take a month to RMA some hardware component).


Re: Major Incident Problem Communication

Posted: Thu Apr 18, 2019 6:04 am
by tedderpd
Thanks Corde, I appreciate your feedback and insight. Your suggestions would help draw attention to the scenario you describe - the lack of follow-through on actually getting to root cause. This would help stakeholders understand why it's important to focus on the problem - and differentiate it from the incident. If the organization chooses not to resolve the problem - so be it - but that is a very different issue from how well incident management is performing.

Doug Tedder

Re: Major Incident Problem Communication

Posted: Wed Jun 19, 2019 8:14 am
by Operations-Automated
Thank you all for your responses. It looks as though we all share a similar view.

Have a great one