Mean, Median and Mode for MTTR?!

An open discussion on issues related directly or primarily to the service or help desk.
e3d
Newbie
Posts: 3
Joined: Wed Sep 08, 2010 8:00 pm

Thu Sep 09, 2010 2:45 pm

I’m currently re-evaluating the KPIs for the IT Service Desk in my company and have got stuck on how to calculate the MTTR.

Why do we calculate MTTR based on the mean, and not on the mode or the median?

I know the statistical differences between them and the pros and cons of each, but I wanted to ask from the point of view of ITIL best practice.

Could you please advise why the mean is used instead of the median or mode when reporting achieved repair or SLA times, according to ITIL best practice?

And are there any other best practices for calculating these values for weekly/annual reporting (small/big data samples)?


Diarmid
ITIL Expert
Posts: 1894
Joined: Mon Mar 03, 2008 7:00 pm
Location: Helensburgh

Fri Sep 10, 2010 8:32 am

Does ITIL have a view on this?

You collect and consolidate statistics for a purpose, normally either to compare the figures or to be able to act on them. In this context we are talking about rolling figures (or at least I would be), so how does this affect the usefulness of the different statistical averages?

Presumably you do not expect your customers to be statisticians? One obvious value of the mean is that the total time lost in a period is simply derived from the mean and the number of incidents occurring.
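A minimal Python sketch of that property of the mean: because the mean is the total divided by the count, multiplying it back by the number of incidents recovers the total time lost in the period. The restoration times are invented for illustration only.

```python
from statistics import mean

# Hypothetical restoration times (minutes) for one reporting period;
# illustrative values only, not taken from any real report.
restoration_minutes = [12, 45, 8, 30, 95, 20]

mttr = mean(restoration_minutes)
incident_count = len(restoration_minutes)

# mean = total / count, so mean * count gives back the total time lost.
total_lost_minutes = mttr * incident_count

print(f"MTTR: {mttr:.1f} min over {incident_count} incidents")
print(f"Total time lost: {total_lost_minutes:.0f} min")
```

The median and mode have no equivalent property, which is one practical reason the mean is the natural choice when the question is "how much service did we lose?".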

Best practice is to address the prescriptions in the SLA. You negotiate the SLA with the customer, and that is the time to make sure everyone understands what is being agreed. The reporting simply follows this agreement.

I doubt it is ever good to report on data samples in such circumstances. Did you mean data sets? It is common sense that small data sets are not meaningfully represented by averaging of any kind. So, if you only have twenty incidents in a month, it is probably better to itemise them in detail and so prevent the kind of misleading interpretations that variation within them can engender when statistical analysis is applied. It is no consolation that the average restoration time was fifteen minutes when nineteen of them each took two minutes and the other took... well, you work it out, and was the only incident affecting a critical service.
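The arithmetic in that example can be checked in a few lines of Python. This sketch uses only the figures given in the post (twenty incidents, a fifteen-minute average, nineteen two-minute fixes) and derives how long the remaining incident took; it also shows that the median, at two minutes, hides that incident just as badly.

```python
from statistics import mean, median

# Figures from the example: 20 incidents in the month, an average
# restoration time of 15 minutes, and 19 incidents of 2 minutes each.
incident_count = 20
average_minutes = 15
quick_fixes = [2] * 19

# Total time = average x count; whatever the quick fixes don't account
# for must belong to the single remaining incident.
total_minutes = average_minutes * incident_count
remaining_incident = total_minutes - sum(quick_fixes)

all_incidents = quick_fixes + [remaining_incident]
print(f"Remaining incident took {remaining_incident} min")  # 262
print(f"Mean: {mean(all_incidents)} min, median: {median(all_incidents)} min")
```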
"Method goes far to prevent trouble in business: for it makes the task easy, hinders confusion, saves abundance of time, and instructs those that have business depending, both what to do and what to hope."
William Penn 1644-1718
e3d
Newbie
Posts: 3
Joined: Wed Sep 08, 2010 8:00 pm

Sat Sep 11, 2010 10:05 am

Dear Diarmid,

Thank you for your reply.

In this context I'm talking about statistics used to compare the figures and to be able to act on them.

Let me clarify what I was looking for and why.

Recently we finished rolling out a project, and we generated a report measuring the MTTR for the project's incidents as resolved by the different resolver groups.

The data set was 827 incidents created over a two-month period.

I noticed that the mean TTR for incidents resolved at the IT Service Desk was very high (3 hours) and didn't reflect the team's actual performance. When I dug into the data to understand why, I found that this was due to some outliers that inflated the overall MTTR, since the mean is strongly affected by outlier values.

When I used the median TTR for the same incidents, I got a reasonable number (28 minutes), which gave a more accurate picture of IT Service Desk performance because the median is resistant to outliers.
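To illustrate the effect described here, the following Python sketch compares the mean and the median on a small, made-up set of resolution times (these are not the 827 incidents from the report). Two long-running tickets pull the mean up to roughly three hours, while the median stays at the typical value.

```python
from statistics import mean, median

# Made-up resolution times in minutes: most tickets close within about
# half an hour, but two were left open for a long time.
resolution_minutes = [15, 20, 25, 28, 30, 35, 40, 600, 900]

print(f"Mean TTR:   {mean(resolution_minutes):.0f} min")    # pulled up by the outliers
print(f"Median TTR: {median(resolution_minutes):.0f} min")  # the typical ticket

# Making the worst outlier ten times worse moves the mean a lot,
# but leaves the median exactly where it was.
worse = resolution_minutes[:-1] + [9000]
print(f"Mean TTR: {mean(worse):.0f} min, Median TTR: {median(worse):.0f} min")
```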

This is where I started searching for best practices on when to use each of the statistical averages (median, mean and mode).

In most (if not all) of the reports I found, the mean was used because ITIL defines MTTR in terms of the mean, but I didn't know why, or how outliers are dealt with.

I just wanted to draw on your expertise: how do you report MTTR, and how do you deal with outliers so as to give an accurate snapshot of performance?
Diarmid
ITIL Expert
Posts: 1894
Joined: Mon Mar 03, 2008 7:00 pm
Location: Helensburgh

Sat Sep 11, 2010 10:30 am

e3d,

What you describe is not a question of best practice; it is a question of two completely different information perspectives. Speaking loosely, the median of twenty-eight minutes tells you what you usually achieve, and the mean of three hours tells you how much lost service time there has been. It is not an issue of one being better practice than the other. In this case the median is probably useful for your internal assessment of your staff, and the mean indicates the overall service levels you have achieved, which is what your customer is most interested in.

The obvious way to deal with outlier values is to itemize them, give them their context (why they occurred; what, if anything you are doing to prevent them; etc.) and then offer, as a second figure, the mean with them excluded. I would suggest not inflicting concepts like median on your customer, unless they are very savvy about statistics, because, otherwise, they will probably feel you are trying to hide something.
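One way to put this suggestion into practice is sketched below in Python: list the outliers individually so each can be given its context, then report the mean both with and without them. The ticket IDs, times and the 240-minute cut-off are all hypothetical; deciding what counts as an outlier is a judgement to agree with the customer, not something prescribed here.

```python
from statistics import mean

# Hypothetical incidents as (ticket_id, minutes_to_restore); IDs and
# times are invented for illustration.
incidents = [
    ("INC-101", 15), ("INC-102", 22), ("INC-103", 30),
    ("INC-104", 18), ("INC-105", 480), ("INC-106", 25),
]

# Assumed cut-off for an outlier: a judgement call, not a statistical constant.
OUTLIER_THRESHOLD_MINUTES = 240

outliers = [(tid, m) for tid, m in incidents if m > OUTLIER_THRESHOLD_MINUTES]
regular = [m for _, m in incidents if m <= OUTLIER_THRESHOLD_MINUTES]

overall_mean = mean(m for _, m in incidents)
mean_excluding = mean(regular)

print(f"Mean time to restore (all incidents): {overall_mean:.1f} min")
print("Itemised outliers (add cause and preventive action for each):")
for tid, m in outliers:
    print(f"  {tid}: {m} min")
print(f"Mean excluding outliers: {mean_excluding:.1f} min")
```

The headline figure stays the plain mean, so the customer still sees total lost service; the second figure is offered alongside it, not instead of it.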

It is a coincidence (or a devious device of some long-forgotten statistician) that mean, median and mode all start with the letter M. In my experience the acronyms MTTF and MTBF are used quite specifically for "mean time..." and not as some clever way of saying "some form of average beginning with M", and I would not recommend using the acronyms in any other way lest you cause confusion.
"Method goes far to prevent trouble in business: for it makes the task easy, hinders confusion, saves abundance of time, and instructs those that have business depending, both what to do and what to hope."
William Penn 1644-1718
User avatar
e3d
Newbie
Posts: 3
Joined: Wed Sep 08, 2010 8:00 pm

Sat Sep 11, 2010 11:13 am

Diarmid,

I believe I have my answer now; I appreciate your input. :)
Marcel
Senior Itiler
Posts: 63
Joined: Wed Sep 20, 2006 8:00 pm
Location: USA

Wed Sep 15, 2010 3:05 pm

I would like to add a word of caution about using any type of average, such as MTTR. You first need to figure out what problem you are trying to solve.

If all you are interested in is the average duration of a service disruption, then MTTR is just fine. As Diarmid pointed out, MTTR multiplied by the number of incidents gives you the total duration of disruptions. Maybe all you want is to calculate the average cost per incident from the MTTR and a cost per minute. That is a perfectly valid use of MTTR.
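A small Python sketch of that cost calculation, assuming a hypothetical cost of 50 currency units per minute of disruption and invented disruption durations:

```python
from statistics import mean

# Hypothetical durations of last month's disruptions (minutes) and an
# assumed cost per minute of disruption.
disruption_minutes = [10, 25, 40, 5, 70]
COST_PER_MINUTE = 50.0

mttr = mean(disruption_minutes)                  # average disruption length
avg_cost_per_incident = mttr * COST_PER_MINUTE   # average cost per incident
total_cost = avg_cost_per_incident * len(disruption_minutes)

print(f"MTTR: {mttr} min")
print(f"Average cost per incident: {avg_cost_per_incident:.2f}")
print(f"Total cost for the month: {total_cost:.2f}")
```

This only holds if every minute of disruption really costs the same; if costs differ by service or priority, the average hides that.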

If you want to guarantee some type of service level, then I would steer clear of averages. Averages allow for a great deal of variance, or in other words inconsistency in the performance of your process. Your customer will likely expect that you consistently resolve incidents within a certain time over a specific period of time. So your service level would be something like: Resolve 95% of incidents within 4 hours based on monthly measurements (and maybe different service levels for different incident priorities).
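A service level of this kind can be checked per month with a short Python sketch. The incident records and months are invented; the 95% share and the 4-hour target come from the example above, and `monthly_compliance` is a hypothetical helper, not part of any ITIL tool.

```python
from collections import defaultdict

# Hypothetical incident records: (month, hours_to_resolve).
incidents = [
    ("2010-08", 1.5), ("2010-08", 3.0), ("2010-08", 0.5), ("2010-08", 6.0),
    ("2010-09", 2.0), ("2010-09", 1.0), ("2010-09", 3.5), ("2010-09", 0.2),
]

TARGET_HOURS = 4.0   # "within 4 hours"
TARGET_SHARE = 0.95  # "95% of incidents"

def monthly_compliance(records, target_hours):
    """Return {month: fraction of incidents resolved within target_hours}."""
    counts = defaultdict(lambda: [0, 0])  # month -> [on_time, total]
    for month, hours in records:
        counts[month][1] += 1
        if hours <= target_hours:
            counts[month][0] += 1
    return {m: on_time / total for m, (on_time, total) in counts.items()}

for month, share in sorted(monthly_compliance(incidents, TARGET_HOURS).items()):
    status = "met" if share >= TARGET_SHARE else "MISSED"
    print(f"{month}: {share:.0%} within {TARGET_HOURS:g}h -> {status}")
```

Unlike an average, this reports how consistently the target was hit, which is what the customer is being promised.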

The percentage says something about the level of consistency you intend to achieve. The number of hours says something about the speed you are aiming for. Measuring monthly adds to the consistency. If instead you were to measure on an annual basis, you would open up the possibility of resolving every incident on time for the first 51 weeks of the year and then going on vacation the week of Christmas, with extremely crappy service as a result. Chances are your customer won't be happy with that, even though 95% of incidents were resolved on time. With this approach, an outlier value here and there won't impact your SLA. Of course, your customer may still not be comfortable with that risk, and you can add something to your SLA to address it (e.g. that 99% of incidents will be resolved within 1 day, based on monthly measurements). That helps to further 'cap' the accepted resolution time. And if you're really confident, feel free to offer 100%, but you would probably be one of the few service providers comfortable offering that.
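The following Python sketch illustrates both points with an invented year of data: eleven good months, then a December in which most incidents took two days. The two tiers (95% within 4 hours, 99% within 1 day) follow the example targets above; the data and helper names are hypothetical.

```python
# Hypothetical year: 11 months of 20 incidents each, all resolved in 1 hour,
# then a December with 10 incidents, 8 of which took 48 hours.
hours_by_month = {f"2010-{m:02d}": [1.0] * 20 for m in range(1, 12)}
hours_by_month["2010-12"] = [1.0] * 2 + [48.0] * 8

# SLA tiers as (required share of incidents, within this many hours).
SLA_TIERS = [(0.95, 4.0), (0.99, 24.0)]

def share_within(hours, target_hours):
    """Fraction of incidents resolved within target_hours."""
    return sum(h <= target_hours for h in hours) / len(hours)

def tier_results(hours):
    """For each tier, True if the required share was achieved."""
    return [share_within(hours, t) >= share for share, t in SLA_TIERS]

all_year = [h for hours in hours_by_month.values() for h in hours]
print("Annual:", tier_results(all_year))
for month, hours in hours_by_month.items():
    results = tier_results(hours)
    if not all(results):
        print(f"{month}: breached tiers -> {results}")
```

Measured over the whole year, the 95%-within-4-hours tier is still met (222 of 230 incidents, about 96.5%) despite the terrible December; measured monthly, December breaches both tiers. In this made-up data the 99%-within-1-day cap fails even on the annual figure, which is the extra protection the second tier is meant to give.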