Saturday, 2 August 2014

Which Average Should I Use? Skewed Statistics 1

Which average should I use?


Statistics are quoted all the time in the media. However, sometimes the media, accidentally or otherwise, can skew statistics such that they lie. In many of these cases the statistic is technically correct, but presented in a misleading manner. Skewed Statistics is a series of articles, posted on Saturdays, about how different types of statistics should be used, and how they are misused.

Averages are one of the most common types of statistic used. We all probably remember being taught about the different types of averages in school: mean, median, mode, and range. But do we actually know when we should use each type?

Let's be curious and ask, 'which average should I use?'

Mean

Perhaps so named as it requires the most maths, this is what most people think of when they hear the word 'average'. You add up all the numbers in the set and divide by how many numbers there are in the set. If the data are presented in a frequency table, then you multiply each number by its frequency, add all those up, and divide by the total of all the frequencies. For example:

Score Frequency
1 2
2 5
3 8
4 7
5 3
6 1

In this example you would do: (1x2) + (2x5) + (3x8) + (4x7) + (5x3) + (6x1) = 85
Then work out the total of the frequencies: 2+5+8+7+3+1=26
Divide the two numbers: 85/26 = 3.27 would be the mean.
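
The same calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic above: the names `frequencies` and `mean_from_frequencies` are made up for this example, and the numbers are the ones from the table.

```python
# Frequency table from the example above: score -> frequency.
frequencies = {1: 2, 2: 5, 3: 8, 4: 7, 5: 3, 6: 1}


def mean_from_frequencies(table):
    """Return the mean of values stored as a {value: frequency} mapping."""
    # Multiply each score by its frequency and add the results up.
    weighted_total = sum(value * count for value, count in table.items())
    # Add up the frequencies to get the number of data points.
    total_count = sum(table.values())
    return weighted_total / total_count


mean_score = mean_from_frequencies(frequencies)
print(round(mean_score, 2))  # 3.27
```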

So, when should you use a mean?

The mean is best used when there isn't much spread in the data. For example, if you wanted to work out the average time it takes you to make an omelette, you'd use the mean. Assuming you don't have any extreme values, your times are going to be pretty similar and fall within a smallish range. However, if you were distracted by a phone call whilst making one omelette and the time went up to 15 minutes, compared to 2 minutes for the others, you would discard the 15-minute result as it isn't representative of the sample.

Median

With a median you put all the data in ascending order and find the middle. The technique for doing this on paper is to cross out the first, then the last, then the second, then the last-but-one, etc... until you are left with just one number. If you have an even number of data points, then you will be left with two values in the middle. In this case add them together and divide by two in order to find the halfway point between them.
For example: 5, 8, 6, 10, 54, 1, 6, 6, 23, 15, 45, 32, 27, 14, 13, 2
Would be ordered: 1, 2, 5, 6, 6, 6, 8, 10, 13, 14, 15, 23, 27, 32, 45, 54
The middle numbers are 10 and 13, so (10+13)/2 = 11.5 would be the median.
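
As a rough sketch, here is the same method in Python, using the numbers from the example. The helper name `median_by_hand` is an assumption made for illustration; the standard library's `statistics.median` does the same job and is shown alongside as a cross-check.

```python
import statistics

data = [5, 8, 6, 10, 54, 1, 6, 6, 23, 15, 45, 32, 27, 14, 13, 2]


def median_by_hand(values):
    """Sort the values and return the middle one (or the mean of the middle two)."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd number of data points: a single middle value exists.
        return ordered[middle]
    # Even number of data points: halfway between the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_by_hand(data))     # 11.5
print(statistics.median(data))  # 11.5
```
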
So, when should you use the median?
The median is best used when there is a lot of spread in the data, for example if you wanted to work out the average household income. Some households will have a very high income, and others a very low one. The trouble is, if you were to use the mean, one household with an income in the millions would skew the result. The median, however, finds the middle point, with 50 % of households above it and 50 % below.

Mode

This is the easiest of all averages. It is the value that occurs most often. Using the numbers from the median example above, 6 appears the most frequently and so is the mode. Sometimes, there is no mode if every number occurs equally frequently, and likewise you can have multiple modes if several numbers have the same frequency.
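
A minimal Python sketch of this, assuming Python 3.8 or later for `statistics.multimode`, which returns every value tied for most frequent. Note that when every value occurs equally often it returns all of them, which is the 'no mode' case described above.

```python
from statistics import multimode

data = [5, 8, 6, 10, 54, 1, 6, 6, 23, 15, 45, 32, 27, 14, 13, 2]

# 6 occurs three times, more than any other value.
print(multimode(data))          # [6]

# Two values tie for most frequent, so there are two modes.
print(multimode([1, 1, 2, 2]))  # [1, 2]

# Every value occurs once, so all are returned: effectively no mode.
print(multimode([1, 2, 3]))     # [1, 2, 3]
```
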
So, when should you use the mode?
The mode is best used when a 'voting' style is needed. In other words, when you are only interested in what occurs most frequently, and not in everything else. For example, you might ask people what their favourite drink is. If coffee had the most responses, you would say coffee is the favourite. As we will see later, this can have some problems, though.

Range

There are two ways of expressing a range of numbers. You can state the lowest and highest values, as in the range is between 10 and 15. Alternatively, you can state the size of a range, as in the range is 5 (15-10=5).
Both types have their uses. The first type is good if you're interested in the actual values, and the latter if you're interested in the spread (though there are better ways of expressing spread such as standard deviation).

Skewing Averages

So, how can these averages be used to mislead? Well, the most common way is probably not stating which average you mean (no pun intended). For example, if I say the average test score in a class is 60 %, that doesn't really tell you much. It could be that everybody got a unique score between 0 and 50 %, and two people got 60 %, and I'm just quoting the mode. Or am I quoting the mean? If four people got 50 % and one person got 100 %, the mean is 60 %, but is it really correct to say the average is 60 %? If I took the median of that particular example it would give me 50 %. As there is an extreme value in the data, the median would be best. If, however, the results were evenly spread, then the mean would be best.
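
To make the second example concrete, here is a small sketch comparing the three averages for four scores of 50 % and one of 100 %. The list `scores` is simply that example written out.

```python
import statistics

# Four people scored 50 % and one person scored 100 %.
scores = [50, 50, 50, 50, 100]

mean_score = statistics.mean(scores)      # pulled up by the single 100
median_score = statistics.median(scores)  # the middle value
mode_score = statistics.mode(scores)      # the most common value

print("mean:", mean_score)
print("median:", median_score)
print("mode:", mode_score)
```

All three are legitimately 'the average', yet the mean comes out 10 percentage points higher than the other two.
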
Perhaps the biggest culprits when it comes to misleading averages are broadband companies. Whilst practices have been improving, we have all seen adverts saying that average customers receive up to 20 Mb or something like that. First of all, the company would be correct to state that all customers receive up to 20 Mb. 'Up to' is a very misleading term. It essentially tells us the range is 0-20, which is not very useful at all.
Then we have the 'average customer' statement. Well, what does that mean? For broadband speeds, which vary widely depending on area, the median should be used. That way we can see the half-way point. If they're using the mean, then the people in the best areas will skew the average up.
Let's think about average household income again. As previously mentioned, the median should really be used for these statistics, but the media doesn't always do that. Suppose we pick a random (hypothetical) street with 6 households on it. The household incomes are:

Household Income (£)
A 42,000
B 46,000
C 51,000
D 40,500
E 38,000
F 58,500

[Image: "RichardBransonSanDiego8Jul13" by BingNorton - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons.]
In this case the median is £44,000 and the mean is £46,000. As the data is evenly spread it doesn't really matter which is used. But let's say Household D moves out. Richard Branson moves in. Let's say his annual income is £50 million (a guess - it's probably higher!).

The new mean is £8,372,583. Is that a good average to use? No! The median, however, is still only £48,500. That is a more accurate representation of the street.
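
The figures for the street can be checked with a short sketch like the one below. The variable names are illustrative, and the £50 million income is the guess used in this post, not a real figure.

```python
from statistics import mean, median

# Household incomes (£) from the table above.
incomes = {"A": 42_000, "B": 46_000, "C": 51_000,
           "D": 40_500, "E": 38_000, "F": 58_500}

before = list(incomes.values())

# Household D moves out and a household earning £50 million
# (the guess used in this post) moves in.
after = [income for house, income in incomes.items() if house != "D"]
after.append(50_000_000)

print("before:", mean(before), median(before))
print("after: ", round(mean(after)), median(after))
```

One extreme value moves the mean from tens of thousands into the millions, while the median shifts by only a few thousand pounds.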


Whenever you read a statistic in the paper, or hear one on the news, you should ask yourself 'how did they get that?' For averages you should ask 'which average did they use?' and 'is that the best choice?'


Curious Joke

3 statisticians were out hunting. The first missed the target by 10 metres to the left. The second missed by 10 metres to the right. The third statistician shouted, 'Got It!'


Do you have any examples of misleading averages in the media? Do you have anything to add? Let me know in the comments below.

Remember, it is all science. So let's be curious.

2 comments:

  1. "Lies, damned lies, and statistics" is part of a phrase attributed to Benjamin Disraeli. The statement refers to the persuasive power of numbers, the use of statistics to bolster weak arguments, and the tendency of people to disparage statistics that do not support their positions.

    I have observed that both the media and advertisers like to use percentages to try and get their message across. But they never give you the context for the figures, so the stats are misleading.

    For example, suppose a shopkeeper raised the price of a jacket by 50 percent, and then discounted it by 50 percent. The price would not be the same price as it started! Suppose the jacket cost £100. After the 50 percent increase, it would cost £150. Take 50 percent off of £150 and you get a price of £75, rather than the original £100.

    Another common method of misleading with percentages: Suppose the murder rate in a city went down 50 percent in one year, and another 30 percent the following year. Did the rate go down by 80 percent over the two-year period? No! Suppose the initial rate was 100 murders for every 100,000 residents. After a 50 percent reduction, the rate is 50 murders for every 100,000 residents. Another thirty percent reduction means an additional .3 x 50 = 15 per 100,000 reduction. The final rate is then 50 - 15 = 35 murders for every 100,000 residents. On the other hand, an 80 percent reduction of the original rate is .8 x 100 = 80 fewer murders for each 100,000 residents, for a final rate of 20 murders for every 100,000 residents. One shouldn't simply add the percentages.

    Replies
    1. Percentages will be a topic in a future Skewed Statistics post for exactly those reasons. They are ostensibly simple, but the media will often take advantage of that by manipulating them. I might mention your comment in the post.
