3.4: Measures of the Location of the Data


The common measures of location are quartiles and percentiles. Quartiles are special percentiles. The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median, M, is called both the second quartile and the 50th percentile.

To calculate quartiles and percentiles, the data must be ordered from smallest to largest. Quartiles divide ordered data into quarters. Percentiles divide ordered data into hundredths. To score in the 90th percentile of an exam does not mean, necessarily, that you received 90% on a test. It means that 90% of test scores are the same or less than your score and 10% of the test scores are the same or greater than your test score.

Percentiles are useful for comparing values. For this reason, universities and colleges use percentiles extensively. One instance in which colleges and universities use percentiles is when SAT results are used to determine a minimum testing score that will be used as an acceptance factor. For example, suppose Duke accepts SAT scores at or above the 75th percentile. That translates into a score of at least 1220.

Percentiles are mostly used with very large populations. Therefore, if you were to say that 90% of the test scores are less (and not the same or less) than your score, it would be acceptable because removing one particular data value is not significant.

The median is a number that measures the "center" of the data. You can think of the median as the "middle value," but it does not actually have to be one of the observed values. It is a number that separates ordered data into halves. Half the values are the same number or smaller than the median, and half the values are the same number or larger. For example, consider the following data.

1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1

Ordered from smallest to largest:

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5

Since there are 14 observations, the median is between the seventh value, 6.8, and the eighth value, 7.2. To find the median, add the two values together and divide by two.

\[\dfrac{6.8+7.2}{2} = 7\]

The median is seven. Half of the values are smaller than seven and half of the values are larger than seven.
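The same calculation can be sketched in Python. This is only an illustration using the standard library; the variable names are chosen here and are not part of the text.

```python
from statistics import median

data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]

ordered = sorted(data)
n = len(ordered)  # 14 observations, an even count

# With an even count, the median is the mean of the two middle values
# (the seventh and eighth values, which are indexes 6 and 7 in Python).
middle_pair = (ordered[n // 2 - 1], ordered[n // 2])
by_hand = sum(middle_pair) / 2

print(middle_pair)
print(by_hand)
print(median(data))  # statistics.median applies the same rule
```

The data do not need to be sorted before calling `statistics.median`; the function orders them internally.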

Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the data. To find the quartiles, first find the median or second quartile. The first quartile, Q1, is the middle value of the lower half of the data, and the third quartile, Q3, is the middle value, or median, of the upper half of the data. To get the idea, consider the same data set:

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5

The median or second quartile is seven. The lower half of the data are 1, 1, 2, 2, 4, 6, 6.8. The middle value of the lower half is two.

1; 1; 2; 2; 4; 6; 6.8

The number two, which is part of the data, is the first quartile. One-fourth of the entire sets of values are the same as or less than two and three-fourths of the values are more than two.

The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is nine.

The third quartile, Q3, is nine. Three-fourths (75%) of the ordered data set are less than nine. One-fourth (25%) of the ordered data set are greater than nine. The third quartile is part of the data set in this example.
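The median-of-halves procedure described above can be sketched as a small Python function. The name `quartiles_by_halves` is made up for this illustration, and it assumes that when the number of values is odd, the overall median is left out of both halves (the convention followed in this section's worked examples). Statistical software often uses other conventions and can return slightly different quartiles.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    Assumption: with an odd count, the overall median belongs to
    neither half. Needs at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Skip the middle value when n is odd.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]
q1, med, q3 = quartiles_by_halves(data)
print(q1, med, q3)
```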

The interquartile range is a number that indicates the spread of the middle half or the middle 50% of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1).

\[IQR = Q_3 - Q_1 \tag{2.4.1}\]

The IQR can help to determine potential outliers. A value is suspected to be a potential outlier if it lies more than (1.5)(IQR) below the first quartile or more than (1.5)(IQR) above the third quartile; that is, if it is less than Q1 – (1.5)(IQR) or greater than Q3 + (1.5)(IQR). Potential outliers always require further investigation.

Definition: Outliers

A potential outlier is a data point that is significantly different from the other data points. These special data points may be errors or some kind of abnormality or they may be a key to understanding the data.

Example 2.4.1

For the following 13 real estate prices, calculate the IQR and determine if any prices are potential outliers. Prices are in dollars.

389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 387,000; 659,000; 529,000; 575,000; 488,800; 1,095,000

Order the data from smallest to largest.

114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 529,000; 575,000; 639,000; 659,000; 1,095,000; 5,500,000

\[M = 488,800 \nonumber\]

\[Q_{1} = \dfrac{230,500 + 387,000}{2} = 308,750 \nonumber\]

\[Q_{3} = \dfrac{639,000 + 659,000}{2} = 649,000 \nonumber\]

\[IQR = 649,000 - 308,750 = 340,250 \nonumber\]

\[(1.5)(IQR) = (1.5)(340,250) = 510,375 \nonumber\]

\[Q_{1} - (1.5)(IQR) = 308,750 - 510,375 = -201,625 \nonumber\]

\[Q_{3} + (1.5)(IQR) = 649,000 + 510,375 = 1,159,375 \nonumber\]

No house price is less than –201,625. However, 5,500,000 is more than 1,159,375. Therefore, 5,500,000 is a potential outlier.
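The 1.5 × IQR check in this example can be sketched in Python as follows. The function names `iqr_fences` and `potential_outliers` are invented for this illustration, and the quartiles are passed in as already computed rather than derived by the code.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the lower and upper cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def potential_outliers(values, q1, q3, k=1.5):
    """List the values that fall outside the two cutoffs."""
    low, high = iqr_fences(q1, q3, k)
    return [v for v in values if v < low or v > high]

prices = [389_950, 230_500, 158_000, 479_000, 639_000, 114_950, 5_500_000,
          387_000, 659_000, 529_000, 575_000, 488_800, 1_095_000]

# Quartiles taken from the worked example above.
low, high = iqr_fences(308_750, 649_000)
print(low, high)
print(potential_outliers(prices, 308_750, 649_000))
```

A flagged value is only a candidate for further investigation; the code does not decide whether it is an error or a genuine observation.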

Exercise 2.4.1

For the following 11 salaries, calculate the IQR and determine if any salaries are outliers. The salaries are in dollars.

$33,000; $64,500; $28,000; $54,000; $72,000; $68,500; $69,000; $42,000; $54,000; $120,000; $40,500

Answer

Order the data from smallest to largest.

$28,000; $33,000; $40,500; $42,000; $54,000; $54,000; $64,500; $68,500; $69,000; $72,000; $120,000

Median = $54,000

\[Q_{1} = \$40,500 \nonumber\]

\[Q_{3} = \$69,000 \nonumber\]

\[IQR = \$69,000 - \$40,500 = \$28,500 \nonumber\]

\[(1.5)(IQR) = (1.5)(\$28,500) = \$42,750 \nonumber\]

\[Q_{1} - (1.5)(IQR) = \$40,500 - \$42,750 = -\$2,250 \nonumber\]

\[Q_{3} + (1.5)(IQR) = \$69,000 + \$42,750 = \$111,750 \nonumber\]

No salary is less than –$2,250. However, $120,000 is more than $111,750, so $120,000 is a potential outlier.

Example 2.4.2

For the two data sets in the test scores example, find the following:

1. The interquartile range. Compare the two interquartile ranges.
2. Any outliers in either set.

The five number summary for the day and night classes is

| | Minimum | Q1 | Median | Q3 | Maximum |
|---|---|---|---|---|---|
| Day | 32 | 56 | 74.5 | 82.5 | 99 |
| Night | 25.5 | 78 | 81 | 89 | 98 |

1. The IQR for the day group is \(Q_{3} - Q_{1} = 82.5 - 56 = 26.5\)

The IQR for the night group is \(Q_{3} - Q_{1} = 89 - 78 = 11\)

The interquartile range (the spread or variability) for the day class is larger than the night class IQR. This suggests more variation will be found in the day class's test scores.

2. Day class outliers are found using the IQR times 1.5 rule. So,
• \(Q_{1} - IQR(1.5) = 56 - 26.5(1.5) = 16.25\)
• \(Q_{3} + IQR(1.5) = 82.5 + 26.5(1.5) = 122.25\)

Since the minimum and maximum values for the day class are greater than 16.25 and less than 122.25, there are no outliers.

Night class outliers are calculated as:

• \(Q_{1} - IQR(1.5) = 78 - 11(1.5) = 61.5\)
• \(Q_{3} + IQR(1.5) = 89 + 11(1.5) = 105.5\)

For this class, any test score less than 61.5 is an outlier. Therefore, the scores of 45 and 25.5 are outliers. Since no test score is greater than 105.5, there is no upper end outlier.

Exercise 2.4.2

Find the interquartile range for the following two data sets and compare them.

Test Scores for Class A

69; 96; 81; 79; 65; 76; 83; 99; 89; 67; 90; 77; 85; 98; 66; 91; 77; 69; 80; 94

Test Scores for Class B

90; 72; 80; 92; 90; 97; 92; 75; 79; 68; 70; 80; 99; 95; 78; 73; 71; 68; 95; 100

Class A

Order the data from smallest to largest.

65; 66; 67; 69; 69; 76; 77; 77; 79; 80; 81; 83; 85; 89; 90; 91; 94; 96; 98; 99

\(Median = \dfrac{80 + 81}{2} = 80.5\)

\(Q_{1} = \dfrac{69 + 76}{2} = 72.5\)

\(Q_{3} = \dfrac{90 + 91}{2} = 90.5\)

\(IQR = 90.5 - 72.5 = 18\)

Class B

Order the data from smallest to largest.

68; 68; 70; 71; 72; 73; 75; 78; 79; 80; 80; 90; 90; 92; 92; 95; 95; 97; 99; 100

\(Median = \dfrac{80 + 80}{2} = 80\)

\(Q_{1} = \dfrac{72 + 73}{2} = 72.5\)

\(Q_{3} = \dfrac{92 + 95}{2} = 93.5\)

\(IQR = 93.5 - 72.5 = 21\)

The data for Class B has a larger IQR, so the scores between Q3 and Q1 (middle 50%) for the data for Class B are more spread out and not clustered about the median.

Example 2.4.3

Fifty statistics students were asked how much sleep they get per school night (rounded to the nearest hour). The results were:

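| Amount of sleep per school night (hours) | Frequency | Relative frequency | Cumulative relative frequency |
|---|---|---|---|
| 4 | 2 | 0.04 | 0.04 |
| 5 | 5 | 0.10 | 0.14 |
| 6 | 7 | 0.14 | 0.28 |
| 7 | 12 | 0.24 | 0.52 |
| 8 | 14 | 0.28 | 0.80 |
| 9 | 7 | 0.14 | 0.94 |
| 10 | 3 | 0.06 | 1.00 |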

Find the 28th percentile. Notice the 0.28 in the "cumulative relative frequency" column. Twenty-eight percent of 50 data values is 14 values. There are 14 values less than the 28th percentile. They include the two 4s, the five 5s, and the seven 6s. The 28th percentile is between the last six and the first seven. The 28th percentile is 6.5.

Find the median. Look again at the "cumulative relative frequency" column and find 0.52. The median is the 50th percentile or the second quartile. 50% of 50 is 25. There are 25 values less than the median. They include the two 4s, the five 5s, the seven 6s, and eleven of the 7s. The median or 50th percentile is between the 25th, or seven, and 26th, or seven, values. The median is seven.

Find the third quartile. The third quartile is the same as the 75th percentile. You can "eyeball" this answer. If you look at the "cumulative relative frequency" column, you find 0.52 and 0.80. When you have all the fours, fives, sixes and sevens, you have 52% of the data. When you include all the 8s, you have 80% of the data. The 75th percentile, then, must be an eight. Another way to look at the problem is to find 75% of 50, which is 37.5, and round up to 38. The third quartile, Q3, is the 38th value, which is an eight. You can check this answer by counting the values. (There are 37 values below the third quartile and 12 values above.)
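The counting procedure used in this example can be sketched in Python. The function name `percentile_from_frequencies` is hypothetical, and the rule it encodes is inferred from the worked answers: when k% of n is a whole number, average the value in that position and the next one; otherwise round the position up. Whole-number arithmetic is used so that, for instance, 28% of 50 comes out as exactly 14.

```python
def percentile_from_frequencies(freq_table, k):
    """Find the kth percentile from (value, frequency) pairs.

    Assumptions: k is a whole number from 1 to 99, and the rule is
    "average two neighbors when k% of n is a whole number, else round up".
    """
    if not 1 <= k <= 99:
        raise ValueError("k must be between 1 and 99")
    # Expand the table into the full ordered list of observations.
    ordered = []
    for value, count in sorted(freq_table):
        ordered.extend([value] * count)
    n = len(ordered)
    position, remainder = divmod(k * n, 100)
    if remainder == 0:
        # Positions are 1-based in the text, 0-based in Python.
        return (ordered[position - 1] + ordered[position]) / 2
    # Rounding the position up gives 1-based position + 1, index `position`.
    return ordered[position]

sleep = [(4, 2), (5, 5), (6, 7), (7, 12), (8, 14), (9, 7), (10, 3)]
for k in (28, 50, 75):
    print(k, percentile_from_frequencies(sleep, k))
```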

Exercise 2.4.3

Forty bus drivers were asked how many hours they spend each day running their routes (rounded to the nearest hour). Find the 65th percentile.

| Amount of time spent on route (hours) | Frequency | Relative frequency | Cumulative relative frequency |
|---|---|---|---|
| 2 | 12 | 0.30 | 0.30 |
| 3 | 14 | 0.35 | 0.65 |
| 4 | 10 | 0.25 | 0.90 |
| 5 | 4 | 0.10 | 1.00 |

The 65th percentile is between the last three and the first four.

The 65th percentile is 3.5.

Example 2.4.4

Using the sleep table in Example 2.4.3:

1. Find the 80th percentile.
2. Find the 90th percentile.
3. Find the first quartile. What is another name for the first quartile?

Solution

Using the data from the frequency table, we have:

1. The 80th percentile is between the last eight and the first nine in the table (between the 40th and 41st values). Therefore, we need to take the mean of the 40th and 41st values. The 80th percentile \(= \dfrac{8+9}{2} = 8.5\)
2. The 90th percentile will be the 45th data value (location is \(0.90(50) = 45\)) and the 45th data value is nine.
3. Q1 is also the 25th percentile. The 25th percentile location calculation: \(P_{25} = 0.25(50) = 12.5 \approx 13\), the 13th data value. Thus, the 25th percentile is six.

Exercise 2.4.4

Refer to the bus driver table in Exercise 2.4.3. Find the third quartile. What is another name for the third quartile?

The third quartile is the 75th percentile, which is four. The 65th percentile is between three and four, and the 90th percentile is between four and five. The 75th percentile lies between the 65th and 90th percentiles, so it must be four.

COLLABORATIVE STATISTICS

Your instructor or a member of the class will ask everyone in class how many sweaters they own. Answer the following questions:

1. How many students were surveyed?
2. What kind of sampling did you do?
3. Construct two different histograms. For each, starting value = _____ ending value = ____.
4. Find the median, first quartile, and third quartile.
5. Construct a table of the data to find the following:
   a. the 10th percentile
   b. the 70th percentile
   c. the percent of students who own fewer than four sweaters

A Formula for Finding the kth Percentile

If you were to do a little research, you would find several formulas for calculating the kth percentile. Here is one of them.

• \(k =\) the kth percentile. It may or may not be part of the data.
• \(i =\) the index (ranking or position of a data value)
• \(n =\) the total number of data

Order the data from smallest to largest.

Calculate \(i = \dfrac{k}{100}(n + 1)\).

If \(i\) is an integer, then the \(k^{th}\) percentile is the data value in the \(i^{th}\) position in the ordered set of data.

If \(i\) is not an integer, then round \(i\) up and round \(i\) down to the nearest integers. Average the two data values in these two positions in the ordered data set. This is easier to understand in an example.

Example 2.4.5

Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.

18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77

1. Find the 70th percentile.
2. Find the 83rd percentile.

Solution

• \(k = 70\)
• \(i\) = the index
• \(n = 29\)
\(i = \dfrac{k}{100}(n + 1) = \left(\dfrac{70}{100}\right)(29 + 1) = 21\). Twenty-one is an integer, and the data value in the 21st position in the ordered data set is 64. The 70th percentile is 64 years.
• \(k = 83\)
• \(i\) = the index
• \(n = 29\)
\(i = \dfrac{k}{100}(n + 1) = \left(\dfrac{83}{100}\right)(29 + 1) = 24.9\), which is NOT an integer. Round it down to 24 and up to 25. The age in the 24th position is 71 and the age in the 25th position is 72. Average 71 and 72. The 83rd percentile is 71.5 years.
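The index formula and its two cases can be sketched in Python. The function name `kth_percentile` is made up for this illustration. The handling of positions that fall off either end of the list is an assumption added here, since the text does not cover that case, and k is assumed to be a whole number so the arithmetic stays exact.

```python
def kth_percentile(ordered, k):
    """kth percentile of an already ordered list, using i = (k/100)(n + 1).

    Assumptions: k is a whole number, and positions before the first or
    after the last value are clamped to the nearest end of the list.
    """
    n = len(ordered)
    # i is the whole part of k(n + 1)/100; remainder tells us if i is an integer.
    i, remainder = divmod(k * (n + 1), 100)
    if i < 1:
        return ordered[0]
    if i >= n:
        return ordered[-1]
    if remainder == 0:
        return ordered[i - 1]  # 1-based position i
    # Not an integer: average the values in positions i and i + 1.
    return (ordered[i - 1] + ordered[i]) / 2

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

print(kth_percentile(ages, 70))
print(kth_percentile(ages, 83))
```

Other percentile formulas exist, so library functions may not reproduce these values exactly.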

Exercise 2.4.5

Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.

18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77

Calculate the 20th percentile and the 55th percentile.

\(k = 20\). Index \(= i = \dfrac{k}{100}(n+1) = \dfrac{20}{100}(29 + 1) = 6\). The age in the sixth position is 27. The 20th percentile is 27 years.

\(k = 55\). Index \(= i = \dfrac{k}{100}(n+1) = \dfrac{55}{100}(29 + 1) = 16.5\). Round down to 16 and up to 17. The age in the 16th position is 52 and the age in the 17th position is 55. The average of 52 and 55 is 53.5. The 55th percentile is 53.5 years.

Note 2.4.2

You can calculate percentiles using calculators and computers. There are a variety of online calculators.

A Formula for Finding the Percentile of a Value in a Data Set

• Order the data from smallest to largest.
• \(x =\) the number of data values counting from the bottom of the data list up to but not including the data value for which you want to find the percentile.
• \(y =\) the number of data values equal to the data value for which you want to find the percentile.
• \(n =\) the total number of data.
• Calculate \(\dfrac{x + 0.5y}{n}(100)\). Then round to the nearest integer.

Example 2.4.6

Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.

18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77

1. Find the percentile for 58.
2. Find the percentile for 25.

Solution

1. Counting from the bottom of the list, there are 18 data values less than 58. There is one value of 58.

\(x = 18\) and \(y = 1\). \(\dfrac{x + 0.5y}{n}(100) = \dfrac{18 + 0.5(1)}{29}(100) = 63.79\). 58 is the 64th percentile.

2. Counting from the bottom of the list, there are three data values less than 25. There is one value of 25.

\(x = 3\) and \(y = 1\). \(\dfrac{x + 0.5y}{n}(100) = \dfrac{3 + 0.5(1)}{29}(100) = 12.07\). Twenty-five is the 12th percentile.
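The percentile-of-a-value formula can be sketched in Python as follows. The function name `percentile_of_value` is hypothetical. Note that Python's built-in `round` sends exact halves to the nearest even integer, which may differ from the rounding used in a hand calculation when the result ends in exactly .5.

```python
def percentile_of_value(values, target):
    """Return 100 * (x + 0.5y) / n for `target`, before rounding."""
    x = sum(1 for v in values if v < target)   # values below the target
    y = sum(1 for v in values if v == target)  # values equal to the target
    n = len(values)
    return 100 * (x + 0.5 * y) / n

ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

for age in (58, 25):
    raw = percentile_of_value(ages, age)
    print(age, raw, round(raw))
```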

Exercise 2.4.6

Listed are 30 ages for Academy Award winning best actors in order from smallest to largest.

18; 21; 22; 25; 26; 27; 29; 30; 31; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77

Find the percentiles for 47 and 31.

Percentile for 47: Counting from the bottom of the list, there are 15 data values less than 47. There is one value of 47.

\(x = 15\) and \(y = 1\). \(\dfrac{x + 0.5y}{n}(100) = \dfrac{15 + 0.5(1)}{30}(100) = 51.67\). 47 is the 52nd percentile.

Percentile for 31: Counting from the bottom of the list, there are eight data values less than 31. There are two values of 31.

\(x = 8\) and \(y = 2\). \(\dfrac{x + 0.5y}{n}(100) = \dfrac{8 + 0.5(2)}{30}(100) = 30\). 31 is the 30th percentile.

Interpreting Percentiles, Quartiles, and Median

A percentile indicates the relative standing of a data value when data are sorted into numerical order from smallest to largest. For the pth percentile, p percent of the data values are less than or equal to it. For example, 15% of data values are less than or equal to the 15th percentile.

• Low percentiles always correspond to lower data values.
• High percentiles always correspond to higher data values.

A percentile may or may not correspond to a value judgment about whether it is "good" or "bad." The interpretation of whether a certain percentile is "good" or "bad" depends on the context of the situation to which the data applies. In some situations, a low percentile would be considered "good"; in other contexts, a high percentile might be considered "good." In many situations, there is no value judgment that applies.

Understanding how to interpret percentiles properly is important not only when describing data, but also when calculating probabilities in later chapters of this text.

GUIDELINE

When writing the interpretation of a percentile in the context of the given data, the sentence should contain the following information.

• information about the context of the situation being considered
• the data value (value of the variable) that represents the percentile
• the percent of individuals or items with data values below the percentile
• the percent of individuals or items with data values above the percentile.

Example 2.4.7

On a timed math test, the first quartile for time it took to finish the exam was 35 minutes. Interpret the first quartile in the context of this situation.

• Twenty-five percent of students finished the exam in 35 minutes or less.
• Seventy-five percent of students finished the exam in 35 minutes or more.
• A low percentile could be considered good, as finishing more quickly on a timed exam is desirable. (If you take too long, you might not be able to finish.)

Exercise 2.4.7

For the 100-meter dash, the third quartile for times for finishing the race was 11.5 seconds. Interpret the third quartile in the context of the situation.

Twenty-five percent of runners finished the race in 11.5 seconds or more. Seventy-five percent of runners finished the race in 11.5 seconds or less. A lower percentile is good because finishing a race more quickly is desirable.

Example 2.4.8

On a 20 question math test, the 70th percentile for number of correct answers was 16. Interpret the 70th percentile in the context of this situation.

• Seventy percent of students answered 16 or fewer questions correctly.
• Thirty percent of students answered 16 or more questions correctly.
• A higher percentile could be considered good, as answering more questions correctly is desirable.

Exercise 2.4.8

On a 60 point written assignment, the 80th percentile for the number of points earned was 49. Interpret the 80th percentile in the context of this situation.

Eighty percent of students earned 49 points or fewer. Twenty percent of students earned 49 or more points. A higher percentile is good because getting more points on an assignment is desirable.

Example 2.4.9

At a community college, it was found that the 30th percentile of credit units that students are enrolled for is seven units. Interpret the 30th percentile in the context of this situation.

• Thirty percent of students are enrolled in seven or fewer credit units.
• Seventy percent of students are enrolled in seven or more credit units.
• In this example, there is no "good" or "bad" value judgment associated with a higher or lower percentile. Students attend community college for varied reasons and needs, and their course load varies according to their needs.

Exercise 2.4.9

During a season, the 40th percentile for points scored per player in a game is eight. Interpret the 40th percentile in the context of this situation.

Forty percent of players scored eight points or fewer. Sixty percent of players scored eight points or more. A higher percentile is good because getting more points in a basketball game is desirable.

Example 2.4.10

Sharpe Middle School is applying for a grant that will be used to add fitness equipment to the gym. The principal surveyed 15 anonymous students to determine how many minutes a day the students spend exercising. The results from the 15 anonymous students are shown.

0 minutes; 40 minutes; 60 minutes; 30 minutes; 60 minutes

10 minutes; 45 minutes; 30 minutes; 300 minutes; 90 minutes;

30 minutes; 120 minutes; 60 minutes; 0 minutes; 20 minutes

Determine the following five values.

• Min = 0
• Q1 = 20
• Med = 40
• Q3 = 60
• Max = 300

If you were the principal, would you be justified in purchasing new fitness equipment? Since 75% of the students exercise for 60 minutes or less daily, and since the IQR is 40 minutes (60 – 20 = 40), we know that half of the students surveyed exercise between 20 minutes and 60 minutes daily. This seems a reasonable amount of time spent exercising, so the principal would be justified in purchasing the new equipment.

However, the principal needs to be careful. The value 300 appears to be a potential outlier.

\[Q_{3} + 1.5(IQR) = 60 + (1.5)(40) = 120\]

The value 300 is greater than 120 so it is a potential outlier. If we delete it and calculate the five values, we get the following values:

• Min = 0
• Q1 = 20
• Med = 35
• Q3 = 60
• Max = 120

We still have 75% of the students exercising for 60 minutes or less daily and half of the students exercising between 20 and 60 minutes a day. However, 15 students is a small sample and the principal should survey more students to be sure of his survey results.

Chapter Review

The values that divide a rank-ordered set of data into 100 equal parts are called percentiles. Percentiles are used to compare and interpret data. For example, an observation at the 50th percentile would be greater than 50 percent of the other observations in the set. Quartiles divide data into quarters. The first quartile (Q1) is the 25th percentile, the second quartile (Q2 or median) is the 50th percentile, and the third quartile (Q3) is the 75th percentile. The interquartile range, or IQR, is the range of the middle 50 percent of the data values. The IQR is found by subtracting Q1 from Q3, and can help determine outliers by using the following two expressions.

• \(Q_{3} + IQR(1.5)\)
• \(Q_{1} - IQR(1.5)\)

Formula Review

\[i = \dfrac{k}{100}(n+1)\]

where \(i\) = the ranking or position of a data value,

\(k\) = the kth percentile,

\(n\) = total number of data.

Expression for finding the percentile of a data value: \(\left(\dfrac{x + 0.5y}{n}\right)(100)\)

where \(x =\) the number of values counting from the bottom of the data list up to but not including the data value for which you want to find the percentile,

\(y =\) the number of data values equal to the data value for which you want to find the percentile,

\(n =\) total number of data

Interquartile Range
or IQR, is the range of the middle 50 percent of the data values; the IQR is found by subtracting the first quartile from the third quartile.
Outlier
an observation that does not fit the rest of the data
Percentile
a number that divides ordered data into hundredths; percentiles may or may not be part of the data. The median of the data is the second quartile and the 50th percentile. The first and third quartiles are the 25th and the 75th percentiles, respectively.
Quartiles
the numbers that separate the data into quarters; quartiles may or may not be part of the data. The second quartile is the median of the data.

3.4: Measures of the Location of the Data

The mean is that value that is most commonly referred to as the average. We will use the term average as a synonym for the mean and the term typical value to refer generically to measures of location.

This plot shows histograms for 10,000 random numbers generated from a normal, an exponential, a Cauchy, and a lognormal distribution.

Normal Distribution

The first histogram is a sample from a normal distribution. The mean is 0.005, the median is -0.010, and the mode is -0.144 (the mode is computed as the midpoint of the histogram interval with the highest peak).

The normal distribution is a symmetric distribution with well-behaved tails and a single peak at the center of the distribution. By symmetric, we mean that the distribution can be folded about an axis so that the 2 sides coincide. That is, it behaves the same to the left and right of some center point. For a normal distribution, the mean, median, and mode are actually equivalent. The histogram above generates similar estimates for the mean, median, and mode. Therefore, if a histogram or normal probability plot indicates that your data are approximated well by a normal distribution, then it is reasonable to use the mean as the location estimator.

Exponential Distribution

The second histogram is a sample from an exponential distribution. The mean is 1.001, the median is 0.684, and the mode is 0.254 (the mode is computed as the midpoint of the histogram interval with the highest peak).

The exponential distribution is a skewed (i.e., not symmetric) distribution. For skewed distributions, the mean and median are not the same. The mean will be pulled in the direction of the skewness. That is, if the right tail is heavier than the left tail, the mean will be greater than the median. Likewise, if the left tail is heavier than the right tail, the mean will be less than the median.

For skewed distributions, it is not at all obvious whether the mean, the median, or the mode is the more meaningful measure of the typical value. In this case, all three measures are useful.

Cauchy Distribution

The third histogram is a sample from a Cauchy distribution. The mean is 3.70, the median is -0.016, and the mode is -0.362 (the mode is computed as the midpoint of the histogram interval with the highest peak).

For better visual comparison with the other data sets, we restricted the histogram of the Cauchy distribution to values between -10 and 10. The full Cauchy data set in fact has a minimum of approximately -29,000 and a maximum of approximately 89,000.

The Cauchy distribution is a symmetric distribution with heavy tails and a single peak at the center of the distribution. The Cauchy distribution has the interesting property that collecting more data does not provide a more accurate estimate of the mean. That is, the sampling distribution of the mean is equivalent to the sampling distribution of the original data. This means that for the Cauchy distribution the mean is useless as a measure of the typical value. For this histogram, the mean of 3.7 is well above the vast majority of the data. This is caused by a few very extreme values in the tail. However, the median does provide a useful measure for the typical value.
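The contrast between the mean and the median can be explored with a small simulation. This is only a sketch using Python's standard library: the seed and sample size are arbitrary choices made here, the samples are not the ones behind the histograms described in the text, and the Cauchy draws are produced by the inverse-CDF formula tan(π(u − 1/2)) with u uniform on [0, 1).

```python
import math
import random
from statistics import mean, median

rng = random.Random(0)  # fixed seed (arbitrary) so repeated runs agree
n = 10_000

# Right-skewed sample: exponential with rate 1.
exponential = [rng.expovariate(1.0) for _ in range(n)]

# Heavy-tailed symmetric sample: standard Cauchy via the inverse CDF.
cauchy = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

for name, sample in (("exponential", exponential), ("Cauchy", cauchy)):
    print(name, "mean:", mean(sample), "median:", median(sample),
          "min:", min(sample), "max:", max(sample))
```

Changing the seed should leave the Cauchy median close to zero while the Cauchy mean can move substantially, since a handful of extreme draws dominate it.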

Although the Cauchy distribution is an extreme case, it does illustrate the importance of heavy tails in measuring the mean. Extreme values in the tails distort the mean. However, these extreme values do not distort the median since the median is based on ranks. In general, for data with extreme values in the tails, the median provides a better estimate of location than does the mean.

Lognormal Distribution

The fourth histogram is a sample from a lognormal distribution. The mean is 1.677, the median is 0.989, and the mode is 0.680 (the mode is computed as the midpoint of the histogram interval with the highest peak).

The lognormal is also a skewed distribution. Therefore the mean and median do not provide similar estimates for the location. As with the exponential distribution, there is no obvious answer to the question of which is the more meaningful measure of location.

Robustness

There are various alternatives to the mean and median for measuring location. These alternatives were developed to address non-normal data since the mean is an optimal estimator if in fact your data are normal.

Robustness of validity means that the confidence intervals for the population location have a 95% chance of covering the population location regardless of what the underlying distribution is.

The median is an example of an estimator that tends to have robustness of validity but not robustness of efficiency.

How to Determine Measures of Position (Percentiles and Quartiles)

Although you may not often use measures such as percentiles and quartiles, these values are used to describe data in some situations, and knowing how to interpret them is beneficial.

• Determine the range of a data set

• Know how to interpret and determine measures of position (percentiles and quartiles)

While measures of central tendency, dispersion, and skewness are used often in statistics, there are other methods of characterizing or describing data distributions or portions that are commonly used as well. We will examine several of these statistical measures, some of which you may already know or have seen elsewhere.

The range of a data set is simply the difference between the maximum and minimum values of the set. (This measure is typically considered a measure of dispersion, since it is a simple description of how far the data extends.) Thus, if a data set such as \(\langle x_1, x_2, x_3, \ldots, x_N \rangle\) is provided in increasing order so that \(x_i < x_{i+1}\), then the range of the data set is simply \(x_N - x_1\). If the data set is not ordered, then you must simply determine by inspection the maximum and minimum values.

Practice Problem: Find the range of the following data set.

Solution: We can find the range either by simply looking for the maximum and minimum values or by arranging the set in increasing order and then subtracting the first element from the last. Although the latter approach is a bit more time consuming, it can be beneficial in cases where you need to perform other calculations. So, let's order the data set for the sake of completeness.

The range is then 15 – 1 = 14.

Quartiles and Percentiles

Virtually anyone who has taken a standardized test at one time or another is familiar with the term percentile. Although percentiles seem dangerously similar to percentages (that is, "percent correct," referring to the number of questions answered correctly divided by the total number of questions, all multiplied by 100%), they are actually different. A similar measurement is the quartile, which we will also discuss. Both percentiles and quartiles are statistical measures of position; that is, they do not measure a central tendency or a spread (dispersion), but instead measure location in a data set. (The exact definitions of a percentile and a quartile differ from source to source; these differences, however, tend to be minor and are focused on certain fine points. Also, these differences tend to disappear when the number of data values in the set is large.)

Let's consider a number p, where p is a whole number between 0 and 100. Assume that the number p describes the percentage of values less than or equal to some data value Np. Consequently, 100 – p is the percentage of values greater than Np. This number Np is the pth percentile. Thus, to say that some data value x is the 75th percentile is to say that 75% of all the values in the data set are less than or equal to x, and that 25% of the data values are greater than x. Note that the percentile of a data value can also be understood as 100 times the cumulative relative frequency of that value. (Recall that the cumulative relative frequency of a value x is the relative frequency of all values less than or equal to x.) So, a student who gets a test score in the 90th percentile, for instance, hasn't (necessarily) scored 90/100 correct; he simply has a score that is at least as good as 90% of the other students. Although such a description isn't necessarily very satisfying for the student (who is probably more interested in finding out his percentage of correct answers), it is statistically helpful in certain situations. Typically, the 0th and 100th percentiles are not discussed, because these values are simply the minimum and maximum (respectively) of the data set.

Practice Problem: For the data set below, which value is in the 75th percentile?

Solution: We want to find the data value Np for which 75% of the data set is less than or equal to Np. Note that there are a total of 16 values in the set; thus, 75% of the data set is 12 values. Because the data set is ordered, we need simply find the 12th data value; then, 75% (12 out of 16 values) of the data set will be less than or equal to this value. The number 10 is the 75th percentile: 75% of the values in the set are less than or equal to 10.

Practice Problem: Which of the following data values is the 50th percentile?

Solution: The 50th percentile is that value N for which 50% of the values in the set are less than or equal to N. To help us find this value, let's first order the data set.

The data set has 10 values; thus, the 50th percentile is the fifth data value, 5.52. Exactly half (50%) of the data values are less than or equal to 5.52, and the remaining half are greater than 5.52.

Another measure of position is the quartile, which is similar to the percentile except that it divides data into quarters (segments of 25% each) instead of hundredths. Thus, the nth quartile is the value x for which (25n)% of the values are less than or equal to x. Three quartiles are defined: Q1, Q2, and Q3. The quartile Q1 corresponds to the 25th percentile, Q2 to the 50th percentile, and Q3 to the 75th percentile.

Q2 and the 50th percentile are sometimes said to correspond to the median of a data set. Given our definition of a median, this is true when there is an odd number of data values; it is not strictly true for an even number of data values (see the practice problem above), where the median, according to our definition, would actually be the mean of 5.52 and 5.97. We could, however, say that this median value (approximately 5.75) is the 50th percentile for the data set: technically, half the values in the data set are below this value, and half are above. Thus, we can still maintain our definition of the median if we appropriately define percentiles and quartiles. In addition, we can also note that Q1 is the median of the first half of the values, and Q3 is the median of the second half of the values. (Our above considerations on the definition of the median apply here as well.)

Practice Problem: What is Q3 for the following data set?

Solution: Q3 is the value x for which 75% (three out of four) of the data values are at most x. Since there are eight members in the data set, the sixth value is Q3, which is 75. This value is also the 75th percentile.

Measures of the Location of the Data

Quartiles are special percentiles. The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median, M, is called both the second quartile and the 50th percentile.

To calculate quartiles and percentiles, the data must be ordered from smallest to largest. Quartiles divide ordered data into quarters. Percentiles divide ordered data into hundredths. To score in the 90th percentile of an exam does not mean, necessarily, that you received 90% on a test. It means that 90% of test scores are the same or less than your score and 10% of the test scores are the same or greater than your test score.

Percentiles are useful for comparing values. For this reason, universities and colleges use percentiles extensively. One instance in which colleges and universities use percentiles is when SAT results are used to determine a minimum testing score that will be used as an acceptance factor. For example, suppose Duke accepts SAT scores at or above the 75th percentile. That translates into a score of at least 1220.

Percentiles are mostly used with very large populations. Therefore, if you were to say that 90% of the test scores are less (and not the same or less) than your score, it would be acceptable because removing one particular data value is not significant.

The median is a number that measures the “center” of the data. You can think of the median as the “middle value,” but it does not actually have to be one of the observed values. It is a number that separates ordered data into halves. Half the values are the same number or smaller than the median, and half the values are the same number or larger. For example, consider the following data.
1 11.5 6 7.2 4 8 9 10 6.8 8.3 2 2 10 1
Ordered from smallest to largest:
1 1 2 2 4 6 6.8 7.2 8 8.3 9 10 10 11.5

Since there are 14 observations, the median is between the seventh value, 6.8, and the eighth value, 7.2. To find the median, add the two values together and divide by two.

(6.8 + 7.2)/2 = 7

The median is seven. Half of the values are smaller than seven and half of the values are larger than seven.

Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the data. To find the quartiles, first find the median or second quartile. The first quartile, Q1, is the middle value of the lower half of the data, and the third quartile, Q3, is the middle value, or median, of the upper half of the data. To get the idea, consider the same data set:
1 1 2 2 4 6 6.8 7.2 8 8.3 9 10 10 11.5

The median or second quartile is seven. The lower half of the data are 1, 1, 2, 2, 4, 6, 6.8. The middle value of the lower half is two.
1 1 2 2 4 6 6.8

The number two, which is part of the data, is the first quartile. One-fourth of the entire set of values are the same as or less than two and three-fourths of the values are more than two.

The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is nine.

The third quartile, Q3, is nine. Three-fourths (75%) of the ordered data set are less than nine. One-fourth (25%) of the ordered data set are greater than nine. The third quartile is part of the data set in this example.
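The median-of-halves procedure described above can be sketched in Python as follows. This is a minimal sketch of that one convention (when the number of values is odd, the median itself is left out of both halves); the function names are illustrative, and other software may use a different quartile rule and give slightly different answers.

```python
def median(ordered):
    """Median of a list that is already sorted from smallest to largest."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]        # values below the median position
    upper_half = ordered[(n + 1) // 2:]   # values above the median position
    return median(lower_half), median(ordered), median(upper_half)


data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]
q1, q2, q3 = quartiles(data)
print("Q1 =", q1, "median =", q2, "Q3 =", q3)
```

For the 14 values in the example, this should reproduce the quartiles found by hand: Q1 = 2, median = 7, and Q3 = 9.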

The interquartile range is a number that indicates the spread of the middle half or the middle 50% of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1).

IQR = Q3 – Q1

The IQR can help to determine potential outliers. A value is suspected to be a potential outlier if it is less than Q1 – (1.5)(IQR), that is, more than (1.5)(IQR) below the first quartile, or greater than Q3 + (1.5)(IQR), that is, more than (1.5)(IQR) above the third quartile. Potential outliers always require further investigation.

A potential outlier is a data point that is significantly different from the other data points. These special data points may be errors or some kind of abnormality or they may be a key to understanding the data.

For the following 13 real estate prices, calculate the IQR and determine if any prices are potential outliers. Prices are in dollars.
389,950 230,500 158,000 479,000 639,000 114,950 5,500,000 387,000 659,000 529,000 575,000 488,800 1,095,000

Order the data from smallest to largest.
114,950 158,000 230,500 387,000 389,950 479,000 488,800 529,000 575,000 639,000 659,000 1,095,000 5,500,000

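M = 488,800

Q1 = (230,500 + 387,000)/2 = 308,750

Q3 = (639,000 + 659,000)/2 = 649,000

IQR = 649,000 – 308,750 = 340,250

(1.5)(IQR) = (1.5)(340,250) = 510,375

Q1 – (1.5)(IQR) = 308,750 – 510,375 = –201,625

Q3 + (1.5)(IQR) = 649,000 + 510,375 = 1,159,375

No house price is less than –201,625. However, 5,500,000 is more than 1,159,375. Therefore, 5,500,000 is a potential outlier.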
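The same check can be sketched in Python. The sketch assumes the median-of-halves quartile convention used in this section; the function names median and iqr_fences are illustrative and are not part of the text.

```python
def median(ordered):
    """Median of a list that is already sorted from smallest to largest."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def iqr_fences(values):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])        # median of the lower half
    q3 = median(ordered[(n + 1) // 2:])   # median of the upper half
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]
low, high = iqr_fences(prices)

# Any price outside the two fences is flagged as a potential outlier.
potential_outliers = [p for p in prices if p < low or p > high]
print(low, high, potential_outliers)
```

Only the 5,500,000 price falls outside the fences, which agrees with the hand calculation. A flagged value is a candidate for further investigation, not automatically an error.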

For the following 11 salaries, calculate the IQR and determine if any salaries are outliers. The salaries are in dollars.

$33,000 $64,500 $28,000 $54,000 $72,000 $68,500 $69,000 $42,000 $54,000 $120,000 $40,500

For the two data sets in the test scores example, find the following:

1. The interquartile range. Compare the two interquartile ranges.
2. Any outliers in either set.

The five number summary for the day and night classes is

The IQR for the day group is Q3 – Q1 = 82.5 – 56 = 26.5

The IQR for the night group is Q3 – Q1 = 89 – 78 = 11

The interquartile range (the spread or variability) for the day class is larger than the night class IQR. This suggests more variation will be found in the day class’s test scores.

Day class outliers are found using the IQR times 1.5 rule:

Q1 – (1.5)(IQR) = 56 – (1.5)(26.5) = 16.25
Q3 + (1.5)(IQR) = 82.5 + (1.5)(26.5) = 122.25

Since the minimum and maximum values for the day class are greater than 16.25 and less than 122.25, there are no outliers.

Night class outliers are calculated as:

Q1 – (1.5)(IQR) = 78 – (1.5)(11) = 61.5
Q3 + (1.5)(IQR) = 89 + (1.5)(11) = 105.5

For this class, any test score less than 61.5 is an outlier. Therefore, the scores of 45 and 25.5 are outliers. Since no test score is greater than 105.5, there is no upper end outlier.

Find the interquartile range for the following two data sets and compare them.

Test Scores for Class A
69 96 81 79 65 76 83 99 89 67 90 77 85 98 66 91 77 69 80 94
Test Scores for Class B
90 72 80 92 90 97 92 75 79 68 70 80 99 95 78 73 71 68 95 100

Fifty statistics students were asked how much sleep they get per school night (rounded to the nearest hour). The results were:

AMOUNT OF SLEEP PER SCHOOL NIGHT (HOURS) FREQUENCY RELATIVE FREQUENCY CUMULATIVE RELATIVE FREQUENCY
4 2 0.04 0.04
5 5 0.10 0.14
6 7 0.14 0.28
7 12 0.24 0.52
8 14 0.28 0.80
9 7 0.14 0.94
10 3 0.06 1.00

Find the 28th percentile. Notice the 0.28 in the “cumulative relative frequency” column. Twenty-eight percent of 50 data values is 14 values. There are 14 values less than the 28th percentile. They include the two 4s, the five 5s, and the seven 6s. The 28th percentile is between the last six and the first seven. The 28th percentile is 6.5.

Find the median. Look again at the “cumulative relative frequency” column and find 0.52. The median is the 50th percentile or the second quartile. 50% of 50 is 25. There are 25 values less than the median. They include the two 4s, the five 5s, the seven 6s, and eleven of the 7s. The median or 50th percentile is between the 25th value, seven, and the 26th value, seven. The median is seven.

Find the third quartile. The third quartile is the same as the 75th percentile. You can “eyeball” this answer. If you look at the “cumulative relative frequency” column, you find 0.52 and 0.80. When you have all the fours, fives, sixes and sevens, you have 52% of the data. When you include all the 8s, you have 80% of the data. The 75th percentile, then, must be an eight. Another way to look at the problem is to find 75% of 50, which is 37.5, and round up to 38. The third quartile, Q3, is the 38th value, which is an eight. You can check this answer by counting the values. (There are 37 values below the third quartile and 12 values above.)
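The counting used in these three answers can be sketched in Python. The sketch assumes the rule that the worked answers appear to follow: compute the location as p% of n; if it is a whole number, average that value and the next one; otherwise round the location up. The function names and the dictionary layout of the table are illustrative choices, not part of the text.

```python
def expand(freq_table):
    """Turn {value: frequency} into one sorted list of individual values."""
    values = []
    for value, count in sorted(freq_table.items()):
        values.extend([value] * count)
    return values


def percentile_from_table(freq_table, p):
    """p-th percentile (p a whole number, 0 < p < 100) of a frequency table."""
    ordered = expand(freq_table)
    n = len(ordered)
    # Location = (p/100) * n, kept in whole-number arithmetic to stay exact.
    position, remainder = divmod(p * n, 100)
    if remainder == 0:
        # Whole-number location: average that value and the next one.
        return (ordered[position - 1] + ordered[position]) / 2
    # Fractional location: round up to the next position (1-based).
    return ordered[position]


# Hours of sleep -> number of students, from the table above.
sleep = {4: 2, 5: 5, 6: 7, 7: 12, 8: 14, 9: 7, 10: 3}

print(percentile_from_table(sleep, 28))  # 28th percentile
print(percentile_from_table(sleep, 50))  # median
print(percentile_from_table(sleep, 75))  # third quartile
```

For the sleep data this should give 6.5, 7, and 8, the same values found by reading the cumulative relative frequency column.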

Forty bus drivers were asked how many hours they spend each day running their routes (rounded to the nearest hour). Find the 65th percentile.

Amount of time spent on route (hours) Frequency Relative Frequency Cumulative Relative Frequency
2 12 0.30 0.30
3 14 0.35 0.65
4 10 0.25 0.90
5 4 0.10 1.00

1. Find the 80th percentile.
2. Find the 90th percentile.
3. Find the first quartile. What is another name for the first quartile?

Using the data from the frequency table of sleep amounts (n = 50), we have:

1. The 80th percentile is between the last eight and the first nine in the table (between the 40th and 41st values). Therefore, we need to take the mean of the 40th and 41st values. The 80th percentile = (8 + 9)/2 = 8.5
2. The 90th percentile will be the 45th data value (location is 0.90(50) = 45) and the 45th data value is nine.
3. Q1 is also the 25th percentile. The 25th percentile location calculation: P25 = 0.25(50) = 12.5 ≈ 13, the 13th data value. Thus, the 25th percentile is six.

Refer to the (Figure). Find the third quartile. What is another name for the third quartile?

Your instructor or a member of the class will ask everyone in class how many sweaters they own. Answer the following questions:

1. How many students were surveyed?
2. What kind of sampling did you do?
3. Construct two different histograms. For each, starting value = _____ ending value = ____.
4. Find the median, first quartile, and third quartile.
5. Construct a table of the data to find the following:
1. the 10th percentile
2. the 70th percentile
3. the percent of students who own fewer than four sweaters

A Formula for Finding the kth Percentile

If you were to do a little research, you would find several formulas for calculating the kth percentile. Here is one of them.

k = the kth percentile. It may or may not be part of the data.

i = the index (ranking or position of a data value)

n = the total number of data

• Order the data from smallest to largest.
• Calculate i = (k/100)(n + 1).
• If i is an integer, then the kth percentile is the data value in the ith position in the ordered set of data.
• If i is not an integer, then round i up and round i down to the nearest integers. Average the two data values in these two positions in the ordered data set. This is easier to understand in an example.

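Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.
18 21 22 25 26 27 29 30 31 33 36 37 41 42 47 52 55 57 58 62 64 67 69 71 72 73 74 76 77

1. Find the 70th percentile.
2. Find the 83rd percentile.

1. k = 70, n = 29, and i = (70/100)(29 + 1) = 21. Twenty-one is an integer, and the data value in the 21st position in the ordered data set is 64. The 70th percentile is 64 years.
2. k = 83, n = 29, and i = (83/100)(29 + 1) = 24.9, which is NOT an integer. Round it down to 24 and up to 25. The age in the 24th position is 71 and the age in the 25th position is 72. Average 71 and 72. The 83rd percentile is 71.5 years.

Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.
18 21 22 25 26 27 29 30 31 33 36 37 41 42 47 52 55 57 58 62 64 67 69 71 72 73 74 76 77
Calculate the 20th percentile and the 55th percentile.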

You can calculate percentiles using calculators and computers. There are a variety of online calculators.
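As one illustration of doing this by computer, the formula i = (k/100)(n + 1) can be sketched in Python. This is a minimal sketch of that one formula only, assuming k is a whole number; the function name kth_percentile is illustrative, and the error raised for positions that fall outside the list is a choice made here, not something the text specifies. Statistical packages generally offer several percentile conventions, so their results may differ slightly.

```python
def kth_percentile(ordered, k):
    """k-th percentile of an ordered list using i = (k/100)(n + 1)."""
    n = len(ordered)
    # Split i into its whole part and the leftover hundredths so that the
    # "is i an integer?" test is exact (no floating-point rounding).
    position, remainder = divmod(k * (n + 1), 100)
    needs_next = remainder != 0
    if position < 1 or position + (1 if needs_next else 0) > n:
        raise ValueError("position falls outside the data for this k and n")
    if not needs_next:
        # i is an integer: take the value in the i-th position (1-based).
        return ordered[position - 1]
    # i is not an integer: average the values on either side of it.
    return (ordered[position - 1] + ordered[position]) / 2


ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

print(kth_percentile(ages, 70))  # i = 21, an integer position
print(kth_percentile(ages, 83))  # i = 24.9, so positions 24 and 25 are averaged
```

For the 29 ages, k = 70 gives i = 21 and the value 64, while k = 83 gives i = 24.9 and the average of 71 and 72, which is 71.5.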

A Formula for Finding the Percentile of a Value in a Data Set

• Order the data from smallest to largest.
• x = the number of data values counting from the bottom of the data list up to but not including the data value for which you want to find the percentile.
• y = the number of data values equal to the data value for which you want to find the percentile.
• n = the total number of data.
• Calculate ((x + 0.5y)/n)(100). Then round to the nearest integer.

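Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.
18 21 22 25 26 27 29 30 31 33 36 37 41 42 47 52 55 57 58 62 64 67 69 71 72 73 74 76 77

1. Find the percentile for 58.
2. Find the percentile for 25.

1. Counting from the bottom of the list, there are 18 data values less than 58. There is one value of 58. x = 18 and y = 1. ((x + 0.5y)/n)(100) = ((18 + 0.5(1))/29)(100) = 63.79. 58 is the 64th percentile.
2. Counting from the bottom of the list, there are three data values less than 25. There is one value of 25. x = 3 and y = 1. ((x + 0.5y)/n)(100) = ((3 + 0.5(1))/29)(100) = 12.07. Twenty-five is the 12th percentile.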
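The expression ((x + 0.5y)/n)(100) can be sketched in Python as follows. The function name percentile_of_value is illustrative, and rounding exact halves upward is an assumption made here, since the text only says to round to the nearest integer.

```python
import math


def percentile_of_value(ordered, value):
    """Percentile of `value` in an ordered list via ((x + 0.5y)/n)(100)."""
    x = sum(1 for v in ordered if v < value)   # values strictly below
    y = sum(1 for v in ordered if v == value)  # values equal to it
    n = len(ordered)
    raw = (x + 0.5 * y) / n * 100
    # Round to the nearest integer; exact halves are rounded up (assumption).
    return math.floor(raw + 0.5)


ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

print(percentile_of_value(ages, 58))
print(percentile_of_value(ages, 25))
```

For the 29 ages this should place 58 at the 64th percentile and 25 at the 12th percentile, as in the worked answers.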

Listed are 30 ages for Academy Award winning best actors in order from smallest to largest.

18 21 22 25 26 27 29 30 31 31 33 36 37 41 42 47 52 55 57 58 62 64 67 69 71 72 73 74 76 77
Find the percentiles for 47 and 31.

Interpreting Percentiles, Quartiles, and Median

A percentile indicates the relative standing of a data value when data are sorted into numerical order from smallest to largest. Percentages of data values are less than or equal to the pth percentile. For example, 15% of data values are less than or equal to the 15th percentile.

• Low percentiles always correspond to lower data values.
• High percentiles always correspond to higher data values.

A percentile may or may not correspond to a value judgment about whether it is “good” or “bad.” The interpretation of whether a certain percentile is “good” or “bad” depends on the context of the situation to which the data applies. In some situations, a low percentile would be considered “good”; in other contexts a high percentile might be considered “good.” In many situations, there is no value judgment that applies.

Understanding how to interpret percentiles properly is important not only when describing data, but also when calculating probabilities in later chapters of this text.

When writing the interpretation of a percentile in the context of the given data, the sentence should contain the following information.

• information about the context of the situation being considered
• the data value (value of the variable) that represents the percentile
• the percent of individuals or items with data values below the percentile
• the percent of individuals or items with data values above the percentile.

On a timed math test, the first quartile for time it took to finish the exam was 35 minutes. Interpret the first quartile in the context of this situation.

• Twenty-five percent of students finished the exam in 35 minutes or less.
• Seventy-five percent of students finished the exam in 35 minutes or more.
• A low percentile could be considered good, as finishing more quickly on a timed exam is desirable. (If you take too long, you might not be able to finish.)

For the 100-meter dash, the third quartile for times for finishing the race was 11.5 seconds. Interpret the third quartile in the context of the situation.

On a 20 question math test, the 70th percentile for number of correct answers was 16. Interpret the 70th percentile in the context of this situation.

On a 60 point written assignment, the 80th percentile for the number of points earned was 49. Interpret the 80th percentile in the context of this situation.

At a community college, it was found that the 30th percentile of credit units that students are enrolled for is seven units. Interpret the 30th percentile in the context of this situation.

During a season, the 40th percentile for points scored per player in a game is eight. Interpret the 40th percentile in the context of this situation.

Sharpe Middle School is applying for a grant that will be used to add fitness equipment to the gym. The principal surveyed 15 anonymous students to determine how many minutes a day the students spend exercising. The results from the 15 anonymous students are shown.

0 minutes 40 minutes 60 minutes 30 minutes 60 minutes

10 minutes 45 minutes 30 minutes 300 minutes 90 minutes

30 minutes 120 minutes 60 minutes 0 minutes 20 minutes

Determine the following five values.

If you were the principal, would you be justified in purchasing new fitness equipment? Since 75% of the students exercise for 60 minutes or less daily, and since the IQR is 40 minutes (60 – 20 = 40), we know that half of the students surveyed exercise between 20 minutes and 60 minutes daily. This seems a reasonable amount of time spent exercising, so the principal would be justified in purchasing the new equipment.

However, the principal needs to be careful. The value 300 appears to be a potential outlier.

Q3 + 1.5(IQR) = 60 + (1.5)(40) = 120

The value 300 is greater than 120, so it is a potential outlier. If we delete it and calculate the five values, we get the following values:

We still have 75% of the students exercising for 60 minutes or less daily and half of the students exercising between 20 and 60 minutes a day. However, 15 students is a small sample and the principal should survey more students to be sure of his survey results.
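The five values and the outlier check for this survey can be recomputed with a short Python sketch. It assumes the median-of-halves quartile convention used earlier in this section; the function names are illustrative, and the results are computed from the 15 survey responses listed above rather than quoted from the text.

```python
def median(ordered):
    """Median of a list that is already sorted from smallest to largest."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) by median-of-halves."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[(n + 1) // 2:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]


minutes = [0, 40, 60, 30, 60, 10, 45, 30, 300, 90, 30, 120, 60, 0, 20]

summary = five_number_summary(minutes)
q1, q3 = summary[1], summary[3]
upper_fence = q3 + 1.5 * (q3 - q1)

# Recompute the five values with any value above the upper fence removed.
trimmed = five_number_summary([m for m in minutes if m <= upper_fence])
print(summary, upper_fence, trimmed)
```

Under these assumptions the full data give Q1 = 20 and Q3 = 60 with an upper fence of 120, and removing the value 300 leaves Q1 and Q3 unchanged, consistent with the discussion above.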


Chapter Review

The values that divide a rank-ordered set of data into 100 equal parts are called percentiles. Percentiles are used to compare and interpret data. For example, an observation at the 50th percentile would be greater than 50 percent of the other observations in the set. Quartiles divide data into quarters. The first quartile (Q1) is the 25th percentile, the second quartile (Q2 or median) is the 50th percentile, and the third quartile (Q3) is the 75th percentile. The interquartile range, or IQR, is the range of the middle 50 percent of the data values. The IQR is found by subtracting Q1 from Q3, and can help determine outliers by using the following two expressions.

• Q3 + (1.5)(IQR)
• Q1 – (1.5)(IQR)

Formula Review

i = (k/100)(n + 1)

where i = the ranking or position of a data value,

k = the kth percentile,

n = total number of data.

Expression for finding the percentile of a data value: ((x + 0.5y)/n)(100)

where x = the number of values counting from the bottom of the data list up to but not including the data value for which you want to find the percentile,

y = the number of data values equal to the data value for which you want to find the percentile,

n = total number of data.

Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.

18 21 22 25 26 27 29 30 31 33 36 37 41 42 47 52 55 57 58 62 64 67 69 71 72 73 74 76 77

Listed are 32 ages for Academy Award winning best actors in order from smallest to largest.

18 18 21 22 25 26 27 29 30 31 31 33 36 37 37 41 42 47 52 55 57 58 62 64 67 69 71 72 73 74 76 77

Jesse was ranked 37th in his graduating class of 180 students. At what percentile is Jesse’s ranking?

Jesse graduated 37th out of a class of 180 students. There are 180 – 37 = 143 students ranked below Jesse. There is one rank of 37.

x = 143 and y = 1. ((x + 0.5y)/n)(100) = ((143 + 0.5(1))/180)(100) = 79.72. Jesse’s rank of 37 puts him at the 80th percentile.

1. For runners in a race, a low time means a faster run. The winners in a race have the shortest running times. Is it more desirable to have a finish time with a high or a low percentile when running a race?
2. The 20th percentile of run times in a particular race is 5.2 minutes. Write a sentence interpreting the 20th percentile in the context of the situation.
3. A bicyclist in the 90th percentile of a bicycle race completed the race in 1 hour and 12 minutes. Is he among the fastest or slowest cyclists in the race? Write a sentence interpreting the 90th percentile in the context of the situation.

1. For runners in a race, a higher speed means a faster run. Is it more desirable to have a speed with a high or a low percentile when running a race?
2. The 40th percentile of speeds in a particular race is 7.5 miles per hour. Write a sentence interpreting the 40th percentile in the context of the situation.

1. For runners in a race it is more desirable to have a high percentile for speed. A high percentile means a higher speed, which is faster.
2. 40% of runners ran at speeds of 7.5 miles per hour or less (slower). 60% of runners ran at speeds of 7.5 miles per hour or more (faster).

On an exam, would it be more desirable to earn a grade with a high or low percentile? Explain.

Mina is waiting in line at the Department of Motor Vehicles (DMV). Her wait time of 32 minutes is the 85th percentile of wait times. Is that good or bad? Write a sentence interpreting the 85th percentile in the context of this situation.

When waiting in line at the DMV, the 85th percentile would be a long wait time compared to the other people waiting. 85% of people had shorter wait times than Mina. In this context, Mina would prefer a wait time corresponding to a lower percentile. 85% of people at the DMV waited 32 minutes or less. 15% of people at the DMV waited 32 minutes or longer.

In a survey collecting data about the salaries earned by recent college graduates, Li found that her salary was in the 78th percentile. Should Li be pleased or upset by this result? Explain.

In a study collecting data about the repair costs of damage to automobiles in a certain type of crash test, a certain model of car had $1,700 in damage and was in the 90th percentile. Should the manufacturer and the consumer be pleased or upset by this result? Explain and write a sentence that interprets the 90th percentile in the context of this problem.

The manufacturer and the consumer would be upset. This is a large repair cost for the damages, compared to the other cars in the sample. INTERPRETATION: 90% of the crash tested cars had damage repair costs of $1,700 or less; only 10% had damage repair costs of $1,700 or more.

The University of California has two criteria used to set admission standards for freshmen to be admitted to a college in the UC system:

1. Students’ GPAs and scores on standardized tests (SATs and ACTs) are entered into a formula that calculates an “admissions index” score. The admissions index score is used to set eligibility standards intended to meet the goal of admitting the top 12% of high school students in the state. In this context, what percentile does the top 12% represent?
2. Students whose GPAs are at or above the 96th percentile of all students at their high school are eligible (called eligible in the local context), even if they are not in the top 12% of all students in the state. What percentage of students from each high school are “eligible in the local context”?

Suppose that you are buying a house. You and your realtor have determined that the most expensive house you can afford is the 34th percentile. The 34th percentile of housing prices is $240,000 in the town you want to move to. In this town, can you afford 34% of the houses or 66% of the houses?

You can afford 34% of houses. 66% of the houses are too expensive for your budget. INTERPRETATION: 34% of houses cost $240,000 or less. 66% of houses cost $240,000 or more.

Use the following information to answer the next six exercises. Sixty-five randomly selected car salespersons were asked the number of cars they generally sell in one week. Fourteen people answered that they generally sell three cars; nineteen generally sell four cars; twelve generally sell five cars; nine generally sell six cars; eleven generally sell seven cars.

Measures of the Location of the Data

Quartiles are special percentiles. The first quartile, Q1, is the same as the 25 th percentile, and the third quartile, Q3, is the same as the 75 th percentile. The median, M, is called both the second quartile and the 50 th percentile.

To calculate quartiles and percentiles, the data must be ordered from smallest to largest. Quartiles divide ordered data into quarters. Percentiles divide ordered data into hundredths. To score in the 90 th percentile of an exam does not mean, necessarily, that you received 90% on a test. It means that 90% of test scores are the same or less than your score and 10% of the test scores are the same or greater than your test score.

Percentiles are useful for comparing values. For this reason, universities and colleges use percentiles extensively. One instance in which colleges and universities use percentiles is when SAT results are used to determine a minimum testing score that will be used as an acceptance factor. For example, suppose Duke accepts SAT scores at or above the 75 th percentile. That translates into a score of at least 1220.

Percentiles are mostly used with very large populations. Therefore, if you were to say that 90% of the test scores are less (and not the same or less) than your score, it would be acceptable because removing one particular data value is not significant.

The median is a number that measures the "center" of the data. You can think of the median as the "middle value," but it does not actually have to be one of the observed values. It is a number that separates ordered data into halves. Half the values are the same number or smaller than the median, and half the values are the same number or larger. For example, consider the following data. * * *

1 11.5 6 7.2 4 8 9 10 6.8 8.3 2 2 10 1 * * *

Ordered from smallest to largest: * * *

1 1 2 2 4 6 6.8 7.2 8 8.3 9 10 10 11.5

Since there are 14 observations, the median is between the seventh value, 6.8, and the eighth value, 7.2. To find the median, add the two values together and divide by two.

The median is seven. Half of the values are smaller than seven and half of the values are larger than seven.

Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the data. To find the quartiles, first find the median or second quartile. The first quartile, Q1, is the middle value of the lower half of the data, and the third quartile, Q3, is the middle value, or median, of the upper half of the data. To get the idea, consider the same data set: * * *

1 1 2 2 4 6 6.8 7.2 8 8.3 9 10 10 11.5

The median or second quartile is seven. The lower half of the data are 1, 1, 2, 2, 4, 6, 6.8. The middle value of the lower half is two. * * *

The number two, which is part of the data, is the first quartile. One-fourth of the entire sets of values are the same as or less than two and three-fourths of the values are more than two.

The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is nine.

The third quartile, Q3, is nine. Three-fourths (75%) of the ordered data set are less than nine. One-fourth (25%) of the ordered data set are greater than nine. The third quartile is part of the data set in this example.

The interquartile range is a number that indicates the spread of the middle half or the middle 50% of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1).

The IQR can help to determine potential outliers. A value is suspected to be a potential outlier if it is less than (1.5)(IQR) below the first quartile or more than (1.5)(IQR) above the third quartile. Potential outliers always require further investigation.

A potential outlier is a data point that is significantly different from the other data points. These special data points may be errors or some kind of abnormality or they may be a key to understanding the data.

For the following 13 real estate prices, calculate the IQR and determine if any prices are potential outliers. Prices are in dollars. * * *

389,950 230,500 158,000 479,000 639,000 114,950 5,500,000 387,000 659,000 529,000 575,000 488,800 1,095,000

Order the data from smallest to largest. * * *

114,950 158,000 230,500 387,000 389,950 479,000 488,800 529,000 575,000 639,000 659,000 1,095,000 5,500,000

IQR = 649,000 – 308,750 = 340,250

No house price is less than –201,625. However, 5,500,000 is more than 1,159,375. Therefore, 5,500,000 is a potential outlier.

For the following 11 salaries, calculate the IQR and determine if any salaries are outliers. The salaries are in dollars.

$33,000$64,500 $28,000$54,000 $72,000$68,500 $69,000$42,000 $54,000$120,000 $40,500 For the two data sets in the test scores example, find the following: 1. The interquartile range. Compare the two interquartile ranges. 2. Any outliers in either set. The five number summary for the day and night classes is The IQR for the day group is Q3Q1 = 82.5 – 56 = 26.5 The IQR for the night group is Q3Q1 = 89 – 78 = 11 The interquartile range (the spread or variability) for the day class is larger than the night class IQR. This suggests more variation will be found in the day class’s class test scores. Day class outliers are found using the IQR times 1.5 rule. So, Since the minimum and maximum values for the day class are greater than 16.25 and less than 122.25, there are no outliers. Night class outliers are calculated as: For this class, any test score less than 61.5 is an outlier. Therefore, the scores of 45 and 25.5 are outliers. Since no test score is greater than 105.5, there is no upper end outlier. Find the interquartile range for the following two data sets and compare them. 69 96 81 79 65 76 83 99 89 67 90 77 85 98 66 91 77 69 80 94 * * * 90 72 80 92 90 97 92 75 79 68 70 80 99 95 78 73 71 68 95 100 Fifty statistics students were asked how much sleep they get per school night (rounded to the nearest hour). The results were: AMOUNT OF SLEEP PER SCHOOL NIGHT (HOURS) FREQUENCY RELATIVE FREQUENCY CUMULATIVE RELATIVE FREQUENCY 4 2 0.04 0.04 5 5 0.10 0.14 6 7 0.14 0.28 7 12 0.24 0.52 8 14 0.28 0.80 9 7 0.14 0.94 10 3 0.06 1.00 Find the 28 th percentile. Notice the 0.28 in the "cumulative relative frequency" column. Twenty-eight percent of 50 data values is 14 values. There are 14 values less than the 28 th percentile. They include the two 4s, the five 5s, and the seven 6s. The 28 th percentile is between the last six and the first seven. The 28 th percentile is 6.5. Find the median. Look again at the "cumulative relative frequency" column and find 0.52. 
The median is the 50 th percentile or the second quartile. 50% of 50 is 25. There are 25 values less than the median. They include the two 4s, the five 5s, the seven 6s, and eleven of the 7s. The median or 50 th percentile is between the 25 th , or seven, and 26 th , or seven, values. The median is seven. Find the third quartile. The third quartile is the same as the 75 th percentile. You can "eyeball" this answer. If you look at the "cumulative relative frequency" column, you find 0.52 and 0.80. When you have all the fours, fives, sixes and sevens, you have 52% of the data. When you include all the 8s, you have 80% of the data. The 75 th percentile, then, must be an eight. Another way to look at the problem is to find 75% of 50, which is 37.5, and round up to 38. The third quartile, Q3, is the 38 th value, which is an eight. You can check this answer by counting the values. (There are 37 values below the third quartile and 12 values above.) Forty bus drivers were asked how many hours they spend each day running their routes (rounded to the nearest hour). Find the 65 th percentile. Amount of time spent on route (hours) Frequency Relative Frequency Cumulative Relative Frequency 2 12 0.30 0.30 3 14 0.35 0.65 4 10 0.25 0.90 5 4 0.10 1.00 1. Find the 80 th percentile. 2. Find the 90 th percentile. 3. Find the first quartile. What is another name for the first quartile? Using the data from the frequency table, we have: 1. The 80 th percentile is between the last eight and the first nine in the table (between the 40 th and 41 st values). Therefore, we need to take the mean of the 40 th an 41 st values. The 80 th percentile = 8 + 9 2 = 8.5 2. The 90 th percentile will be the 45 th data value (location is 0.90(50) = 45) and the 45 th data value is nine. 3. Q1 is also the 25 th percentile. The 25 th percentile location calculation: P25 = 0.25(50) = 12.5 ≈ 13 the 13 th data value. Thus, the 25th percentile is six. Refer to the [link]. Find the third quartile. 
What is another name for the third quartile?

Your instructor or a member of the class will ask everyone in class how many sweaters they own. Answer the following questions:

1. How many students were surveyed?
2. What kind of sampling did you do?
3. Construct two different histograms. For each, starting value = \_\_\_\_\_ ending value = \_\_\_\_.
4. Find the median, first quartile, and third quartile.
5. Construct a table of the data to find the following:
   1. the 10th percentile
   2. the 70th percentile
   3. the percent of students who own less than four sweaters

A Formula for Finding the kth Percentile

If you were to do a little research, you would find several formulas for calculating the kth percentile. Here is one of them.

k = the kth percentile. It may or may not be part of the data.
i = the index (ranking or position of a data value)
n = the total number of data

• Order the data from smallest to largest.
• Calculate \(i = \frac{k}{100}(n + 1)\).
• If i is an integer, then the kth percentile is the data value in the ith position in the ordered set of data.
• If i is not an integer, then round i up and round i down to the nearest integers. Average the two data values in these two positions in the ordered data set.

This is easier to understand in an example.

Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.

18 21 22 25 26 27 29 30 31 33 36 37 41 42 47 52 55 57 58 62 64 67 69 71 72 73 74 76 77

1. Find the 70th percentile.
2. Find the 83rd percentile.

1. With k = 70 and n = 29, \(i = \frac{70}{100}(29 + 1) = 21\). Twenty-one is an integer, and the data value in the 21st position in the ordered data set is 64. The 70th percentile is 64 years.
2. With k = 83 and n = 29, \(i = \frac{83}{100}(29 + 1) = 24.9\), which is NOT an integer. Round it down to 24 and up to 25. The age in the 24th position is 71 and the age in the 25th position is 72. Average 71 and 72. The 83rd percentile is 71.5 years.

Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.
18 21 22 25 26 27 29 30 31 33 36 37 41 42 47 52 55 57 58 62 64 67 69 71 72 73 74 76 77

Calculate the 20th percentile and the 55th percentile.

You can calculate percentiles using calculators and computers. There are a variety of online calculators.

A Formula for Finding the Percentile of a Value in a Data Set

• Order the data from smallest to largest.
• x = the number of data values counting from the bottom of the data list up to but not including the data value for which you want to find the percentile.
• y = the number of data values equal to the data value for which you want to find the percentile.
• n = the total number of data.
• Calculate \(\frac{x + 0.5y}{n}(100)\). Then round to the nearest integer.

Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.

18 21 22 25 26 27 29 30 31 33 36 37 41 42 47 52 55 57 58 62 64 67 69 71 72 73 74 76 77

1. Find the percentile for 58.
2. Find the percentile for 25.

1. Counting from the bottom of the list, there are 18 data values less than 58. There is one value of 58. x = 18 and y = 1. \(\frac{x + 0.5y}{n}(100) = \frac{18 + 0.5(1)}{29}(100) = 63.80\). 58 is the 64th percentile.
2. Counting from the bottom of the list, there are three data values less than 25. There is one value of 25. x = 3 and y = 1. \(\frac{x + 0.5y}{n}(100) = \frac{3 + 0.5(1)}{29}(100) = 12.07\). Twenty-five is the 12th percentile.

Listed are 30 ages for Academy Award winning best actors in order from smallest to largest.

18 21 22 25 26 27 29 30 31 31 33 36 37 41 42 47 52 55 57 58 62 64 67 69 71 72 73 74 76 77

Find the percentiles for 47 and 31.

Interpreting Percentiles, Quartiles, and Median

A percentile indicates the relative standing of a data value when data are sorted into numerical order from smallest to largest. Percentages of data values are less than or equal to the pth percentile. For example, 15% of data values are less than or equal to the 15th percentile.

• Low percentiles always correspond to lower data values.
• High percentiles always correspond to higher data values.
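Returning to the two formulas given earlier for the Academy Award ages, here is a minimal Python sketch of both calculations. The function names `kth_percentile` and `percentile_of_value` are illustrative assumptions, not part of the text, and the first function follows only the one formula shown here; other textbooks and software use different percentile conventions.

```python
import math

# Ages from the example above (already sorted).
AGES = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

def kth_percentile(data, k):
    """Value at the kth percentile using i = (k / 100) * (n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    i = k * (n + 1) / 100
    low, high = math.floor(i), math.ceil(i)
    if low < 1 or high > n:
        raise ValueError("position falls outside the data for this k")
    # When i is an integer, low == high and the average is that value itself.
    return (ordered[low - 1] + ordered[high - 1]) / 2

def percentile_of_value(data, value):
    """Percentile of a data value using (x + 0.5 * y) / n * 100."""
    x = sum(1 for v in data if v < value)   # values strictly below
    y = sum(1 for v in data if v == value)  # values equal to it
    if y == 0:
        raise ValueError("value is not in the data set")
    return (x + 0.5 * y) / len(data) * 100

print(kth_percentile(AGES, 70))
print(kth_percentile(AGES, 83))
print(round(percentile_of_value(AGES, 58)))
print(round(percentile_of_value(AGES, 25)))
```

Run on the 29 ages, this should reproduce the worked answers: 64 and 71.5 years for the 70th and 83rd percentiles, and the 64th and 12th percentiles for the ages 58 and 25.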
A percentile may or may not correspond to a value judgment about whether it is "good" or "bad." The interpretation of whether a certain percentile is "good" or "bad" depends on the context of the situation to which the data applies. In some situations, a low percentile would be considered "good"; in other contexts a high percentile might be considered "good." In many situations, there is no value judgment that applies.

Understanding how to interpret percentiles properly is important not only when describing data, but also when calculating probabilities in later chapters of this text.

When writing the interpretation of a percentile in the context of the given data, the sentence should contain the following information.

• information about the context of the situation being considered
• the data value (value of the variable) that represents the percentile
• the percent of individuals or items with data values below the percentile
• the percent of individuals or items with data values above the percentile

On a timed math test, the first quartile for time it took to finish the exam was 35 minutes. Interpret the first quartile in the context of this situation.

• Twenty-five percent of students finished the exam in 35 minutes or less.
• Seventy-five percent of students finished the exam in 35 minutes or more.
• A low percentile could be considered good, as finishing more quickly on a timed exam is desirable. (If you take too long, you might not be able to finish.)

For the 100-meter dash, the third quartile for times for finishing the race was 11.5 seconds. Interpret the third quartile in the context of the situation.

On a 20 question math test, the 70th percentile for number of correct answers was 16. Interpret the 70th percentile in the context of this situation.

On a 60 point written assignment, the 80th percentile for the number of points earned was 49. Interpret the 80th percentile in the context of this situation.
At a community college, it was found that the 30th percentile of credit units that students are enrolled for is seven units. Interpret the 30th percentile in the context of this situation.

During a season, the 40th percentile for points scored per player in a game is eight. Interpret the 40th percentile in the context of this situation.

Sharpe Middle School is applying for a grant that will be used to add fitness equipment to the gym. The principal surveyed 15 anonymous students to determine how many minutes a day the students spend exercising. The results from the 15 anonymous students are shown.

0 minutes, 40 minutes, 60 minutes, 30 minutes, 60 minutes, 10 minutes, 45 minutes, 30 minutes, 300 minutes, 90 minutes, 30 minutes, 120 minutes, 60 minutes, 0 minutes, 20 minutes

Determine the following five values.

Min = 0, Q1 = 20, Med = 40, Q3 = 60, Max = 300

If you were the principal, would you be justified in purchasing new fitness equipment? Since 75% of the students exercise for 60 minutes or less daily, and since the IQR is 40 minutes (60 – 20 = 40), we know that half of the students surveyed exercise between 20 minutes and 60 minutes daily. This seems a reasonable amount of time spent exercising, so the principal would be justified in purchasing the new equipment.

However, the principal needs to be careful. The value 300 appears to be a potential outlier. The value 300 is greater than Q3 + 1.5(IQR) = 60 + 1.5(40) = 120, so it is a potential outlier. If we delete it and calculate the five values, we get the following values:

Min = 0, Q1 = 20, Med = 35, Q3 = 60, Max = 120

We still have 75% of the students exercising for 60 minutes or less daily and half of the students exercising between 20 and 60 minutes a day. However, 15 students is a small sample and the principal should survey more students to be sure of his survey results.

References

Cauchon, Dennis, Paul Overberg. "Census data shows minorities now a majority of U.S. births." USA Today, 2012. Available online at http://usatoday30.usatoday.com/news/nation/story/2012-05-17/minority-birthscensus/55029100/1 (accessed April 3, 2013).
Data from the United States Department of Commerce: United States Census Bureau. Available online at http://www.census.gov/ (accessed April 3, 2013).

"1990 Census." United States Department of Commerce: United States Census Bureau. Available online at http://www.census.gov/main/www/cen1990.html (accessed April 3, 2013).

Data from San Jose Mercury News.

Data from Time Magazine survey by Yankelovich Partners, Inc.

Chapter Review

The values that divide a rank-ordered set of data into 100 equal parts are called percentiles. Percentiles are used to compare and interpret data. For example, an observation at the 50th percentile would be greater than 50 percent of the other observations in the set. Quartiles divide data into quarters. The first quartile (Q1) is the 25th percentile, the second quartile (Q2 or median) is the 50th percentile, and the third quartile (Q3) is the 75th percentile. The interquartile range, or IQR, is the range of the middle 50 percent of the data values. The IQR is found by subtracting Q1 from Q3, and can help determine outliers by using the following two expressions.

• Q3 + 1.5(IQR)
• Q1 – 1.5(IQR)

Formula Review

\(i = \frac{k}{100}(n + 1)\)

where i = the ranking or position of a data value, k = the kth percentile, and n = the total number of data.

Expression for finding the percentile of a data value: \(\left(\frac{x + 0.5y}{n}\right)(100)\)

where x = the number of values counting from the bottom of the data list up to but not including the data value for which you want to find the percentile, y = the number of data values equal to the data value for which you want to find the percentile, and n = the total number of data.

Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.

18 21 22 25 26 27 29 30 31 33 36 37 41 42 47 52 55 57 58 62 64 67 69 71 72 73 74 76 77

Listed are 32 ages for Academy Award winning best actors in order from smallest to largest.

18 18 21 22 25 26 27 29 30 31 31 33 36 37 37 41 42 47 52 55 57 58 62 64 67 69 71 72 73 74 76 77

Jesse was ranked 37th in his graduating class of 180 students. At what percentile is Jesse's ranking?
Jesse graduated 37th out of a class of 180 students. There are 180 – 37 = 143 students ranked below Jesse. There is one rank of 37. x = 143 and y = 1. \(\frac{x + 0.5y}{n}(100) = \frac{143 + 0.5(1)}{180}(100) = 79.72\). Jesse's rank of 37 puts him at the 80th percentile.

1. For runners in a race, a low time means a faster run. The winners in a race have the shortest running times. Is it more desirable to have a finish time with a high or a low percentile when running a race?
2. The 20th percentile of run times in a particular race is 5.2 minutes. Write a sentence interpreting the 20th percentile in the context of the situation.
3. A bicyclist in the 90th percentile of a bicycle race completed the race in 1 hour and 12 minutes. Is he among the fastest or slowest cyclists in the race? Write a sentence interpreting the 90th percentile in the context of the situation.

1. For runners in a race, a higher speed means a faster run. Is it more desirable to have a speed with a high or a low percentile when running a race?
2. The 40th percentile of speeds in a particular race is 7.5 miles per hour. Write a sentence interpreting the 40th percentile in the context of the situation.

1. For runners in a race it is more desirable to have a high percentile for speed. A high percentile means a higher speed, which is faster.
2. 40% of runners ran at speeds of 7.5 miles per hour or less (slower). 60% of runners ran at speeds of 7.5 miles per hour or more (faster).

On an exam, would it be more desirable to earn a grade with a high or low percentile? Explain.

Mina is waiting in line at the Department of Motor Vehicles (DMV). Her wait time of 32 minutes is the 85th percentile of wait times. Is that good or bad? Write a sentence interpreting the 85th percentile in the context of this situation.

When waiting in line at the DMV, the 85th percentile would be a long wait time compared to the other people waiting. 85% of people had shorter wait times than Mina.
In this context, Mina would prefer a wait time corresponding to a lower percentile. 85% of people at the DMV waited 32 minutes or less. 15% of people at the DMV waited 32 minutes or longer.

In a survey collecting data about the salaries earned by recent college graduates, Li found that her salary was in the 78th percentile. Should Li be pleased or upset by this result? Explain.

In a study collecting data about the repair costs of damage to automobiles in a certain type of crash test, a certain model of car had $1,700 in damage and was in the 90th percentile. Should the manufacturer and the consumer be pleased or upset by this result? Explain and write a sentence that interprets the 90th percentile in the context of this problem.

The manufacturer and the consumer would be upset. This is a large repair cost for the damages, compared to the other cars in the sample. INTERPRETATION: 90% of the crash tested cars had damage repair costs of $1,700 or less; only 10% had damage repair costs of $1,700 or more.

The University of California has two criteria used to set admission standards for freshmen to be admitted to a college in the UC system:

1. Students’ GPAs and scores on standardized tests (SATs and ACTs) are entered into a formula that calculates an “admissions index” score. The admissions index score is used to set eligibility standards intended to meet the goal of admitting the top 12% of high school students in the state. In this context, what percentile does the top 12% represent?
2. Students whose GPAs are at or above the 96th percentile of all students at their high school are eligible (called eligible in the local context), even if they are not in the top 12% of all students in the state. What percentage of students from each high school are "eligible in the local context"?

Suppose that you are buying a house. You and your realtor have determined that the most expensive house you can afford is the 34th percentile. The 34th percentile of housing prices is $240,000 in the town you want to move to. In this town, can you afford 34% of the houses or 66% of the houses?

You can afford 34% of houses. 66% of the houses are too expensive for your budget. INTERPRETATION: 34% of houses cost $240,000 or less. 66% of houses cost $240,000 or more.

Use the following information to answer the next six exercises. Sixty-five randomly selected car salespersons were asked the number of cars they generally sell in one week. Fourteen people answered that they generally sell three cars; nineteen generally sell four cars; twelve generally sell five cars; nine generally sell six cars; eleven generally sell seven cars.

Q-1- Measures of the Location of the Data

• i. Identify the type of data (quantitative - discrete, quantitative - continuous, or qualitative) that would be used to describe a response.
• ii. Give an example of the data.

• a. Number of tickets sold to a concert
• b. Amount of body fat
• c. Favorite baseball team
• d. Time in line to buy groceries
• e. Number of students enrolled at Evergreen Valley College
• f. Most-watched television show
• g. Brand of toothpaste
• h. Distance to the closest movie theatre
• i. Age of executives in Fortune 500 companies
• j. Number of competing computer spreadsheet software packages

• a. quantitative - discrete
• b. quantitative - continuous
• c. qualitative
• d. quantitative - continuous
• e. quantitative - discrete
• f. qualitative
• g. qualitative
• h. quantitative - continuous
• i. quantitative - continuous
• j. quantitative - discrete

Fifty part-time students were asked how many courses they were taking this term. The (incomplete) results are shown below:

• a. Fill in the blanks in the table above.
• b. What percent of students take exactly two courses?
• c. What percent of students take one or two courses?

Sixty adults with gum disease were asked the number of times per week they used to floss before their diagnoses. The (incomplete) results are shown below:

• a. Fill in the blanks in the table above.
• b. What percent of adults flossed six times per week?
• c. What percent flossed at most three times per week?

A fitness center is interested in the average amount of time a client exercises in the center each week. Define the following in terms of the study. Give examples where appropriate.

• a. Population
• b. Sample
• c. Parameter
• d. Statistic
• e. Variable
• f. Data

Ski resorts are interested in the average age that children take their first ski and snowboard lessons. They need this information to optimally plan their ski classes. Define the following in terms of the study. Give examples where appropriate.

• a. Population
• b. Sample
• c. Parameter
• d. Statistic
• e. Variable
• f. Data

• a. Children who take ski or snowboard lessons
• b. A group of these children
• c. The population average
• d. The sample average
• e. X = the age of one child who takes the first ski or snowboard lesson
• f. A value for X, such as 3, 7, etc.

A cardiologist is interested in the average recovery period for her patients who have had heart attacks. Define the following in terms of the study. Give examples where appropriate.

• a. Population
• b. Sample
• c. Parameter
• d. Statistic
• e. Variable
• f. Data

Insurance companies are interested in the average health costs each year for their clients, so that they can determine the costs of health insurance. Define the following in terms of the study. Give examples where appropriate.

• a. Population
• b. Sample
• c. Parameter
• d. Statistic
• e. Variable
• f. Data

• a. The clients of the insurance companies
• b. A group of the clients
• c. The average health costs of the clients
• d. The average health costs of the sample
• e. X = the health costs of one client
• f. A value for X, such as 34, 9, 82, etc.

A politician is interested in the proportion of voters in his district that think he is doing a good job. Define the following in terms of the study. Give examples where appropriate.

• a. Population
• b. Sample
• c. Parameter
• d. Statistic
• e. Variable
• f. Data

A marriage counselor is interested in the proportion of the clients she counsels that stay married. Define the following in terms of the study. Give examples where appropriate.

• a. Population
• b. Sample
• c. Parameter
• d. Statistic
• e. Variable
• f. Data

• a. All the clients of the counselor
• b. A group of the clients
• c. The proportion of all her clients who stay married
• d. The proportion of the sample who stay married
• e. X = the number of couples who stay married
• f. yes, no

Political pollsters may be interested in the proportion of people that will vote for a particular cause. Define the following in terms of the study. Give examples where appropriate.

• a. Population
• b. Sample
• c. Parameter
• d. Statistic
• e. Variable
• f. Data

A marketing company is interested in the proportion of people that will buy a particular product. Define the following in terms of the study. Give examples where appropriate.

• a. Population
• b. Sample
• c. Parameter
• d. Statistic
• e. Variable
• f. Data

• a. All people (maybe in a certain geographic area, such as the United States)
• b. A group of the people
• c. The proportion of all people who will buy the product
• d. The proportion of the sample who will buy the product
• e. X = the number of people who will buy it
• f. buy, not buy

Airline companies are interested in the consistency of the number of babies on each flight, so that they have adequate safety equipment. Suppose an airline conducts a survey. Over Thanksgiving weekend, it surveys 6 flights from Boston to Salt Lake City to determine the number of babies on the flights. It determines the amount of safety equipment needed by the result of that study.

• a. Using complete sentences, list three things wrong with the way the survey was conducted.
• b. Using complete sentences, list three ways that you would improve the survey if it were to be repeated.

Suppose you want to determine the average number of students per statistics class in your state. Describe a possible sampling method in 3-5 complete sentences. Make the description detailed.

Suppose you want to determine the average number of cans of soda drunk each month by persons in their twenties. Describe a possible sampling method in 3-5 complete sentences. Make the description detailed.

726 distance learning students at Long Beach City College in the 2004-2005 academic year were surveyed and asked the reasons they took a distance learning class. (Source: Amit Schitai, Director of Instructional Technology and Distance Learning, LBCC.) The results of this survey are listed in the table below.

Reasons for Taking LBCC Distance Learning Courses
Convenience 87.6%
Unable to come to campus 85.1%
Taking on-campus courses in addition to my DL course 71.7%
Instructor has a good reputation 69.1%
To fulfill requirements for transfer 60.8%
To fulfill requirements for Associate Degree 53.6%
Thought DE would be more varied and interesting 53.2%
I like computer technology 52.1%
Had success with previous DL course 52.0%
On-campus sections were full 42.1%
To fulfill requirements for vocational certification 27.1%
Because of disability 20.5%

Assume that the survey allowed students to choose from the responses listed in the table above.

• a. Why can the percents add up to over 100%?
• b. Does that necessarily imply a mistake in the report?
• c. How do you think the question was worded to get responses that totaled over 100%?
• d. How might the question be worded to get responses that totaled 100%?

Nineteen immigrants to the U.S. were asked how many years, to the nearest year, they have lived in the U.S. The data are as follows:

2 5 7 2 2 10 20 15 0 7 0 20 5 12 15 12 4 5 10

The following table was produced:

• a. Fix the errors on the table. Also, explain how someone might have arrived at the incorrect number(s).
• b. Explain what is wrong with this statement: "47 percent of the people surveyed have lived in the U.S. for 5 years."
• c. Fix the statement above to make it correct.
• d. What fraction of the people surveyed have lived in the U.S. 5 or 7 years?
• e. What fraction of the people surveyed have lived in the U.S. at most 12 years?
• f. What fraction of the people surveyed have lived in the U.S. fewer than 12 years?
• g. What fraction of the people surveyed have lived in the U.S. from 5 to 20 years, inclusive?

A "random survey" was conducted of 3274 people of the "microprocessor generation" (people born since 1971, the year the microprocessor was invented). It was reported that 48% of those individuals surveyed stated that if they had $2000 to spend, they would use it for computer equipment. Also, 66% of those surveyed considered themselves relatively savvy computer users. (Source: San Jose Mercury News)

• a. Do you consider the sample size large enough for a study of this type? Why or why not?
• b. Based on your "gut feeling," do you believe the percents accurately reflect the U.S. population for those individuals born since 1971? If not, do you think the percents of the population are actually higher or lower than the sample statistics? Why?

Additional information: The survey was reported by Intel Corporation of individuals who visited the Los Angeles Convention Center to see the Smithsonian Institution's road show called "America's Smithsonian."

• c. With this additional information, do you feel that all demographic and ethnic groups were equally represented at the event? Why or why not?
• d. With the additional information, comment on how accurately you think the sample statistics reflect the population parameters.
• a. List some practical difficulties involved in getting accurate results from a telephone survey.
• b. List some practical difficulties involved in getting accurate results from a mailed survey.
• c. With your classmates, brainstorm some ways to overcome these problems if you needed to conduct a phone or mail survey.

Try these multiple choice questions

The next four questions refer to the following: A Lake Tahoe Community College instructor is interested in the average number of days Lake Tahoe Community College math students are absent from class during a quarter.

What is the population she is interested in?

• A. All Lake Tahoe Community College students
• B. All Lake Tahoe Community College English students
• C. All Lake Tahoe Community College students in her classes
• D. All Lake Tahoe Community College math students

X = number of days a Lake Tahoe Community College math student is absent

In this case, X is an example of a:

The instructor takes her sample by gathering data on 5 randomly selected students from each Lake Tahoe Community College math class. The type of sampling she used is

• A. Cluster sampling
• B. Stratified sampling
• C. Simple random sampling
• D. Convenience sampling

The instructor's sample produces an average number of days absent of 3.5 days. This value is an example of a

The next two questions refer to the following relative frequency table on hurricanes that have made direct hits on the U.S between 1851 and 2004. Hurricanes are given a strength category rating based on the minimum wind speed generated by the storm. ( http://www.nhc.noaa.gov/gifs/table5.gif )

Frequency of Hurricane Direct Hits (Total = 273)

Category Number of Direct Hits Relative Frequency Cumulative Frequency
1 109 0.3993 0.3993
2 72 0.2637 0.6630
3 71 0.2601 ____
4 18 ____ 0.9890
5 3 0.0110 1.0000

What is the relative frequency of direct hits that were category 4 hurricanes?

What is the relative frequency of direct hits that were AT MOST a category 3 storm?

The next three questions refer to the following: A study was done to determine the age, number of times per week and the duration (amount of time) of resident use of a local park in San Jose. The first house in the neighborhood around the park was selected randomly and then every 8th house in the neighborhood around the park was interviewed.

"Number of times per week" is what type of data?

"Duration (amount of time)" is what type of data?


Introduction

The common measures of location are quartiles and percentiles.

Quartiles are special percentiles. The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile. The median, M, is called both the second quartile and the 50th percentile.

To calculate quartiles and percentiles, you must order the data from smallest to largest. Quartiles divide ordered data into quarters. Percentiles divide ordered data into hundredths. Recall that a percent means one-hundredth. So, percentiles mean the data is divided into 100 sections. To score in the 90th percentile of an exam does not mean, necessarily, that you received 90 percent on a test. It means that 90 percent of test scores are the same as or less than your score and that 10 percent of the test scores are the same as or greater than your test score.

Percentiles are useful for comparing values. For this reason, universities and colleges use percentiles extensively. One instance in which colleges and universities use percentiles is when SAT results are used to determine a minimum testing score that will be used as an acceptance factor. For example, suppose Duke accepts SAT scores at or above the 75th percentile. That translates into a score of at least 1220.

Percentiles are mostly used with very large populations. Therefore, if you were to say that 90 percent of the test scores are less—and not the same or less—than your score, it would be acceptable because removing one particular data value is not significant.

The median is a number that measures the center of the data. You can think of the median as the middle value, but it does not actually have to be one of the observed values. It is a number that separates ordered data into halves. Half the values are the same number or smaller than the median, and half the values are the same number or larger. For example, consider the following data:

1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5

Since there are 14 observations (an even number of data values), the median is between the seventh value, 6.8, and the eighth value, 7.2. To find the median, add the two values together and divide by two.

\(\frac{6.8 + 7.2}{2} = 7\)

The median is seven. Half of the values are smaller than seven and half of the values are larger than seven.

Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the data. To find the quartiles, first find the median, or second quartile. The first quartile, Q1, is the middle value, or median, of the lower half of the data, and the third quartile, Q3, is the middle value, or median, of the upper half of the data. To get the idea, consider the same data set:

1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5

The data set has an even number of values (14 data values), so the median will be the average of the two middle values (the average of 6.8 and 7.2), which is calculated as \(\frac{6.8 + 7.2}{2}\) and equals 7.

So, the median, or second quartile (Q2), is 7.

The first quartile is the median of the lower half of the data, so if we divide the data into seven values in the lower half and seven values in the upper half, we can see that we have an odd number of values in the lower half. Thus, the median of the lower half, or the first quartile (Q1), will be the middle value, or 2. Using the same procedure, we can see that the median of the upper half, or the third quartile (Q3), will be the middle value of the upper half, or 9.
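The median-of-halves procedure just described can be sketched in Python as follows. This is a minimal illustration, not a standard library routine: the helper name `quartiles` is an assumption, and when the number of values is odd it leaves the middle value out of both halves, which is one of several conventions in use.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + (n % 2):]  # skip the middle value when n is odd
    return median(lower), median(data), median(upper)

data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)
```

For the 14 values above, the lower and upper halves each hold seven values, so Q1 and Q3 are the fourth value of each half (2 and 9), and the median is the average of 6.8 and 7.2.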

The quartiles are illustrated below:

The interquartile range is a number that indicates the spread of the middle half, or the middle 50 percent, of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1).

IQR = Q3 – Q1. The IQR for this data set is calculated as 9 minus 2, or 7.

The IQR can help to determine potential outliers. A value is suspected to be a potential outlier if it lies more than 1.5 × IQR below the first quartile (that is, below Q1 – 1.5 × IQR) or more than 1.5 × IQR above the third quartile (that is, above Q3 + 1.5 × IQR). Potential outliers always require further investigation.

A potential outlier is a data point that is significantly different from the other data points. These special data points may be errors or some kind of abnormality, or they may be a key to understanding the data.

Example 2.15

For the following 13 real estate prices, calculate the IQR and determine if any prices are potential outliers. Prices are in dollars.

389,950 230,500 158,000 479,000 639,000 114,950 5,500,000 387,000 659,000 529,000 575,000 488,800 1,095,000

Order the following data from smallest to largest:

114,950, 158,000, 230,500, 387,000, 389,950, 479,000, 488,800, 529,000, 575,000, 639,000, 659,000, 1,095,000, 5,500,000.

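M = 488,800

Q1 = (230,500 + 387,000) / 2 = 308,750

Q3 = (639,000 + 659,000) / 2 = 649,000

IQR = 649,000 – 308,750 = 340,250

1.5 × IQR = 1.5 × 340,250 = 510,375

Q1 – 1.5 × IQR = 308,750 – 510,375 = –201,625

Q3 + 1.5 × IQR = 649,000 + 510,375 = 1,159,375

No house price is less than –201,625. However, 5,500,000 is more than 1,159,375. Therefore, 5,500,000 is a potential outlier.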
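The same check can be sketched in Python. This is only an illustration under stated assumptions: the helper name `iqr_fences` is hypothetical, and the quartiles are taken as medians of the lower and upper halves (middle value excluded when the count is odd), as in the worked example.

```python
from statistics import median

def iqr_fences(values):
    """Return Q1, Q3, IQR and the lower/upper 1.5 * IQR fences."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])            # median of the lower half
    q3 = median(data[half + (n % 2):])  # median of the upper half
    iqr = q3 - q1
    return q1, q3, iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr

# The 13 real estate prices from the example, in dollars.
prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]

q1, q3, iqr, low, high = iqr_fences(prices)
outliers = [p for p in prices if p < low or p > high]
print(q1, q3, iqr, low, high, outliers)
```

Only the price 5,500,000 falls outside the fences, which agrees with the conclusion of the example. A flagged value is a candidate for further investigation, not automatically an error.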

For the 11 salaries, calculate the IQR and determine if any salaries are outliers. The following salaries are in dollars:

33,000, 64,500, 28,000, 54,000, 72,000, 68,500, 69,000, 42,000, 54,000, 120,000, 40,500

In the example above, you just saw the calculation of the median, first quartile, and third quartile. These three values are part of the five number summary. The other two values are the minimum value (or min) and the maximum value (or max). The five number summary is used to create a box plot.

Find the interquartile range for the following two data sets and compare them:

69, 96, 81, 79, 65, 76, 83, 99, 89, 67, 90, 77, 85, 98, 66, 91, 77, 69, 80, 94

90, 72, 80, 92, 90, 97, 92, 75, 79, 68, 70, 80, 99, 95, 78, 73, 71, 68, 95, 100

Example 2.16

Fifty statistics students were asked how much sleep they get per school night (rounded to the nearest hour). The results were as follows:

Amount of Sleep per School Night (Hours) Frequency Relative Frequency Cumulative Relative Frequency
4 2 0.04 0.04
5 5 0.10 0.14
6 7 0.14 0.28
7 12 0.24 0.52
8 14 0.28 0.80
9 7 0.14 0.94
10 3 0.06 1.00

Find the 28th percentile. Notice the 0.28 in the Cumulative Relative Frequency column. Twenty-eight percent of 50 data values is 14 values. There are 14 values less than the 28th percentile. They include the two 4s, the five 5s, and the seven 6s. The 28th percentile is between the last six and the first seven. The 28th percentile is 6.5.

Find the median. Look again at the Cumulative Relative Frequency column and find 0.52. The median is the 50th percentile or the second quartile. Fifty percent of 50 is 25. There are 25 values less than the median. They include the two 4s, the five 5s, the seven 6s, and 11 of the 7s. The median or 50th percentile is between the 25th, or seven, and 26th, or seven, values. The median is seven.

Find the third quartile. The third quartile is the same as the 75th percentile. You can eyeball this answer. If you look at the Cumulative Relative Frequency column, you find 0.52 and 0.80. When you have all the fours, fives, sixes, and sevens, you have 52 percent of the data. When you include all the 8s, you have 80 percent of the data. The 75th percentile, then, must be an eight. Another way to look at the problem is to find 75 percent of 50, which is 37.5, and round up to 38. The third quartile, Q3, is the 38th value, which is an eight. You can check this answer by counting the values. There are 37 values below the third quartile and 12 values above.
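The three calculations above follow one rule, which can be sketched in Python: compute k percent of n; if the result is a whole number, average that value and the next one; otherwise round up and take that value. The function name `percentile_from_counts` and the dictionary layout are illustrative choices, not part of the example.

```python
# Frequency table from the sleep example: hours -> number of students.
sleep_hours = {4: 2, 5: 5, 6: 7, 7: 12, 8: 14, 9: 7, 10: 3}

def percentile_from_counts(freq, k):
    """Return the k-th percentile (k an integer, 0 < k < 100) of a frequency table."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    # Expand the table into the ordered list of individual values.
    data = [value for value in sorted(freq) for _ in range(freq[value])]
    n = len(data)
    # position is the whole part of k% of n; remainder tells us if it was exact.
    position, remainder = divmod(k * n, 100)
    if remainder == 0:
        # Exact position: average that value and the next (1-based positions).
        return (data[position - 1] + data[position]) / 2
    # Otherwise round up: the (position + 1)-th value, i.e. index `position`.
    return data[position]

p28 = percentile_from_counts(sleep_hours, 28)
p50 = percentile_from_counts(sleep_hours, 50)
p75 = percentile_from_counts(sleep_hours, 75)
print(p28, p50, p75)
```

Integer arithmetic (`divmod(k * n, 100)`) is used instead of `k / 100 * n` so that the "is it a whole number" test is not affected by floating-point rounding.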

Forty bus drivers were asked how many hours they spend each day running their routes (rounded to the nearest hour). Find the 65th percentile.

Mean Median Mode Calculation

A measure of central tendency is a single value that describes a set of data by identifying the central position within that set. Measures of central tendency are sometimes called measures of central location. The three most common are the mean, the median, and the mode. The mean is the sum of the data values divided by the number of values. The median is the middle value of the ordered data (or the average of the two middle values when the number of values is even), and the mode is the value that occurs most often.

The purpose of identifying a “central” value from a data set was to describe a typical value in the data set. Once we know this, we can measure the amount of dispersion or spread of the data values from the typical, central, value. In other words, we’re going to calculate how “spread out” our data is. Three main measures of dispersion for a data set are the range, the variance, and the standard deviation.

The Range

The range of a variable is simply the “distance” between the largest data value and the smallest data value. In math symbols:

Range = largest data value – smallest data value

The table shown provides the first exam scores for a class with 11 students. The range for this data set is:

Calculating the range requires the use of only two values: the smallest and largest data values. If either of the two values changes, so does the range. Therefore, the range clearly is not resistant to extreme values in the data set. No other data values affect the range.
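To illustrate that the range is not resistant to extreme values, here is a small sketch. Because the exam-score table is not reproduced here, it borrows the real estate prices from Example 2.15 instead; the helper name `value_range` is an illustrative choice.

```python
def value_range(values):
    """Range = largest data value - smallest data value."""
    return max(values) - min(values)

prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]

full_range = value_range(prices)

# Drop the single extreme price and recompute: only the max changed,
# yet the range shrinks dramatically.
without_extreme = [p for p in prices if p != 5500000]
reduced_range = value_range(without_extreme)

print(full_range, reduced_range)
```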

Sample Variance

The variance of a data set is a numerical summary that indicates how far, on average, the data values lie from the mean of the data set. Calculating the variance requires us to compare each data value from our raw list, <x1, x2, x3, …, x(n-1), xn>, to the mean, x̄. The idea of deviation is just the difference, as computed by subtraction. In symbols, the deviation about the mean for the i-th data value, xi, is the value (xi – x̄).

Because of the definition of the mean of a data set, if you add up the deviations from the mean for all the data values, you will always get zero. In symbols, Σ(xi – x̄) = 0. This is a bit technical, but what it basically means is that we cannot just average the deviations. We’d always get zero!
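For readers who want the missing step, the claim follows directly from the definition of the mean. The derivation below uses standard notation, with n as the number of data values:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
```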

To get around this, we need a way to make all deviations from the mean positive, regardless of whether a data value is below or above the mean. For example, if you live two miles north of a city and I live two miles south of the same city, it would be ridiculous to say, “I live negative 2 miles from the city.” We both live two miles away.

Mathematically, one way to make all deviations positive is to use an absolute value. Another way, which we’ll use for calculating both the variance and standard deviation of a data set, is to square each deviation. For the city example, your deviation value would be 2² = 4 and my deviation value would be (–2)² = 4. Therefore, our deviations, regardless of being positive or negative, would be the same! So, to treat positive differences and negative differences as the same, we square the deviations: (xi – x̄)².
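A short sketch makes the contrast concrete. The data values below are hypothetical, chosen only so that the mean is a whole number and the arithmetic is exact.

```python
values = [2, 10, 3]  # hypothetical data with mean 5

mean = sum(values) / len(values)

# Raw deviations cancel out: negatives offset positives.
deviations = [x - mean for x in values]
sum_of_deviations = sum(deviations)

# Squaring (or taking absolute values) removes the sign.
squared_deviations = [d ** 2 for d in deviations]
absolute_deviations = [abs(d) for d in deviations]

print(deviations, sum_of_deviations)
print(squared_deviations, sum(squared_deviations))
print(absolute_deviations, sum(absolute_deviations))
```

With data whose mean is not exactly representable in floating point, the sum of raw deviations may come out as a tiny nonzero number rather than exactly zero.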

Finally, since the variance measures the average deviation of each data value from the mean of the entire data set, we add up the squared-deviation value for each data point and divide by the value (n – 1), one fewer than the number of data values. This is another technical “difficulty” that we’ll deal with later. The value (n – 1) is given the special designation degrees of freedom of a data set. The reason for this will be made clearer throughout the class, but imagine the following simple scenario: You and four friends go to a Chinese restaurant, and at the end of the meal, your server brings your group 5 fortune cookies, setting them in a pile in the middle of the table. How many of your party of 5 get to actually choose their fortune? Only 4. The reason is obvious: after 4 people have had their choice of fortune cookies, only one remains. The fifth person has no choice of fortune. The degrees of freedom for this “problem” is thus 5 – 1 = 4.

Here’s another example, this time from a math standpoint: if someone tells you that they are thinking of 3 numbers whose average is 5, how many of the three numbers do you need to know before you know all 3? After a little thought, you’ll realize the answer is 2. If you are told that two of the numbers are 2 and 10, a little thought (and some algebra) will help you find that the last number must be 3, because the three numbers must add up to 3 × 5 = 15 and 15 – 2 – 10 = 3.

Again, the degrees of freedom for this problem is 3 – 1 = 2.
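The algebra behind this example can be sketched in a few lines: once the mean and all but one of the values are fixed, the last value is forced. The function name `missing_value` is hypothetical.

```python
def missing_value(known_values, target_mean, total_count):
    """Return the one remaining value implied by a known mean.

    The values must sum to target_mean * total_count, so the last
    value is whatever is left after subtracting the known ones.
    """
    if len(known_values) != total_count - 1:
        raise ValueError("exactly one value must be unknown")
    return target_mean * total_count - sum(known_values)

last_number = missing_value([2, 10], target_mean=5, total_count=3)
print(last_number)
```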

In all its glory, the math formula for calculating the sample variance is:

s² = Σ(xi – x̄)² / (n – 1)

where n is the size of the sample.
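The procedure described in this section (square each deviation from the mean, add them up, divide by n – 1) can be sketched as follows. The function name `sample_variance` and the data are illustrative; the result is compared against the standard library's `statistics.variance`, which also divides by n – 1.

```python
from statistics import variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

data = [1, 2, 3, 4, 5]  # hypothetical sample: mean 3, squared deviations 4, 1, 0, 1, 4

by_hand = sample_variance(data)
from_library = variance(data)
print(by_hand, from_library)
```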

Example: Calculations for a Sample Variance

Returning to the population of exam scores for a class with 11 students, the table above illustrates the (sometimes tedious!) calculations for the population variance. The mean of this population of data is 82.

The total squared deviation for the population data is 1272. Therefore, the variance for the data set is:
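The resulting value is not shown here, and the passage calls the scores a population while the heading refers to a sample variance. As a hedged illustration only, the sketch below computes the variance under both conventions from the stated total squared deviation (1272) and class size (11); which divisor the source intended is an assumption the reader should check.

```python
total_squared_deviation = 1272  # stated in the text
n = 11                          # number of exam scores

# Sample convention: divide by the degrees of freedom, n - 1.
variance_sample_convention = total_squared_deviation / (n - 1)

# Population convention: divide by the number of values, n.
variance_population_convention = total_squared_deviation / n

print(variance_sample_convention)
print(round(variance_population_convention, 2))
```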

Measures of Central Tendency

Measures of Central Tendency provide a summary measure that attempts to describe a whole set of data with a single value that represents the middle or centre of its distribution. There are three main measures of central tendency: the mean, the median and the mode.

Mean

The mean of a data set is also known as the average value. It is calculated by dividing the sum of all values in a data set by the number of values.

So in a data set of 1, 2, 3, 4, 5, we would calculate the mean by adding the values (1+2+3+4+5) and dividing by the total number of values (5). Our mean then is 15/5, which equals 3.

Disadvantages of the mean as a measure of central tendency are that it is highly susceptible to outliers (observations that are markedly distant from the bulk of the observations in a data set) and that it is not appropriate when the data are skewed rather than approximately normally distributed.
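A quick sketch shows how strongly one outlier can pull the mean. The second data set is hypothetical: it is the 1, 2, 3, 4, 5 example with the largest value replaced by 500.

```python
from statistics import mean, median

values = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]  # hypothetical: 5 replaced by an extreme value

mean_plain = mean(values)
mean_outlier = mean(with_outlier)

# The median is shown for contrast; it does not move at all here.
median_plain = median(values)
median_outlier = median(with_outlier)

print(mean_plain, mean_outlier)
print(median_plain, median_outlier)
```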

Median

The median of a data set is the value that is at the middle of a data set arranged from smallest to largest.

In the data set 1, 2, 3, 4, 5, the median is 3.

In a data set with an even number of observations, the median is calculated by dividing the sum of the two middle values by two. So in: 1, 2, 3, 4, 5, 6, the median is (3+4)/2, which equals 3.5.
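The odd and even cases described above can be written as one small function. This is a minimal sketch; the name `median_of` is illustrative, and in practice the standard library's `statistics.median` does the same job.

```python
def median_of(values):
    """Middle value of the ordered data; average of the two middle values if the count is even."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

odd_case = median_of([1, 2, 3, 4, 5])
even_case = median_of([1, 2, 3, 4, 5, 6])
print(odd_case, even_case)
```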

The median is appropriate to use with ordinal variables, and with interval variables with a skewed distribution.

Mode

The mode is the most common observation of a data set, or the value in the data set that occurs most frequently.

One disadvantage of the mode is that it may not be unique: two modes can appear in one data set (e.g., in 1, 2, 2, 3, 4, 5, 5, both 2 and 5 are modes).

The mode is an appropriate measure to use with categorical data.
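Python's standard `statistics.multimode` function (available from Python 3.8) returns every value tied for most frequent, which makes the two-mode case easy to see. The categorical list of colours below is a hypothetical example, not data from the text.

```python
from statistics import multimode

numbers = [1, 2, 2, 3, 4, 5, 5]
number_modes = multimode(numbers)  # all values tied for the highest count

# The mode also works for categorical data, where a mean or median
# would not make sense. These colours are made up for illustration.
colours = ["red", "blue", "red", "green"]
colour_modes = multimode(colours)

print(number_modes)
print(colour_modes)
```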

