BIOSTATS 540 Fall 2022 1. Summarizing Data Page 1 of 50
Nature
Population/
Sample
Observation/
Data
Relationships/
Modeling
Unit 1
Summarizing Data
“It is difficult to understand why statisticians commonly limit their enquiries to averages, and do not revel in more
comprehensive views. Their souls seem as dull to the charm of variety as that of the native of one of our flat English
counties, whose retrospect of Switzerland was that, if its mountains could be thrown into its lakes, two nuisances would be
got rid of at once”
- Sir Francis Galton (England, 1822-1911)
This unit introduces variables and the variety of types of data possible (nominal, ordinal, interval, and ratio).
It also introduces numerical ways of summarizing data. Numerical summaries of data include those that describe central tendency (e.g., mode, mean, median), those that describe dispersion (e.g., range and standard deviation), and those that describe the shape of the distribution (e.g., 25th and 75th percentiles).
Graphical summaries (data visualizations) are introduced in the next unit, Unit 2.
[Cartoon: “Cheers!” Source: With permission, downloaded from CAUSEweb.org]
Table of Contents
Topics
1. Unit Roadmap
2. Learning Objectives
3. Variables and Types of Data
4. The Summation Notation
5. Numerical Summaries for Quantitative Data - Central Value
   a. The mode
   b. The mean
   c. The mean as a “balancing” point and skewness
   d. The mean of grouped data
   e. The median
6. Numerical Summaries for Quantitative Data - Dispersion
   a. Variance
   b. Standard Deviation
   c. Median Absolute Deviation from Median
   d. Standard Deviation v Standard Error
   e. A Feel for Sampling Distributions
   f. The Coefficient of Variation
   g. The Range
7. Some Other Important Numerical Summaries
   a. Frequencies, Relative Frequencies and More
   b. Percentiles
   c. Five Number Summary
   d. Interquartile Range, IQR
1. Unit Roadmap
[Roadmap diagram: Nature/Populations, Sample, Observation/Data, Relationships/Modeling, Analysis/Synthesis. Unit 1, Summarizing Data, is marked at Observation/Data.]
Observation and data are not the same. Think about it. What you record as data is only some of what you actually observe! Thus, data are the result of selection. Data are the values you obtain by measurement. A variable is something whose value can vary.
The purpose of summarizing data is to communicate the relevant aspects of the data.
Tip – Ask yourself, “what is relevant?”
Goal - Aim for summaries that are simple, clear, and thorough.
Tip – Avoid misleading summaries.
2. Learning Objectives
When you have finished this unit, you should be able to:
- Explain the distinction between a variable and a data value.
- Explain the distinction between qualitative and quantitative data.
- Identify the type of a variable from the variable and its data values.
- Understand and know how to compute: percentile, five number summary, and interquartile range, IQR.
- Understand and know how to compute summary measures of central tendency: mode, mean, median.
- Understand and know how to compute summary measures of dispersion: range, interquartile range, standard deviation, sample variance, standard error.
- Understand, at least in outline, the distinction between standard deviation and standard error. Note: We will discuss this again in Unit 3 (Probability Basics) and in Unit 5 (Populations and Samples).
- Understand the importance of the type of data and the shape of the data distribution when choosing which data summary to obtain.
HOMEWORK due Friday September 23, 2022
Question #1 of 5
Dear all. This question checks that you have read the syllabus! The solutions are in the syllabus.
a) Are the exams “in-class”/proctored or are they take-home?
b) How are the exam grades weighted in the final course grade determination?
c) How are the homeworks graded?
d) Is attendance in Zoom classes required?
e) Your course score is not determined by the columns in Blackboard. How is the course
score determined?
f) How are the final course letter grades determined?
g) Is it possible to obtain the exam questions early?
h) Are late homework and exam submissions allowed (yes or no)?
i) What is the policy on late homework and late exam submissions?
3. Variables and Types of Data
Data can be of different types, and it matters…
Variables versus Data
A variable is something whose value can vary. It is a characteristic that is being measured. Examples of variables
are:
- AGE
- SEX
- BLOOD TYPE
A data value is the value of a variable (“realization” - a number or text response) that you obtain upon
measurement. Examples of data values are:
- 54 years
- female
- A
Consider the following little data set that is stored in a spreadsheet:
Variables are the column headings: “subject”, “age”, “sex”, “bloodtype”.

subject   age   sex      bloodtype
1         54    female   A
2         32    male     B
3         24    female   AB

This data table (spreadsheet) has three observations (rows), four variables, and 12 data values.
Data values are the table cell entries: “54”, “female”, “A”, etc.
The different data types are distinct because the scales of measurement are distinct.
There are a variety of schemes for organizing data types. Nevertheless, they all capture the point that differences in scale of measurement are what distinguish data types.
Whitlock MC and Schluter D (“The Analysis of Biological Data, Second Edition”) classify data types as follows:

All Data Types
  Categorical (Qualitative), “attributes”
    - nominal: Values are “names” that are unordered categories. No units. Example: GENDER (male, female).
    - ordinal: Values are “names” that are ordered categories. No units. Example: PAIN LEVEL (mild, moderate, severe).
  Numerical (Quantitative), “numbers”
    - discrete: Values are integer values 0, 1, 2, … on a proper numeric scale. Counted units. Example: NUMBER OF VISITS (0, 1, 2, etc.).
    - continuous: Values are a measured number of units, including possible decimal values, on a continuous scale. Measured units. Example: WEIGHT (123.6 lbs, etc.).
The distinction between categorical versus numerical is straightforward:
Categorical: Attributes that do NOT have magnitude on a numerical scale
Numerical: Attributes or scores that DO have magnitude on a numerical scale
Example - To describe a flower as pretty is a categorical (qualitative) assessment while to record a child’s age as 11
years is a numerical (quantitative) measurement.
Consider this …
We can reasonably refer to the child’s 22 year old cousin as being twice as old as the 11 year old child whereas we
cannot reasonably describe an orchid as being twice as pretty as a dandelion.
We encounter similar stumbling blocks in statistical work. Depending on the type of the variable, its scale of
measurement type, some statistical methods are meaningful while others are not.
CATEGORICAL ►Nominal Scale: Values are names which cannot be ordered.
Example: Cause of Death
• Cancer
• Heart Attack
• Accident
• Other
Example: Gender
• Male
• Female
Example: Race/Ethnicity
• Black
• White
• Latino
• Other
Other Examples: Eye Color, Type of Car, University Attended, Occupation
CATEGORICAL ►Ordinal Scale: Values are attributes (names) that are naturally ordered.
Example: Size of Container
• Small
• Medium
• Large
Example: Pain Level
• None
• Mild
• Moderate
• Severe
For analysis in the computer, both nominal and ordinal data might be stored using numbers rather than text.
Example of nominal: Race/Ethnicity
• 1 = Black
• 2 = White
• 3 = Latino
• 4 = Other
Nominal – The numbers “1”, “2”, etc. have NO meaning. They are labels ONLY.
Example of ordinal: Pain Level
• 1 = None
• 2 = Mild
• 3 = Moderate
• 4 = Severe
Ordinal – The numbers have LIMITED meaning. 4 > 3 > 2 > 1 says ONLY that “severe” is worse than “moderate” and so on. You cannot do math on these!
NUMERICAL ►Discrete Scale: Values are counts of the number of times some event occurred and are thus
whole numbers: 0, 1, 2, 3, etc.
Examples:
Number of children a woman has had
Number of clinic visits made in one year
The numbers are meaningful. We can actually compute with these numbers.
NUMERICAL ► Continuous: A further classification of data types is possible for numerical data that are continuous: interval and ratio.
NUMERICAL ► Continuous → Interval (“no true zero”): Continuous interval data are generally measured on a continuum and differences between any two numbers on the scale are of known size but there is no true zero.
Example: Temperature in °F on 4 successive days
Day:       A    B    C    D
Temp °F:   50   55   60   65
“5 degrees difference” makes sense. For these data, not only is day A with 50° cooler than day D with 65°, but it is 15° cooler. Also, day A is cooler than day B by the same amount that day C is cooler than day D (i.e., 5°).
“0 degrees” cannot be interpreted as “absence of temperature”. In fact, we think of 0 °F as quite cold! It is a point on the scale fixed by convention; it is not the point at which molecular motion is at its minimum (that is absolute zero, the 0 of the Kelvin scale). So it’s not the same as “0 apples” or “0 Santa Claus”. Thinking about mathematics, for data that are continuous and interval (such as temperature in °F and clock time), the value “0” is arbitrary and doesn't reflect absence of the attribute.
NUMERICAL ► Continuous → Ratio (“meaningful zero”): Continuous ratio data are also measured on a
meaningful continuum. The distinction is that ratio data have a meaningful zero point.
Example: Weight in pounds of 6 individuals
136, 124, 148, 118, 125, 142
Note on meaningfulness of “ratio”-
Someone who weighs 142 pounds is two times as heavy as someone else
who weighs 71 pounds. This is true even if weight had been measured in kilograms.
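To see why a ratio is meaningful for weight but not for temperature in °F, the short Python sketch below recomputes each ratio after a change of units. The helper names `lb_to_kg` and `f_to_c` are illustrative only, and the pounds-per-kilogram factor is an approximation.

```python
# Ratio scale: the ratio of two weights is the same in pounds and kilograms.
LB_PER_KG = 2.20462  # approximate conversion factor (assumption)

def lb_to_kg(pounds):
    return pounds / LB_PER_KG

def f_to_c(deg_f):
    return (deg_f - 32) * 5 / 9

weight_ratio_lb = 142 / 71
weight_ratio_kg = lb_to_kg(142) / lb_to_kg(71)

# Interval scale: day D (65 °F) versus day A (50 °F) from the example above.
temp_ratio_f = 65 / 50
temp_ratio_c = f_to_c(65) / f_to_c(50)

print("weight ratio (lb, kg):", weight_ratio_lb, round(weight_ratio_kg, 6))
print("temperature ratio (°F, °C):", temp_ratio_f, round(temp_ratio_c, 3))
```

The weight ratio is unchanged by the change of units, while the temperature "ratio" depends on which scale was used, so it carries no meaning.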
In the sections that follow, we will see that the possibilities for meaningful description (tables, charts, means,
variances, etc) are lesser or greater depending on the scale of measurement.
The chart on the next page gives a sense of this idea.
For example, we’ll see that we can compute relative frequencies for a nominal random variable (e.g., hair color: “7% of the population has red hair”) but we cannot make statements about cumulative relative frequency for a nominal random variable (e.g., it would not make sense to say “35% of the population has hair color less than or equal to blonde”).
Chart showing data summarization methods, by data type:

Descriptive Methods (coming soon! Unit 2, Data Visualization)
- Nominal (categorical, “qualitative”): bar chart, pie chart
- Ordinal (categorical, “qualitative”): bar chart, pie chart
- Discrete (numerical, “quantitative”): bar chart, pie chart, dot diagram, scatter plot (2 variables), stem-leaf, histogram, box plot, quantile-quantile plot
- Continuous (numerical, “quantitative”): dot diagram, scatter plot (2 variables), stem-leaf, histogram, box plot, quantile-quantile plot

Numerical Summaries (this unit! Unit 1, Summarizing Data)
- Nominal: frequency, relative frequency
- Ordinal: frequency, relative frequency, cumulative frequency
- Discrete: frequency, relative frequency, cumulative frequency, means, variances, percentiles
- Continuous: means, variances, percentiles

Note: This table is an illustration only. It is not intended to be complete.
HOMEWORK Due Friday September 23, 2022
Question #2 of 5
For each of the following variables indicate whether it is quantitative or qualitative and specify the
measurement scale that is employed when taking measurements on each:
a) Class standing of members of this class relative to each other.
b) Admitting diagnosis of patients admitted to a mental health clinic.
c) Weights of babies born in a hospital during a year.
d) Gender of babies born in a hospital during a year.
e) Range of motion of elbow joint of students enrolled in a university health sciences
curriculum.
f) Under-arm temperature of day-old infants born in a hospital.
4. The Summation Notation
Why this?
The summation notation (and the product notation, by the way) is handy and lots of folks use it! So it’s good to
know. Quite simply, it is nothing more than a secretarial convenience. We use it to avoid having to write out long
expressions.
To get ourselves going,
- Here are five (5) values of age, all in years: 15, 31, 75, 52, and 84
- Now we “tag” or “index” them as follows: \(x_1 = 15\), \(x_2 = 31\), \(x_3 = 75\), \(x_4 = 52\), \(x_5 = 84\)
Here is how summation notation works:
Instead of writing the sum \(x_1 + x_2 + x_3 + x_4 + x_5\), we write
\[ \sum_{i=1}^{5} x_i \]
And here is how product notation works:
Instead of writing out the product of five terms \(x_1 \cdot x_2 \cdot x_3 \cdot x_4 \cdot x_5\), we write
\[ \prod_{i=1}^{5} x_i \]
Reading the summation notation:
- The Greek symbol sigma, \(\sum\), says “add up some items”
- Below the sigma symbol is the starting point (here, \(i = 1\))
- Up top is the ending point (here, 5)
Example – Consider the 5 values of age at the top of this page. Using summation notation, what is the sum of the 2nd, 3rd, and 4th values?
\[ \sum_{i=2}^{4} x_i = x_2 + x_3 + x_4 = 31 + 75 + 52 = 158 \]
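The same bookkeeping can be tried out in a few lines of Python. This is only a sketch using the five age values from this section; note that Python lists are indexed from 0, so the value tagged i sits at position i - 1.

```python
import math

x = [15, 31, 75, 52, 84]  # the five ages, tagged i = 1, ..., 5

# Sum over i = 1..5 (summation notation)
total = sum(x)

# Product over i = 1..5 (product notation)
product = math.prod(x)

# Sum over i = 2..4 only: range(2, 5) yields i = 2, 3, 4
partial = sum(x[i - 1] for i in range(2, 5))

print(total, product, partial)
```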
Additional Resources to Help you Learn Summation Notation
__1. Video. Youtube Tutorial (With apologies - the sound quality is not so great and there is an ad)
PatrickJMT. Summation Notation (Youtube: 10:15)
__2. Columbia University Tutorial
http://www.columbia.edu/itc/sipa/math/summation.html
__3. Khan Academy – Some Exercises to Practice What You Have Learned
https://www.khanacademy.org/math/algebra2/sequences-and-series/copy-of-sigma-notation/e/evaluating-basic-sigma-notation
HOMEWORK Due Friday September 23, 2022
Question #3 of 5
Let \(x_1 = 3\), \(x_2 = 1\), \(x_3 = 4\), and \(x_4 = 6\)
3a. Express the following sum in sigma notation and evaluate numerically.
\[ (x_1 + x_2 + x_3 + x_4)^2 \]
3b. Express the following sum in sigma notation and evaluate numerically.
\[ x_1^2 + x_2^2 + x_3^2 + x_4^2 \]
3c. Evaluate the following numerically.
\[ \sum_{i=1}^{4} (x_i - 1)^2 \]
3d. Evaluate the following numerically.
\[ \sum_{i=1}^{4} 3x_i \]
5. Numerical Summaries for Quantitative Data - Central Tendency
Among the important tools of description are those that address
- What is typical (location or central tendency)
- What is the scatter (dispersion)
Recall - “Good” choices for summarizing location and dispersion are not always the same and depend on the pattern of scatter.
[Figure: three example patterns of scatter: symmetric, skewed, and bimodal.]
Mode. The mode is the most frequently occurring value. It is not influenced by extreme values. Often, it is not a
good summary of the majority of the data.
Mean. The mean is the arithmetic average of the values. It is sensitive to extreme values.
\[ \text{Mean} = \frac{\text{sum of values}}{\text{sample size}} = \frac{\sum (\text{values})}{n} \]
Median. The median is the middle value when the sample size is odd. For samples of even sample size, it is the
average of the two middle values. It is not influenced by extreme values.
We consider each one in a bit more detail …
5a. Mode
Mode. The mode is the most frequently occurring value. It is not influenced by extreme values. Often, it is not a
good summary of the majority of the data.
Example
Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
Mode is 4
Example
Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
There are two modes – value 2 and value 5
This distribution is said to be “bimodal”
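As a quick check of the two examples above, Python's standard library function `statistics.multimode` returns every value that ties for most frequent. This is only an illustrative sketch; the variable names are made up.

```python
from statistics import multimode

first_example = [1, 2, 3, 4, 4, 4, 4, 5, 5, 6]
second_example = [1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8]

# multimode returns a list, so a bimodal data set yields two values
modes_first = multimode(first_example)
modes_second = multimode(second_example)

print("modes of first example:", modes_first)
print("modes of second example:", modes_second)
```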
Modal Class
For grouped data, it may be possible to speak of a modal class
The modal class is the class with the largest frequency
Example – Data set of n=80 values of age (years)

Interval/Class of Values (age, years)   Frequency, f (# times)
31-40                                    1
41-50                                    2
51-60                                    5
61-70                                   15
71-80                                   25
81-90                                   20
91-100                                  12
The modal class is the interval of values 71-80 years of age, because values in this range occurred the
most often (25 times) in our data set.
5b. Mean
Mean. The mean is the arithmetic average of the values. It is sensitive to extreme values.
\[ \text{Mean} = \frac{\text{sum of values}}{\text{sample size}} = \frac{\sum (\text{values})}{n} \]
Examples: Calculation of a “mean” or “average” is familiar; e.g.,
- grade point average
- mean annual rainfall
- average weight of a catch of fish
- average family size for a region
A closer look using the summation notation introduced in Section 4
Suppose data are: 90, 80, 95, 85, 65
Sample size, \(n = 5\)
\(x_1 = 90\), \(x_2 = 80\), \(x_3 = 95\), \(x_4 = 85\), \(x_5 = 65\)
\(\bar{X}\) = sample mean
\[ \bar{X} = \frac{\sum_{i=1}^{5} x_i}{n} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} = \frac{90 + 80 + 95 + 85 + 65}{5} = \frac{415}{5} = 83 \]
5c. The Mean as a “Balancing Point” and Introduction to Skewness
The mean can be thought of as a “balancing point” or “center of gravity”.
[Figure: the values 65, 80, 85, 90, 95 placed on a scale that balances at 83.]
By “balance”, it is meant that the sum of the departures from the mean to the left balances out the sum of the departures from the mean to the right.
In this example, the sample mean is \(\bar{X} = 83\).
Sum of departures from the mean to the LEFT: (83-65) + (83-80) = 21
Sum of departures from the mean to the RIGHT: (85-83) + (90-83) + (95-83) = 21
TIP!! Often, the value of the sample mean is not one that is actually observed.
Skewness
When the data are skewed, the mean is “dragged” in the direction of the skewness.
- Negative Skewness (Left tail): mean is dragged left
- Positive Skewness (Right tail): mean is dragged right
[Figure: negatively and positively skewed distributions. Source: https://alevelmaths.co.uk/statistics/skewness/]
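The balancing-point property of the mean in this section can be checked directly. The sketch below uses the five values from the example (65, 80, 85, 90, 95); the variable names are illustrative.

```python
data = [90, 80, 95, 85, 65]

mean = sum(data) / len(data)

# Total distance from the mean for values below it, and for values above it
left = sum(mean - x for x in data if x < mean)
right = sum(x - mean for x in data if x > mean)

print("mean:", mean)
print("departures to the left:", left)
print("departures to the right:", right)
```

Because the two totals agree, the signed deviations from the mean always sum to zero, which is another way of stating the same property.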
5d. The Mean of Grouped Data
Sometimes, data values occur multiple times and it is more convenient to group the data than to list the multiple occurrences of “like” values individually.
The calculation of the sample mean in the setting of grouped data is an extension of the formula for the mean that you have already learned.
Each unique data value is multiplied by the frequency with which it occurs in the sample.
\[ \text{Grouped mean} = \frac{\sum (\text{data value})(\text{frequency of data value})}{\sum (\text{frequencies})} = \frac{\sum_{i=1}^{k} (f_i)(X_i)}{\sum_{i=1}^{k} (f_i)} \]
Here \(k\) is the number of distinct values, and the sum of the frequencies is the sample size.
Example

Value of variable X:      \(X_1 = 96\)   \(X_2 = 84\)   \(X_3 = 65\)   \(X_4 = 73\)   \(X_5 = 94\)
Frequency in sample:      \(f_1 = 20\)   \(f_2 = 20\)   \(f_3 = 20\)   \(f_4 = 10\)   \(f_5 = 30\)

\[ \text{Grouped mean} = \frac{(20)(96) + (20)(84) + (20)(65) + (10)(73) + (30)(94)}{(20) + (20) + (20) + (10) + (30)} = \frac{8450}{100} = 84.5 \]
Tip: The weighted mean is often used to estimate the mean of a sample of data that have been summarized in a frequency table. The values used are the interval midpoints. The weights used are the interval frequencies.
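The grouped-mean calculation can be reproduced with a short Python sketch. The values and frequencies are those of the example in this section; the variable names are illustrative.

```python
values = [96, 84, 65, 73, 94]       # distinct data values X_i
frequencies = [20, 20, 20, 10, 30]  # how often each value occurs, f_i

# Numerator: each value weighted by its frequency
weighted_total = sum(f * x for f, x in zip(frequencies, values))

# Denominator: the sample size is the sum of the frequencies
sample_size = sum(frequencies)

grouped_mean = weighted_total / sample_size
print(weighted_total, sample_size, grouped_mean)
```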
5e. The Median
Median. The median is the middle value when the sample size is odd. For samples of even sample size, it is the average of the two middle values. It is not influenced by extreme values. Recall:
If the sample size n is ODD
\[ \text{median} = \left[ \frac{n+1}{2} \right]^{\text{th}} \text{ largest value} \]
If the sample size n is EVEN
\[ \text{median} = \text{average of } \left( \left[ \frac{n}{2} \right]^{\text{th}}, \left[ \frac{n+2}{2} \right]^{\text{th}} \right) \text{ largest values} \]
Example
Data, from smallest to largest, are: 1, 1, 2, 3, 7, 8, 11, 12, 14, 19, 20
The sample size, n=11
Median is the \(\frac{n+1}{2} = \frac{12}{2}\) = 6th largest value
Thus, reading from left (smallest) to right (largest), the median value is = 8
1, 1, 2, 3, 7, 8, 11, 12, 14, 19, 20
Five values are smaller than 8; five values are larger.
Example
Data, from smallest to largest, are: 2, 5, 5, 6, 7, 10, 15, 21, 22, 23, 23, 25
The sample size, n=12
Median = average of [ \(\frac{n}{2}\)th largest, \(\frac{n+2}{2}\)th largest ] = average of 6th and 7th largest values
Thus, median value is = the average of [10, 15] = 12.5
Skewed Data – When the data are skewed, the median is a better description of the majority than the mean.
Example
Data are: 14, 89, 93, 95, 96
Skewness is reflected in the outlying low value of 14
The sample mean is 77.4
The median is 93
Negative Skewness (Left tail): MEAN < MEDIAN. The mean is dragged left.
Positive Skewness (Right tail): MEAN > MEDIAN. The mean is dragged right.
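The three median examples in this section can be checked with Python's standard `statistics` module. This is a sketch for checking the arithmetic; the variable names are illustrative.

```python
from statistics import mean, median

odd_sample = [1, 1, 2, 3, 7, 8, 11, 12, 14, 19, 20]          # n = 11
even_sample = [2, 5, 5, 6, 7, 10, 15, 21, 22, 23, 23, 25]    # n = 12
skewed_sample = [14, 89, 93, 95, 96]                         # low outlier

median_odd = median(odd_sample)      # the single middle value
median_even = median(even_sample)    # average of the two middle values

# With a low outlier, the mean is dragged below the median
skewed_mean = mean(skewed_sample)
skewed_median = median(skewed_sample)

print(median_odd, median_even, skewed_mean, skewed_median)
```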
6. Numerical Summaries for Continuous Data - Dispersion
There are choices for describing dispersion, too. As before, a “good” choice will depend on the shape of the
distribution.
[Figure: three example distribution shapes: symmetric, skewed, and bimodal.]
6a. Variance
Two quick reminders: (1) a parameter is a numerical fact about a population (eg – the average age of every citizen
in the United States population); (2) a statistic is a number calculated from a sample (eg – the average age of a
random sample of 50 citizens).
Population Mean, \(\mu\). One example of a parameter is the population mean. It is written as \(\mu\) and, for a finite sized population, it is the average of all the values for a variable, taken over all the members of the population.
Population Variance, \(\sigma^2\). The population variance is also a parameter. It is written as \(\sigma^2\) and is a summary measure of the squares of individual departures from the mean in a population. If we’re lucky and we’re dealing with a population that is finite in size (yes, it’s theoretically possible to have a population of infinite size … more on this later) and of size N, there exists a formula for population variance. This formula makes use of the mean of the population, which is represented as \(\mu\).
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \]
How to interpret the population variance: It is the average of the individual squared deviations from the mean. Think of it as answering the question “Typically, how scattered are the individual data points?”
Sample variance, \(S^2\). A sample variance is a statistic; thus, it is a number calculated from the data in a sample. The sample variance is written as \(S^2\) and is a summary measure of the squares of individual departures from the sample mean in a sample. For a simple random sample of size n (recall – we use the notation “N” when we speak of the size of a finite population and we use the notation “n” when we speak of the size of a sample)
\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{(n-1)} \]
Notice that the formula for the sample variance is very similar to the formula for a finite population variance: (1) N is replaced by (n-1) and (2) \(\mu\) is replaced by \(\bar{X}\). The idea here is
- We are replacing the population size N by the “sample size minus 1” (n-1)
- We are replacing the population mean \(\mu\) with the sample mean \(\bar{X}\)
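Why (n-1) and not simply (n):
This has to do with the long run average of \(S^2\) being equal to its target, the population variance \(\sigma^2\). Dividing by n instead would, on average over many samples, give a value that is too small.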
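One way to get a feel for this is to enumerate every possible sample from a tiny population and average the resulting sample variances. The sketch below is an illustration under stated assumptions, not a proof: it treats the five ages from Section 4 as an entire (hypothetical) population, draws samples of size n = 2 with replacement, and uses exact fractions so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

population = [15, 31, 75, 52, 84]   # assumption: treated as a whole population
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance

n = 2
samples = list(product(population, repeat=n))  # all N**n ordered samples

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

# Long-run average of the two candidate formulas over all possible samples
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print("population variance:", float(sigma2))
print("average using (n-1):", float(avg_div_n_minus_1))
print("average using n:    ", float(avg_div_n))
```

Under these assumptions the (n-1) version averages out to the population variance exactly, while the version that divides by n falls short by the factor (n-1)/n.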
6b. Standard Deviation
Standard Deviation, \(\sigma\). The population standard deviation (\(\sigma\)) and a sample standard deviation (S or SD) are the square roots of \(\sigma^2\) and \(S^2\). As such, they are additional choices for summarizing variability. The advantage of the square root operation is that the resulting summary has the same scale as the original values.
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}, \qquad \sigma = \sqrt{\sigma^2} \]
Sample Standard Deviation (S or SD)
- Disparity between individual and average: \( (X - \bar{X}) \)
- Disparity between individual and average, squared: \( (X - \bar{X})^2 \)
- The average of these: \( \frac{\sum (X - \bar{X})^2}{n} \)
- The sample variance \(S^2\) is an “almost” average: \( S^2 = \frac{\sum (X - \bar{X})^2}{n-1} \)
- The related measure S (or SD) returns the measure of dispersion to the original scale of observation: \( S \text{ or } SD = \sqrt{\frac{\sum (X - \bar{X})^2}{n-1}} \)
Example of Sample Variance (\(S^2\)) and Standard Deviation (S) Calculation –
Consider the following sample of survival times (X) of n=11 patients after heart transplant surgery. Interest is to calculate the sample variance and standard deviation.
- Patients are identified numerically, from 1 to 11.
- The survival time for the “ith” patient is represented as \(X_i\) for i = 1, …, 11.

Patient          Survival (days),   Mean for sample,   Deviation,             Squared deviation,
identifier, “i”  \(X_i\)            \(\bar{X}\)        \((X_i - \bar{X})\)    \((X_i - \bar{X})^2\)
1                135                161                 -26                      676
2                 43                161                -118                    13924
3                379                161                 218                    47524
4                 32                161                -129                    16641
5                 47                161                -114                    12996
6                228                161                  67                     4489
7                562                161                 401                   160801
8                 49                161                -112                    12544
9                 59                161                -102                    10404
10               147                161                 -14                      196
11                90                161                 -71                     5041
TOTAL           1771                                      0                   285236

Dear Class,
Are you new to this sort of table and not quite sure how to navigate? No worries! Consider the first row:
Key: Patient #1 is the person for whom i=1. Patient #1 survived 135 days. So we write \(X_1 = 135\). And so on….
- \( \sum_{i=1}^{11} X_i = 1771 \) days
- Sample mean is \( \bar{X} = \frac{1771}{11} = 161 \) days
- Sample variance is \( S^2 = \frac{\sum_{i=1}^{11} (X_i - \bar{X})^2}{n-1} = \frac{285236}{10} = 28523.6 \) days\(^2\)
- Sample standard deviation is \( S = \sqrt{S^2} = \sqrt{28523.6} = 168.89 \) days
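The hand calculation for the heart transplant example can be retraced step by step in Python. This is a sketch that mirrors the columns of the worked table; the variable names are illustrative.

```python
import math

survival_days = [135, 43, 379, 32, 47, 228, 562, 49, 59, 147, 90]
n = len(survival_days)

total = sum(survival_days)
mean = total / n

# Deviation and squared-deviation columns of the table
deviations = [x - mean for x in survival_days]
squared_deviations = [d ** 2 for d in deviations]
sum_sq = sum(squared_deviations)

variance = sum_sq / (n - 1)   # sample variance, divisor (n-1)
sd = math.sqrt(variance)      # back on the original scale (days)

print("total:", total, "mean:", mean)
print("sum of deviations:", sum(deviations))
print("sum of squared deviations:", sum_sq)
print("variance:", variance, "SD:", round(sd, 2))
```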
6c. Median Absolute Deviation About the Median (MADM)
Median Absolute Deviation about the Median (MADM) - Another measure of variability is helpful when we wish to describe scatter among data that are skewed.
Recall that the median is a good measure of location for skewed data because it is not sensitive to extreme values.
- Distances are measured about the median, not the mean.
- We compute absolute deviations rather than squared differences.
Thus
Median Absolute Deviation about the Median (MADM)
\[ \text{MADM} = \text{median of } \left[ \, | X_i - \text{median of } \{ X_1, \ldots, X_n \} | \, \right] \]
Example.
Original data: { 0.7, 1.6, 2.2, 3.2, 9.8 }
Median = 2.2

\(X_i\)    \(| X_i - \text{median} |\)
0.7        1.5
1.6        0.6
2.2        0.0
3.2        1.0
9.8        7.6

MADM = median { 0.0, 0.6, 1.0, 1.5, 7.6 } = 1.0
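The two-step recipe (median first, then the median of the absolute deviations) translates directly into Python. This is a sketch using the example data; because the values are decimals stored in floating point, the printed result is rounded.

```python
from statistics import median

data = [0.7, 1.6, 2.2, 3.2, 9.8]

# Step 1: the median of the original data
center = median(data)

# Step 2: absolute deviation of each value from that median
abs_deviations = [abs(x - center) for x in data]

# Step 3: the median of those absolute deviations
madm = median(abs_deviations)

print("median:", center)
print("MADM:", round(madm, 4))
```

Notice that the extreme value 9.8 contributes a large deviation (7.6) but does not move the MADM, which is the point of using medians at both steps.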
6d. Standard Deviation (S or SD) versus Standard Error (SE)
Tip: The standard deviation (S or SD) and the standard error (SE or SEM) are often confused.
The standard deviation (SD or S) addresses questions about variability of individuals in nature (imagine a collection of individuals), whereas
the standard error (SE) addresses questions about the variability of a summary statistic among many replications of your study (imagine a collection of values of a sample statistic such as the sample mean that is obtained by repeating your whole study over and over again).
The distinction has to do with the idea of sampling distributions, which are introduced in Section 6e (stay tuned!) and which are re-introduced several times throughout this course. Consider the following illustration of the idea.
Example
Suppose you conduct a study that involves obtaining a simple random sample of size n=11. Suppose further that, from this one sample, you calculate the sample mean (note – you might have calculated other sample statistics, too, such as the median or sample variance). Now imagine replicating the entire study 5000 times. You would then have 5,000 sample means, each based on a sample of size n=11.
If instead of replicating your study 5000 times, the study were replicated infinitely many times, the resulting collection of infinitely many sample means has a name: the sampling distribution of \(\bar{X}_{n=11}\). Notice the subscript “n=11”. This is a reminder to us that the particular study design that we have replicated infinitely many times calls for drawing a sample of size n=11 each time.
So what? Why do we care?
We care because, often, we’re interested in knowing if the results of our one study are similar to what would be obtained if someone else were to repeat it!
Distinction between Standard Deviation (S or SD) and Standard Error of the Mean (SE or SEM).
We’re often interested in the (theoretical) behavior of the sample mean \(\bar{X}_n\) from one replication of our study to the next.
So, whereas the typical variability among individual values can be described using the standard deviation (SD), the typical variability of the sample mean from one replication of the study to the next is described using the standard error (SE) of the mean:
\[ SE(\bar{X}) = \frac{SD}{\sqrt{n}} \]
Note – A limitation of the SE is that it is a function of both the natural variation (SD in the numerator) and the study design (n in the denominator). More on this later!
Example, continued
Previously, we summarized the results of one study that enrolled n=11 patients after heart transplant surgery. For that one study, we obtained an average survival time of \(\bar{X} = 161\) days.
What happens if we repeat the study? What will our next \(\bar{X}\) be? Will it be close? How different will it be? We care about this question because it pertains to the generalizability of our study findings.
The behavior of \(\bar{X}\) from one replication of the study to the next replication of the study is referred to as the sampling distribution of \(\bar{X}\).
(We could just as well have asked about the behavior of the median from one replication to the next (sampling distribution of the median) or the behavior of the SD from one replication to the next (sampling distribution of SD).)
Thus, interest is in a measure of the “noise” that accompanies \(\bar{X} = 161\) days. The measure we use is the standard error. This is denoted SE. For this example, in the heart transplant study
\[ SE(\bar{X}) = \frac{SD}{\sqrt{n}} = \frac{168.89}{\sqrt{11}} = 50.9 \]
We interpret this to mean that a similarly conducted study might produce an average survival time that is near 161 days, give or take 50.9 days.
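For readers who want to verify the standard error, the following Python sketch recomputes it from the eleven survival times of the heart transplant example. It uses `statistics.stdev`, which applies the (n-1) divisor; the variable names are illustrative.

```python
import math
from statistics import mean, stdev

survival_days = [135, 43, 379, 32, 47, 228, 562, 49, 59, 147, 90]
n = len(survival_days)

xbar = mean(survival_days)   # sample mean (days)
sd = stdev(survival_days)    # sample standard deviation, divisor (n-1)
se = sd / math.sqrt(n)       # standard error of the mean

print("mean:", xbar)
print("SD:", round(sd, 2))
print("SE:", round(se, 1))
```

The SD describes scatter among the 11 patients; the SE, smaller by the factor \(\sqrt{n}\), describes the expected scatter of the sample mean across repeated studies of the same size.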
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 34 of 50
6e. A Feel for Sampling Distributions
Top Picture
Here is a population of individuals in nature. Each individual in this population has their own value
of some variable X. To make this concrete, suppose X = 2021 income ($).
Suppose we want to know the average income in this population:
\(\mu\) = average income ($) over entire population
Bottom Picture
Here we imagine 3 separate samples drawn from the population, each with sample size = n.
We have 3 separate averages, one per sample:
Sample: \(\bar{X}_n\) = sample average
Sample: \(\bar{X}_n\) = sample average
……
Sample: \(\bar{X}_n\) = sample average
Here \(\bar{X}_n\) = average 2021 income in the sample, each based on a sample size of n.
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 35 of 50
Source/Population Distribution. The source/population distribution is the pattern of scatter of the individual
incomes x among the entirety of individuals in the population in nature. It could be anything! Here are four (4)
possibilities. In each picture: x-axis (horizontal) = income and y-axis (vertical) = how often
• The incomes x are distributed symmetrically about some central value
• Or, maybe the distribution is mostly symmetric but there are some with really large incomes (tail to the right)
• Or, maybe for every distinct income, the proportion with that income is the same
• Or, maybe the distribution of income is just some weird pattern
Source: https://mat117.wisconsin.edu/wp-content/uploads/2014/12/section7-1.png
Distribution of all possible sample means.
Hack: When we talk about all possible individuals in nature, we refer to this as the population distribution of
X. But when we talk about all possible sample means \(\bar{X}_n\), we refer to this as the sampling distribution of \(\bar{X}_n\).
Sampling Distribution of \(\bar{X}_n\) = Collecting together all possible \(\bar{X}_n\)
This is the result of repeating the sampling game over and over and over forever…
Sampling Distribution. The sampling distribution refers to the distribution of some calculated statistic, taken
over all possible samples drawn from the source population in nature. Here we are talking about the sampling
distribution of the sample mean \(\bar{X}_n\).
Collection of all possible \(\bar{X}_n\) when sample size for each sampling is n=5
Collection of all possible \(\bar{X}_n\) when sample size for each sampling is n=10
Collection of all possible \(\bar{X}_n\) when sample size for each sampling is n=30
Source: https://mse.redwoods.edu/darnold/math15/UsingRInStatistics/CentralLimit3.php
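The "sampling game" can be imitated on a computer. The sketch below is an illustration under stated assumptions, not part of the course materials: the right-skewed "income" population is simulated with made-up values, 2000 replications stand in for "forever", and the names `population`, `sample_means` and `spread` are hypothetical.

```python
import random
import statistics

random.seed(540)  # fixed seed so the illustration is repeatable

# Hypothetical right-skewed "income" population (made-up values, in $)
population = [random.expovariate(1 / 50000) for _ in range(20000)]
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

def sample_means(pop, n, replications=2000):
    """Draw `replications` random samples of size n; return each sample's mean."""
    return [statistics.mean(random.sample(pop, n)) for _ in range(replications)]

spread = {}
for n in (5, 10, 30):
    means = sample_means(population, n)
    spread[n] = statistics.stdev(means)  # SD of the collected sample means
    # Columns: n, mean of the sample means, their SD, and sigma / sqrt(n)
    print(n, round(statistics.mean(means)), round(spread[n]), round(sigma / n ** 0.5))
```

In each row the mean of the sample means should sit near the population mean `mu`, and their SD should sit near sigma / sqrt(n), shrinking as n grows.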
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 36 of 50
There are lots of sampling distributions, actually.
Sampling distribution of the mean. So far, we have imagined calculating the sample mean for a sample of data.
From there, we imagined all possible sample means that would be produced by doing the sampling of size=n over
and over and over forever, each time calculating a new sample mean. When we collected them all, the result
was the sampling distribution of the mean.
Sampling distribution of the variance. But we might have instead calculated the sample variance for a sample of
data. From there, we can just as easily imagine all possible sample variances that would be produced by doing the
sampling of size=n over and over forever, each time calculating a new sample variance.
And so on and so on… “sampling distribution of the median”, “sampling distribution of the estimated slope”, you
get the idea…
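The same repeated-sampling idea works for any statistic. As a rough sketch, and not the author's method, the code below collects the sample variance and the sample median over many samples. The Uniform(0, 1) population, the sample size n=10, the 5000 replications and the helper name `sampling_distribution` are all assumptions made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

def sampling_distribution(statistic, n, replications=5000):
    """Collect `statistic` over many samples of size n drawn from Uniform(0, 1)."""
    return [statistic([random.random() for _ in range(n)])
            for _ in range(replications)]

n = 10
variances = sampling_distribution(statistics.variance, n)
medians = sampling_distribution(statistics.median, n)

# The SD of each collection plays the role of the standard error of that statistic
print(round(statistics.mean(variances), 3), round(statistics.stdev(variances), 3))
print(round(statistics.mean(medians), 3), round(statistics.stdev(medians), 3))
```

For this population the true variance is 1/12 and the true median is 0.5, so the two collections should centre near those values.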
Another perspective on standard deviation versus standard error is the following:
Standard Deviation
Describes variation in values of individuals.
In the population of individuals: \(\sigma\)
Our “guess” is S

Standard Error
Describes variation in values of a statistic from one
conduct of study to the next.
Often, it is the variation in the sample mean that
interests us.
In the population of all possible sample means
(“sampling distribution of mean”): \(\sigma / \sqrt{n}\)
Our “guess” of the SE of the sample mean is \(S / \sqrt{n}\)
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 37 of 50
6f. The Coefficient of Variation
The coefficient of variation is the ratio of the standard deviation to the mean of a distribution.
It is a measure of the spread of the distribution relative to the
mean of the distribution.
In the population, the coefficient of variation is denoted \(\xi\) and is defined
\[ \xi = \frac{\sigma}{\mu} \]
The coefficient of variation can be estimated from a sample, using the
hat notation to indicate “guess”. It is also denoted cv:
\[ \hat{\xi} = cv = \frac{S}{\bar{X}} \]
Example – “Cholesterol is more variable than systolic blood pressure”
Variable                    SD (S)      Mean (\(\bar{X}\))   cv = S/\(\bar{X}\)
Systolic Blood Pressure     15 mm Hg    130 mm Hg    .115
Cholesterol                 40 mg/dl    200 mg/dl    .200
Example – “Diastolic is relatively more variable than systolic blood pressure”
Variable                    SD (S)      Mean (\(\bar{X}\))   cv = S/\(\bar{X}\)
Systolic Blood Pressure     15 mm Hg    130 mm Hg    .115
Diastolic Blood Pressure    8 mm Hg     60 mm Hg     .133
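The cv values in the two examples are simple ratios, which a short Python sketch can confirm. The (SD, mean) pairs are the ones quoted in the examples; the function and variable names are illustrative only.

```python
def coefficient_of_variation(sd, mean):
    """Sample coefficient of variation: standard deviation divided by the mean."""
    return sd / mean

# (SD, mean) pairs copied from the two examples
summaries = {
    "Systolic blood pressure": (15, 130),
    "Cholesterol": (40, 200),
    "Diastolic blood pressure": (8, 60),
}
cv = {name: coefficient_of_variation(sd, mean)
      for name, (sd, mean) in summaries.items()}
for name, value in cv.items():
    print(f"{name}: cv = {value:.3f}")
```

Because the units cancel in the ratio, cv lets us compare variables measured on different scales (mg/dl versus mm Hg).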
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 38 of 50
6g. The Range
The range is the difference between the largest and smallest values in a data set.
It is a quick measure of scatter but not a very good one.
Calculation utilizes only two of the available observations.
As more observations are added to a sample, the range can only stay the same or increase. Thus, the range is sensitive to sample size.
The range is an unstable measure of scatter compared to alternative summaries of
scatter (e.g. S or MADM)
HOWEVER – when the sample size is very small, it may be a better measure of scatter
than the standard deviation S.
Example –
Data values are 5, 9, 12, 16, 23, 34, 37, 42
range = 42-5=37
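A small Python sketch of the same calculation, which also illustrates why the range is sensitive to sample size. The extra observations 50 and 20 are made-up values used only for illustration.

```python
values = [5, 9, 12, 16, 23, 34, 37, 42]
data_range = max(values) - min(values)
print(data_range)  # 37

# Adding an observation can widen the range but never narrow it
wider = max(values + [50]) - min(values + [50])  # new value outside the old extremes
same = max(values + [20]) - min(values + [20])   # new value inside the old extremes
print(wider, same)  # 45 37
```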
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 39 of 50
HOMEWORK DUE Friday September 23, 2022
Question #4 of 5
The following are behavioral ratings as measured by the Zang Anxiety Scale (ZAS) for 26 persons with
a diagnosis of panic disorder:
53 51 46 45 40 35
59 51 45 60 35
45 38 53 43 31
36 40 41 41 38
69 41 46 38 36
4a. By any means you like, compute the mean, median, mode, range, variance, and
standard deviation, and the 25th and 75th percentiles.
Tip!!!!!!! See page 50 before you start!!!!
4b. The following are behavioral ratings as measured by the Zang Anxiety Scale (ZAS) for
21 healthy controls:
26 26 25 25 25
28 26 26 25
34 30 31 28
26 34 25 25
25 28 25 25
By any means you like, compute the mean, median, mode, range, variance, and
standard deviation, and the 25th and 75th percentiles.
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 40 of 50
7. Some Other Important Numerical Summaries
In this section I consider both categorical and numerical data.
Example - Consider a study of 25 consecutive patients entering the general medical/surgical intensive
care unit at a large urban hospital.
• For each patient the following data are collected:
Variable Label (Variable) Code
• Age, years (AGE)
• Type of Admission (TYPE_ADM): 1= Emergency
0= Elective
• ICU Type (ICU_TYPE): 1= Medical
2= Surgical
3= Cardiac
4= Other
• Systolic Blood Pressure, mm Hg (SBP)
• Number of Days Spent in ICU (ICU_LOS)
• Vital Status at Hospital Discharge (VIT_STAT): 1= Dead
0= Alive
The actual data are provided on the following page.
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 41 of 50
ID   Age   Type_Adm   ICU_Type   SBP   ICU_LOS   Vit_Stat
1    15    1          1          100   4         0
2    31    1          2          120   1         0
3    75    0          1          140   13        1
4    52    0          1          110   1         0
5    84    0          4          80    6         0
6    19    1          1          130   2         0
7    79    0          1          90    7         0
8    74    1          4          60    1         1
9    78    0          1          90    28        0
10   76    1          1          130   7         0
11   29    1          2          90    13        0
12   39    0          2          130   1         0
13   53    1          3          250   11        0
14   76    1          3          80    3         1
15   56    1          3          105   5         1
16   85    1          1          145   4         0
17   65    1          1          70    10        0
18   53    0          2          130   2         0
19   75    0          3          80    34        1
20   77    0          1          130   20        0
21   52    0          2          210   3         0
22   19    0          1          80    1         1
23   34    0          3          90    3         0
24   56    0          1          185   3         1
25   71    0          2          140   1         1
Categorical (Qualitative) data:
Type of Admission (Type_Adm)
ICU Type (ICU_Type)
Vital Status at Hospital Discharge (Vit_Stat)
Numerical (Quantitative) data:
Age, years (Age)
Number of days spent in ICU (ICU_LOS)
Systolic blood pressure (SBP)
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 42 of 50
7a. Frequencies, Relative Frequencies and More
For nominal variables, we can compute frequencies and relative frequencies.
A tally of the possible outcomes, together with “how often” and “proportionately often” is called a frequency and
relative frequency distribution.
• Appropriate for nominal, ordinal, and count data types.
• For the variable ICU_Type, the frequency distribution is the following:
Frequency & Relative Frequency Table
ICU_Type    Frequency (“how often”)    Relative Frequency (“proportionately often”)
Medical     12                         0.48
Surgical    6                          0.24
Cardiac     5                          0.20
Other       2                          0.08
TOTAL       25                         1.00
• This summary will be useful in constructing two graphical displays, the bar chart and the pie
chart.
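The ICU_Type tally can be reproduced with a short Python sketch. The codes are copied, in patient ID order, from the ICU data listing earlier in this section; `Counter` comes from Python's standard library, and the variable names are illustrative.

```python
from collections import Counter

# ICU_Type codes for patients 1-25, copied from the ICU data listing
icu_type = [1, 2, 1, 1, 4, 1, 1, 4, 1, 1, 2, 2, 3, 3, 3,
            1, 1, 2, 3, 1, 2, 1, 3, 1, 2]
labels = {1: "Medical", 2: "Surgical", 3: "Cardiac", 4: "Other"}

counts = Counter(icu_type)
n = len(icu_type)

# label -> (frequency, relative frequency)
table = {labels[code]: (counts[code], counts[code] / n) for code in sorted(labels)}
for label, (freq, rel_freq) in table.items():
    print(f"{label:<10}{freq:>4}{rel_freq:>8.2f}")
```

The relative frequency is just the frequency divided by the total n = 25, so the relative frequencies add up to 1.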
For ordinal variables, we can compute frequencies and relative frequencies + cumulative frequencies and
cumulative relative frequencies.
The Glasgow Coma Score (GCS) measures severity of a coma on an ordinal scale, with lower values corresponding
to greater severity of coma. Suppose we have this information for 35 patients. The following table tallies the
number of patients with each GCS score (frequency) together with the proportion of the sample of patients with
each score (relative frequencies). But it also tallies what are called cumulative tallies and allows us to answer such
questions as “how many patients have a GCS score of 5 or less?” (cumulative frequency) and “what proportion
of the sample have GCS scores of 5 or less?” (cumulative relative frequency)
Frequency, Cumulative Frequency, Relative Frequency, Cumulative Relative Frequency Table
GCS Score   Frequency        Relative Frequency           Cumulative    Cumulative
            (“how often”)    (“proportionately often”)    Frequency     Relative Frequency
3           10               .286                         10            .286
4           5                .143                         15 (=10+5)    .429
5           6                .171                         21            .600
6           2                .057                         23            .657
7           12               .343                         35            1.000
TOTAL       35
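Cumulative tallies are running sums, which the following Python sketch makes explicit. The scores and frequencies are the ones given for the 35 patients; `itertools.accumulate` is a standard-library helper, and the variable names are illustrative.

```python
from itertools import accumulate

# GCS scores and their frequencies for the 35 patients
scores = [3, 4, 5, 6, 7]
freq = [10, 5, 6, 2, 12]

n = sum(freq)
cum_freq = list(accumulate(freq))          # running total of the frequencies
rel_freq = [f / n for f in freq]
cum_rel_freq = [c / n for c in cum_freq]   # running total divided by n

for row in zip(scores, freq, rel_freq, cum_freq, cum_rel_freq):
    print("{:>3}{:>5}{:>8.3f}{:>5}{:>8.3f}".format(*row))
```

Reading off the row for score 5 answers the two questions posed above: 21 patients, or a proportion of .600, have a GCS score of 5 or less.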
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 43 of 50
HOMEWORK DUE Friday September 23, 2022
Question #5 of 5
The following table shows the age distribution of cases of a certain disease reported during a year in a
particular state.
___________________________________________________________
Age Number of Cases
____________________________________________________________
5-14 5
15-24 10
25-34 20
35-44 22
45-54 13
55-64 5
______________________________________________________________
TOTAL 75
5a. By any means you like, construct a frequency table with columns for class endpoints,
class midpoint, frequency, relative frequency, cumulative frequency, and cumulative
relative frequency.
5b. By any means you like, estimate the values of the mean, median, variance, and
standard deviation.
Tip -
Use the midpoints of each age interval as your values and use number of cases as
their frequencies. For example, the value 10 has an estimated frequency of 5, the
value 20 has an estimated frequency of 10, and so on.
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 44 of 50
7b. Percentiles (and Quantiles)
Percentiles are one way to summarize the range and shape of values in a distribution. Percentile values
communicate various “cut-points”. For example:
Suppose that 50% of a cohort survived at least 4 years.
This also means that 50% survived at most 4 years.
We say 4 years is the median.
The median is also called the 50th percentile, or the 0.50 quantile. We write \(P_{50}\) = 4 years.
Similarly we could speak of other percentiles:
\(P_{25}\): 25% of the sample values are less than or equal to this value. This is the 0.25 quantile.
\(P_{75}\): 75% of the sample values are less than or equal to this value. This is the 0.75 quantile.
\(P_{0}\): The minimum.
\(P_{100}\): The maximum.
It is possible to estimate the values of percentiles from a cumulative frequency polygon (no worries, we’ll
come to this in Unit 2, Data Visualization).
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 45 of 50
Example: Consider \(P_{10}\) = 18. It is interpreted as follows: “10% of the sample is age ≤ 18” or “The 10th
percentile of age in this sample is 18 years”.
How to Determine the Values of Q1, Q2, Q3 – the 25th, 50th, and 75th Percentiles in a Data Set
Often, it is the quartiles we’re after. An easy solution for these is the following. Obtain the median of the entire
sample. Then obtain the medians of each of the lower and upper halves of the distribution.
Step 1 - Preliminary:
Arrange the observations in your sample in order, from smallest to largest, with the smallest observation at the left.
Step 2 – Obtain median of entire sample:
Solve first for the value of Q2 = 50th percentile (“median”):
Sample Size is ODD: Q2 = \(\left[\frac{n+1}{2}\right]\)th ordered observation
Sample Size is EVEN: Q2 = average of the \(\left[\frac{n}{2}\right]\)th and \(\left[\frac{n}{2}+1\right]\)st ordered observations
Step 3 – Q1 is the median of the lower half of the sample:
To obtain the value of Q1 = 25th percentile, solve for the median of the lower 50% of the sample.
Step 4 – Q3 is the median of the upper half of the sample:
To obtain the value of Q3 = 75th percentile, solve for the median of the upper 50% of the sample.
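Steps 1 to 4 can be sketched in Python as follows. This is a hedged illustration rather than a definitive implementation: the function name `quartiles_by_halves` is hypothetical, and it assumes that when n is odd the overall median is counted in both halves, which is how the n=7 worked example in these notes treats it. The data are the seven values from that example.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) as medians of the lower half, whole sample, upper half.

    Assumption: when n is odd, the overall median belongs to both halves.
    """
    x = sorted(values)                 # Step 1: order the observations
    n = len(x)
    half = (n + 1) // 2                # size of each half (median included if n is odd)
    lower, upper = x[:half], x[n - half:]
    return (statistics.median(lower),  # Step 3: Q1
            statistics.median(x),      # Step 2: Q2
            statistics.median(upper))  # Step 4: Q3

sample = [1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94]
q1, q2, q3 = quartiles_by_halves(sample)
print(round(q1, 2), q2, round(q3, 2))  # 2.21 3.43 3.76
```

Statistical software often uses other quartile conventions, so small differences from this hand method are to be expected.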
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 46 of 50
Example
Consider the following sample of n=7 data values
1.47   2.06   2.36   3.43   3.74   3.78   3.94
Solution for Q2
Q2 = 50th Percentile = \(\left[\frac{7+1}{2}\right]\)th = 4th ordered observation = 3.43
Solution for Q1
The lower 50% of the sample is thus the following (n is odd, so the median 3.43 is included in this half)
1.47 2.06 2.36 3.43
Q1 = 25th Percentile = average of the \(\left[\frac{4}{2}\right]\)th and \(\left[\frac{4}{2}+1\right]\)st = average of the 2nd and 3rd observations
= average(2.06, 2.36) = 2.21
Solution for Q3
The upper 50% of the sample is the following (again including the median 3.43)
3.43 3.74 3.78 3.94
Q3 = 75th Percentile = average of the \(\left[\frac{4}{2}\right]\)th and \(\left[\frac{4}{2}+1\right]\)st = average of the 2nd and 3rd observations
= average(3.74, 3.78) = 3.76
How to determine the values of other Percentiles in a Data Set
Important Note: Unfortunately, there exist multiple formulae for doing this calculation. Thus, there is no single
correct method.
Consider the following sample of n=40 data values
0     87    173   253
1     103   173   256
1     112   198   266
3     121   208   277
17    123   210   284
32    130   222   289
35    131   227   290
44    149   234   313
48    164   245   477
86    167   250   491
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 47 of 50
Step 1:
Order the data from smallest to largest.
Step 2:
Compute \( L = n\left[\frac{p}{100}\right] \) where
n = size of sample (eg; n=40 here)
p = desired percentile (eg; p=25 for the 25th percentile)
Step 3, if L is NOT a whole number:
Change L to the next whole number.
Pth percentile = Lth ordered value in the data set.
Step 3, if L is a whole number:
Pth percentile = average of the Lth and (L+1)st ordered values in the data set.
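The three steps translate directly into a short Python function. Treat this as a sketch of this particular rule only, since the notes stress that several formulae exist. The function name `percentile` and the choice of p = 33 as a non-whole-number case are assumptions made for illustration; the data are the n=40 values listed above.

```python
import math

def percentile(values, p):
    """p-th percentile (0 < p < 100) using the L = n * (p / 100) rule."""
    if not 0 < p < 100:
        raise ValueError("this rule is only applied here for 0 < p < 100")
    x = sorted(values)                    # Step 1: order the data
    n = len(x)
    L = n * p / 100                       # Step 2: compute L
    if L != int(L):
        # Step 3, L not whole: round up, take the L-th ordered value
        return x[math.ceil(L) - 1]
    # Step 3, L whole: average the L-th and (L+1)-st ordered values
    L = int(L)
    return (x[L - 1] + x[L]) / 2

data = [0, 1, 1, 3, 17, 32, 35, 44, 48, 86,
        87, 103, 112, 121, 123, 130, 131, 149, 164, 167,
        173, 173, 198, 208, 210, 222, 227, 234, 245, 250,
        253, 256, 266, 277, 284, 289, 290, 313, 477, 491]

print(percentile(data, 25))  # L = 10 is whole: average of 86 and 87 = 86.5
print(percentile(data, 33))  # L = 13.2, so use the 14th ordered value = 121
```

Python's list indexes start at 0, which is why the L-th ordered value is `x[L - 1]`.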
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 48 of 50
7c. Five Number Summary
A “five number summary” of a set of data is, simply, a particular set of five percentiles:
\(P_{0}\): The minimum value.
\(P_{25}\): 25% of the sample values are less than or equal to this value.
\(P_{50}\): The median. 50% of the sample values are less than or equal to this value.
\(P_{75}\): 75% of the sample values are less than or equal to this value.
\(P_{100}\): The maximum.
Why bother? This choice of five percentiles is actually a good summary, since:
The minimum and maximum identify the extremes of the distribution, and
The 1st and 3rd quartiles identify the middle “half” of the data, and
Altogether, the five percentiles are the values that define the quartiles of the distribution, and
Within each interval defined by quartile values, there are an equal number of observations.
Example, continued –
We’re just about done since on page 46, the solution for \(P_{25}\), \(P_{50}\), and \(P_{75}\) was shown. Here is the data again.
1.47   2.06   2.36   3.43   3.74   3.78   3.94
Thus,
\(P_{0}\) = the minimum value = 1.47
\(P_{25}\) = 1st quartile = 25th percentile = 2.21
\(P_{50}\) = 2nd quartile = 50th percentile (median) = 3.43
\(P_{75}\) = 3rd quartile = 75th percentile = 3.76
\(P_{100}\) = the maximum value = 3.94
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 49 of 50
7d. Interquartile Range (IQR)
The interquartile range is simply the difference between the 1st and 3rd quartiles:
\[ IQR = \text{Interquartile Range} = [\, P_{75} - P_{25} \,] \]
The IQR is a useful summary also:
It is an alternative summary of dispersion (sometimes used instead of standard deviation)
The range represented by the IQR tells you the spread of the middle 50% of the sample values
Example, continued –
Here is the data again.
1.47   2.06   2.36   3.43   3.74   3.78   3.94
\(P_{0}\) = the minimum value = 1.47
\(P_{25}\) = 1st quartile = 25th percentile = 2.21
\(P_{50}\) = 2nd quartile = 50th percentile (median) = 3.43
\(P_{75}\) = 3rd quartile = 75th percentile = 3.76
\(P_{100}\) = the maximum value = 3.94
IQR = Interquartile Range = [ \(P_{75}\) – \(P_{25}\) ] = [ 3.76 – 2.21 ] = 1.55
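For readers who like to check by computer, here is a minimal Python sketch that reproduces the five number summary and the IQR for these seven values. It assumes the median-of-halves rule used in these notes, with the median counted in both halves because n is odd; the dictionary name `five_number` is illustrative.

```python
import statistics

sample = [1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94]
x = sorted(sample)
half = (len(x) + 1) // 2   # assumption: the median sits in both halves when n is odd

five_number = {
    "P0": x[0],                                    # minimum
    "P25": statistics.median(x[:half]),            # median of the lower half
    "P50": statistics.median(x),                   # median of the whole sample
    "P75": statistics.median(x[len(x) - half:]),   # median of the upper half
    "P100": x[-1],                                 # maximum
}
iqr = five_number["P75"] - five_number["P25"]

print({name: round(value, 2) for name, value in five_number.items()})
print(round(iqr, 2))  # 1.55
```

Other software may report slightly different quartiles, and therefore a slightly different IQR, because it may use a different percentile formula.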
BIOSTATS 540 Fall 2022 1. Summarizing Data Page 50 of 50
Activity Introduction to “Art of Stat”
Recall again the behavioral ratings data of question #4b on page 39:
26 26 25 25 25
28 26 26 25
34 30 31 28
26 34 25 25
25 28 25 25
In question #4b, you were asked to compute by hand the values of the following sample statistics:
mean, median, mode, range, variance, and standard deviation, and the 25th and 75th percentiles.
In this exercise (activity, really), I invite you to play with this same data in a wonderful
online application.
Step 1 - Launch "artofstat.com" and click at right on Online Web Apps
Step 2 - From the main welcome window, middle, click on
EXPLORE QUANTITATIVE DATA
Step 3 - From this menu, at top left, under ENTER DATA, choose YOUR OWN
Step 4 - Now enter your data. One way is to do this by hand. Alternatively, if you're lazy
(like me), you could also highlight to select the data in these course notes and
then do an EDIT>COPY (control-C) followed by an EDIT>PASTE.
Tip - Not sure, but you might just have to do a little tweaking to be sure the
data values themselves are separated by just one space
AND NOW - just play around with this app and enjoy.