5
Practical Data Analysis Examples
Scatterplot matrices: Scatterplot matrices can give useful insights on data that will be used for regression or related calculations.
Transformation: Data often require transformation prior to entry into a regression model.
Model objects: Fitting a regression or other such model gives, in the first place, a model object.
Generic functions: plot(), print() and summary() are examples of generic functions. With a dataframe as argument, plot() gives a scatterplot matrix. With an lm object, it gives diagnostic plots.
Extractor functions: Use an extractor function to extract output from a model object. Extractor functions are generic functions.
List objects: An lm model object is a list object. Lists are used extensively in R.
This chapter will use examples to illustrate common issues in
the exploration of data and the fitting of regression models. It will
round out the discussion of Chapters 1 to 2 and 4 by adding some
further important technical details. These include the use of generic
functions such as plot() and print(), the way that regression
model objects are structured, and the use of extractor functions to
extract information from model objects.
5.1 The Uses of Scatterplot Matrices
The MASS package will be needed, for the dataset mammals.
library(MASS, quietly=TRUE)
5.1.1 Transformation to an appropriate scale
Among other issues: is there a wide enough spread of distinct values that the data can be treated as continuous?
A first step is to elicit basic information on the columns in the data,
including information on relationships between explanatory vari-
ables. Is it desirable to transform one or more variables?
Transformations are helpful if they ensure, as far as possible, that:
- all columns have a distribution that is reasonably well spread out over the whole range of values, i.e., it is unsatisfactory to have most values squashed together at one end of the range, with a small number of very small or very large values occupying the remaining part of the range;
- relationships between columns are roughly linear;
- the scatter about any relationship is similar across the whole range of values.
It may happen that one transformation, often a logarithmic transformation, will achieve all of these at the same time.
The scatterplot in Figure 5.1A, showing data from the dataset
mammals (MASS), is an extreme version of the common situation
where positive (or non-zero) values are squashed together in the
lower part of the range, with a tail out to the right. Such a distribu-
tion is said to be “skewed to the right”.
Code for Figure 5.1A
plot(brain ~ body, data=mammals)
mtext(side=3, line=0.5, adj=0, "A: Unlogged data")
Figure 5.1: Brain weight (g)
versus Body weight (kg), for 62
species of mammal. Panel A plots
the unlogged data, while Panel B
has log scales for both axes, but
with axis labels in the original
(unlogged) units.
Figure 5.1B shows the scatterplot for the logged data. Code for
Figure 5.1B is:
plot(brain ~ body, data=mammals, log="xy")
mtext(side=3, line=0.5, adj=0,
      "B: Log scales (both axes)")
Where, as in Figure 5.1A, values are concentrated at one end of
the range, the small number (perhaps one or two) of values that lie at
the other end of the range will, in a straight line regression with that
column as the only explanatory variable, be a leverage point. When
it is one explanatory variable among several, those values will have
an overly large say in determining the coefficient for that variable.
As happened here, a logarithmic transformation will often re-
move much or all of the skew. Also, as happened here, such transfor-
mations often bring the added bonus that relationships between the
resulting variables are approximately linear.
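A quick check, not from the original text, of how logging changes the spread of values; MASS is already attached, so mammals is available:
summary(mammals$body)       # median far below the mean: strong right skew
summary(log(mammals$body))  # after logging, values are much more evenly spread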
5.1.2 The Uses of Scatterplot Matrices
In subsequent chapters, there will be extensive use of scatterplot
matrices. A scatterplot matrix plots every column against every
other column, showing the result in the layout used for correlation
matrices. Figure 5.2 shows a scatterplot matrix for the trees dataset
in the datasets package that is automatically attached when R
starts.
Interpreting Scatterplot Matrices:
For identifying the axes for each panel
- look across the row to the diagonal to identify the variable on the vertical axis;
- look up or down the column to the diagonal for the variable on the horizontal axis.
Each below-diagonal panel is the mirror image of the corresponding above-diagonal panel.
## Code used for the plot
plot(trees, cex.labels=1.5)
# Calls pairs(trees)
(Scatterplot matrix panels: Girth, Height, Volume.)
Figure 5.2: Scatterplot matrix for
the trees data, obtained using
the default plot() method for
data frames. The scatterplot matrix
is a graphical counterpart of the
correlation matrix.
Notice that plot(), called with the dataframe trees, has dispatched to the plot method for a data frame, i.e., it has called plot.data.frame(), which has in turn called the function pairs().
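A direct call to pairs() gives essentially the same plot (a quick check, not in the original code):
pairs(trees, cex.labels=1.5)
## class(trees) is "data.frame", so plot() dispatches to the data frame method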
The scatterplot matrix may be examined, if there are enough points, for evidence of:
1. Strong clustering in the data, and/or obvious outliers;
2. Clear non-linear relationships, so that a correlation will underestimate the strength of any relationship;
3. Severely skewed distributions, so that the correlation is a biased measure of the strength of relationship.
The scatterplot matrix should be used only as an initial coarse screening device. Skewness in the individual distributions is better checked with the use of plots of density estimates.
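As a minimal sketch of the density check suggested in the margin note (not part of the original code), for one column of the trees data:
plot(density(trees$Volume), main="Density estimate: Volume")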
5.2 World record times for track and road events
The first example is for world track and road record times, as at 9th August 2006. Data are from the web page http://www.gbrathletics.com/wrec.htm. Data are in the dataset worldRecords in the DAAG package.
Note also the use of these data in the exercise at the end of Chapter 2 (Section 2.6.2).
Data exploration
First, use str() to get information on the data frame columns:
library(DAAG, quietly=TRUE)
str(worldRecords, vec.len=3)
'data.frame':   40 obs. of  5 variables:
 $ Distance   : num  0.1 0.15 0.2 0.3 0.4 0.5 0.6 0.8 ...
 $ roadORtrack: Factor w/ 2 levels "road","track": 2 2 2 2 2 2 2 2 ...
 $ Place      : chr  "Athens" "Cassino" "Atlanta" ...
 $ Time       : num  0.163 0.247 0.322 0.514 ...
 $ Date       : Date, format: "2005-06-14" ...
Distinguishing points for track events from those for road events
is easiest if we use lattice graphics, as in Figure 5.3.
## Code
library(lattice)
xyplot(Time ~ Distance, scales=list(tck=0.5),
       groups=roadORtrack, data=worldRecords,
       auto.key=list(columns=2), aspect=1)
## On a colour device the default is to use
## different colours, not different symbols,
## to distinguish groups.
Figure 5.3: World record times versus distance, for track and road events.
Clearly increases in Time are not proportional to increases in
Distance. Indeed, such a model does not make sense; velocity
decreases as the length of the race increases. Proportionality when
logarithmic scales are used for the two variables does make sense.
Figure 5.4 uses logarithmic scales on both axes. The two panels differ only in the labeling of the scales. The left panel uses labels on a scale of log_e, while the right panel has labels in the original (unlogged) units. Notice the use of auto.key to obtain a key.
Figure 5.4: World record times versus distance, for track and road events, using logarithmic scales. The left panel uses labels on a scale of log_e, while in the right panel, labeling is in the original units, expressed as powers of 10.
## Code for Left panel
xyplot(log(Time) ~ log(Distance),
       groups=roadORtrack, data=worldRecords,
       scales=list(tck=0.5),
       auto.key=list(columns=2), aspect=1)
## Right panel
xyplot(Time ~ Distance, groups=roadORtrack,
       data=worldRecords,
       scales=list(log=10, tck=0.5),
       auto.key=list(columns=2), aspect=1)
Fitting a regression line
The plots suggest that a line is a good fit. Note however that the
data span a huge range of distances. The ratio of longest to shortest
distance is almost 3000:1. Departures from the line are of the order
of 15% at the upper end of the range, but are so small relative to this
huge range that they are not obvious.
The following uses the function lm() to fit a straight line to the logged data, then extracts the regression coefficients. (The name lm is a mnemonic for linear model.)
The equation gives predicted times:
  Time = e^0.7316 * Distance^1.1248 = 2.08 * Distance^1.1248
This implies, as would be expected, that minutes per kilometer increase with increasing distance. Fitting a line to points that are on a log scale thus allows an immediate interpretation.
worldrec.lm <- lm(log(Time) ~ log(Distance),
                  data=worldRecords)
coef(worldrec.lm)
(Intercept) log(Distance)
0.7316 1.1248
There is no difference that can be detected visually between the track races and the road races. Careful analysis will in fact find no difference.
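One way such an analysis might proceed (a sketch only; the model name worldrecBoth.lm is ours, not from the text) is to add roadORtrack to the model and compare the two fits:
worldrecBoth.lm <- lm(log(Time) ~ log(Distance) + roadORtrack,
                      data=worldRecords)
anova(worldrec.lm, worldrecBoth.lm)  # F-test for the added term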
5.2.1 Summary information from model objects
In order to avoid recalculation of the model information each time that some different information is required, we store the result from the lm() calculation in the model object worldrec.lm. The name worldrec.lm is used to indicate that this is an lm object, with data from worldRecords. Use any name that seems helpful!
Plot points; add line:
plot(log(Time) ~ log(Distance),
     data=worldRecords)
abline(worldrec.lm)
Note that the function abline() can be used with the model
object as argument to add a line to the plot of log(Time) against
log(Distance).
Diagnostic plots
Insight into the adequacy of the line can be obtained by examining
the “diagnostic” plots, obtained by “plotting” the model object.
Figure 5.5 following shows the first and last of the default plots:
## Code
plot(worldrec.lm, which=c(1, 5),
     sub.caption=rep("", 2))
By default, there are four "diagnostic" plots. Panel A is designed to give an indication whether the relationship really is linear, or whether there is some further systematic component that should perhaps be modeled. It does show systematic differences from a line. The largest of these is more than a 15% difference.[1] There are mechanisms for using a smooth curve to account for the differences from a line, if these are thought important enough to model.
[1] A difference of 0.05 on a scale of log_e translates to a difference of just over 5%. A difference of 0.15 translates to a difference of just over 16%, i.e., slightly more than 15%.
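One possible mechanism, shown here only as an illustration (the natural spline choice and the name worldrecSpline.lm are ours, not from the text), is to add a spline term and compare fits:
library(splines)
worldrecSpline.lm <- lm(log(Time) ~ ns(log(Distance), df=3),
                        data=worldRecords)
AIC(worldrec.lm, worldrecSpline.lm)  # compare the straight line and spline fits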
(Diagnostic plot panels: "Residuals vs Fitted" and "Residuals vs Leverage", with Cook's distance contours; observations 24, 39 and 40 are labeled.)
Figure 5.5: First and last of the default diagnostic plots, from the linear model for log(record time) versus log(distance), for track and road events.
The plot in panel B allows an assessment of the extent to which
individual points are influencing the fitted line. Observation 40 does
have both a very large leverage and a large Cook’s distance. The plot
on the left makes it clear that this is the point with the largest fitted
time. Observation 40 is for a 24h race, or 1440 min. Examine
worldRecords["40", ]
   Distance roadORtrack Place Time       Date
40    290.2        road Basle 1440 1998-05-03
5.2.2 The model object
Functions that are commonly used to get information about model
objects are: print(), summary() and plot(). These are all
generic functions. The effect of the function depends on the class of object that is printed (i.e., by default, displayed on the screen), plotted, or summarized.
The function print() may display relatively terse output, while
summary() may display more extensive output. This varies from
one type of model object to another.
Compare the outputs from the following:
print(worldrec.lm) # Alternatively, type worldrec.lm
Call:
lm(formula = log(Time) ~ log(Distance), data = worldRecords)
Coefficients:
(Intercept) log(Distance)
0.732 1.125
summary(worldrec.lm)
Call:
lm(formula = log(Time) ~ log(Distance), data = worldRecords)

Residuals:
     Min       1Q   Median       3Q      Max
 -0.0807  -0.0497   0.0028   0.0377   0.1627

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.73160    0.01241      59   <2e-16
log(Distance)  1.12475    0.00437     257   <2e-16

Residual standard error: 0.0565 on 38 degrees of freedom
Multiple R-squared: 0.999, Adjusted R-squared: 0.999
F-statistic: 6.63e+04 on 1 and 38 DF, p-value: <2e-16
Internally, summary(worldrec.lm) calls UseMethod("summary"). As worldrec.lm is an lm object, this calls summary.lm().
Used with lm objects, print() calls print.lm(), while
summary() calls summary.lm(). Note that typing worldrec.lm
has the same effect as print(worldrec.lm).
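The dispatch can be checked directly (a brief sketch, not part of the original text):
class(worldrec.lm)    # "lm"
methods(class="lm")   # lists plot.lm, print.lm, summary.lm, among others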
5.2.3 The lm model object is a list
The model object is actually a list. Here are the names of the list
elements:
names(worldrec.lm)
[1] " coef fici ents " "residua ls " " effe cts "
[4] " rank" " fitte d. val ues " " ass ign "
[7] "qr " " df . resi dual " "x levels "
[10] "call" " terms" " model"
These different list elements hold objects of very different classes and dimensions (or lengths). Hence the use of a list; any collection of different R objects can be brought together into a list.
The following is a check on the model call:
worldrec.lm$call
lm(formula = log(Time) ~ log(Distance), data = worldRecords)
Commonly required information is best accessed using generic
extractor functions.
Use the extractor function coef():
coef(worldrec.lm)
Above, attention was drawn to print(), summary()
and plot(). Other commonly used extractor functions are
residuals(), coefficients(), and fitted.values(). These
can be abbreviated to resid(), coef(), and fitted().
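A brief sketch (ours, not from the text) of these extractor functions in use:
fitval <- fitted(worldrec.lm)   # fitted values, on the log(Time) scale
res <- resid(worldrec.lm)       # residuals
plot(fitval, res)               # a hand-rolled residuals versus fitted plot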
5.3 Regression with two explanatory variables
The dataset nihills in the DAAG package will be used for a regres-
sion fit in Section 11.6. This has record times for Northern Ireland
mountain races. Overview details of the data are:
str(nihills)
'data.frame':   23 obs. of  4 variables:
 $ dist : num  7.5 4.2 5.9 6.8 5 4.8 4.3 3 2.5 12 ...
 $ climb: int  1740 1110 1210 3300 1200 950 1600 1500 1500 5080 ...
 $ time : num  0.858 0.467 0.703 1.039 0.541 ...
 $ timef: num  1.064 0.623 0.887 1.214 0.637 ...
The function splom() is a lattice
alternative to pairs(), giving a
different panel layout.
Figure 5.6 uses the function splom() from the lattice package to give two scatterplot matrices: the left panel shows the unlogged data, while the right panel shows the logged data:
(Panel A, untransformed data: dist, climb, time, timef. Panel B, log transformed data: ldist, lclimb, ltime, ltimef, with correlations shown in the lower panels.)
Figure 5.6: Scatterplot matrices
for the Northern Ireland mountain
racing data. The left panel is for
the unlogged data, while the right
panel is for the logged data. Code
has been added that shows the
correlations, in the lower panel.
The following panel function was used to show the correlations:
showcorr <- function(x, y, ...){
    panel.xyplot(x, y, ...)
    xy <- current.panel.limits()
    rho <- paste(round(cor(x, y), 3))
    eps <- 0.035*diff(range(y))
    panel.text(max(x), min(y)+eps, rho,
               pos=2, offset=-0.2)
}
Code for the scatterplot matrix in the left panel is:
## Scatterplot matrix ; unlogged data
library(lattice)
splom(~ nihills, xlab="",
      main=list("A: Untransformed data", x=0,
                just="left", fontface="plain"))
For the right panel, create a data frame from the logged data:
lognihills <- log(nihills)
names(lognihills) <- paste0("l", names(nihills))
## Scatterplot matrix; log scales
splom(~ lognihills, lower.panel=showcorr, xlab="",
      main=list("B: Log transformed data", x=0,
                just="left", fontface="plain"))
Unlike paste(), the function
paste0() does not leave spaces
between text strings that it pastes
together.
Note that the data are positively skewed, i.e., there is a long tail
to the right, for all variables. For such data, a logarithmic transfor-
mation often gives more nearly linear relationships. The relation-
ships between explanatory variables, and between the dependent
variable and explanatory variables, are closer to linear when loga-
rithmic scales are used. Just as importantly, issues with large leverage are greatly reduced: without the transformation, the point with the largest data values has a much greater leverage, and hence a much greater influence on the fitted regression, than other points.
Notice also that the correlation of 0.913 between climb and dist in the left panel of Figure 5.6 is very different from the correlation of 0.78 between lclimb and ldist in the right panel. Correlations where distributions are highly skewed are not comparable with correlations where distributions are more nearly symmetric. The statistical properties are different.
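The correlations quoted above can be checked directly (a minimal sketch; the values in the comments are those quoted in the text):
with(nihills, cor(dist, climb))       # unlogged data; approximately 0.913
with(lognihills, cor(ldist, lclimb))  # logged data; approximately 0.78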
The following regresses log(time) on log(climb) and
log(dist):
nihills.lm <- lm(ltime ~ lclimb + ldist,
                 data=lognihills)
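A quick look at the fit (not shown in the original text) can use the extractor and generic functions introduced earlier:
summary(nihills.lm)$coefficients  # estimates and standard errors
plot(nihills.lm, which=1)         # residuals versus fitted values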
5.4 One-way Comparisons
Note that, in an experiment such
as this, the plants should be grown
in separate pots, with a random
arrangement of pots.
The dataset tomato has weights of plants that were grown under one of four different sets of experimental conditions. Five plants were grown under each of the treatments:
- water only
- conc nutrient
- 2-4-D + conc nutrient
- 3x conc nutrient
Notice that “water only” has been made the reference level. This
is done as a preferred starting point for the analysis of variance
calculations that appear below. Figure 5.7, created using the function
quickplot() from the ggplot2 package, shows the plant weights.
There do seem to be some fairly clear differences.
Figure 5.7: Weights (g) of tomato
plants grown under four different
treatments.
## Code
library(ggplot2)
tomato <- within(tomato, trt <- relevel(trt, ref="water only"))
quickplot(weight, trt, data=tomato, xlab="Weight (g)", ylab="")
Observe that, to get estimates
and SEs of treatment effects,
tomato.aov can be treated as an
lm (regression) object.
The command aov(), followed by a call to summary.lm(), can
be used to analyse these data, thus:
tomato.aov <- aov(weight ~ trt, data=tomato)
round(coef(summary.lm(tomato.aov)), 3)
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)                 1.683      0.187   9.019    0.000
trt2-4-D + conc nutrient   -0.358      0.264  -1.358    0.190
trt3x conc nutrient        -0.700      0.264  -2.652    0.015
trtconc nutrient            0.067      0.264   0.253    0.803
Because we made water only the reference level, (Intercept) is the mean for water only, and the other coefficients are for differences from water only. None of the treatments can be distinguished statistically from water only.
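The overall F-test for differences between the four treatments can be obtained by calling summary() on the aov object (a sketch; the output is not reproduced here):
summary(tomato.aov)  # analysis of variance table, with the overall F-test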
Growing conditions in a laboratory or glasshouse or growth chamber (temperature, humidity and air movement) are rarely totally uniform. The preferred way to deal with this is to repeat the comparison between treatments in several different locations.[2] Conditions will vary between locations, but each comparison between treatments is conducted under the relatively uniform conditions at one particular location.
[2] Technically, each different location (or set of conditions) is known as a block.
The dataset rice (DAAG) is from such an experiment. There are two treatment factors: three types of fertilizer and two varieties of rice plant. The six treatment combinations are each repeated six times, in each of two blocks.
For these data, Figure 5.8 gives a clear picture of the result. For fertilizers NH4Cl and NH4NO3, any difference between the varieties is inconsequential. There is strong “interaction” between fert and variety. A formal analysis (sketched below, after Figure 5.8) will confirm what is already rather clear. Experimenters are rarely thus fortunate.
## Code
library(DAAG)
with(rice, interaction.plot(x.factor=fert,
                            trace.factor=variety,
                            ShootDryMass,
                            cex.lab=1.4))
Figure 5.8: Interaction plot for the terms fert and variety, with ShootDryMass as the dependent variable. Notice that for fertilizer F10, there is a huge variety difference in the response. For the other fertilizers, there is no difference of consequence.
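A formal analysis along the lines just mentioned might be the following (a sketch only; the name rice.aov is ours, and a block term could be added if the data frame includes a block identifier):
rice.aov <- aov(ShootDryMass ~ fert * variety, data=rice)
summary(rice.aov)  # the fert:variety line tests the interaction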
5.5 Time series: Australian annual climate data
Data are from the website
http://www.bom.gov.au/
climate/change/
The data frame bomregions2012 from the DAAG package has
annual rainfall data, both an Australian average and broken down by
location within Australia, for 1900 to 2012. Figure 5.9 shows annual
rainfall in the Murray-Darling basin, plotted against year.
Figure 5.9: Annual rainfall in the Australian Murray-Darling Basin, by year. The lowess() function is used to add the smooth curves: the dashed curve with f=2/3 captures the overall trend, while the solid curve with f=0.1 captures trends on a scale of around eleven years. (10% of the 113-year range from 1900 to 2012 is a little more than 11 years.)
## Code
library(DAAG)
plot(mdbRain ~ Year, data=bomregions2012)
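The caption to Figure 5.9 mentions two lowess() curves. Possible code for adding them to the plot (an assumption on our part, not reproduced from the text) is:
with(bomregions2012, lines(lowess(Year, mdbRain, f=2/3), lty=2))  # overall trend (dashed)
with(bomregions2012, lines(lowess(Year, mdbRain, f=0.1)))         # trends on a roughly 11-year scale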