3. Linear Regression Flashcards

Sample covariance and correlation, least square regression, alternative regression models

1
Q

What is regression used for?

A
  • many data sets have observations of several variables for each individual
  • the aim of regression is to ‘predict’ the value of one variable, y, using observations from another variable, x
2
Q

What is linear regression used for?

A

-linear regression is used for numerical data and uses a relation in the form:
y ≈ α + βx
-in a plot of y as a function of x, this relation describes a straight line

3
Q

Paired Samples

A
  • to fit a linear model we need observations of x and y
  • it is important that these are paired samples, i.e. that for each i ∈ {1,…,n} the observations xi and yi belong to the same individual
4
Q

Examples of Paired Samples

A
  • weight and height of a person
  • engine power and fuel consumption of a car

5
Q

Linear Regression

Constructing a Model

A

-assume we have observed data (xi, yi) for i ∈ {1,…,n}
-to construct a model for these data, we use random variables Y1,…,Yn such that:
Yi = α + βxi + εi
-for all i ∈ {1,…,n}, where ε1,…,εn are i.i.d. random variables with E(εi)=0 and Var(εi)=σ²
-here we assume that the x-values are fixed and known
-thus the only random quantities in the model are Yi and εi
-the values α, β and σ² are parameters of the model; to fit the model to data we need to estimate these parameters (a simulation sketch follows below)

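A minimal R sketch of simulating data from this model (the parameter values are illustrative assumptions, not from the notes):

# simulate from Yi = α + βxi + εi with made-up parameters
set.seed(1)
n <- 50
x <- seq(0, 10, length.out = n)        # fixed, known x-values
alpha <- 2; beta <- 0.5; sigma <- 1    # illustrative parameter values
eps <- rnorm(n, mean = 0, sd = sigma)  # i.i.d. residuals with E(εi)=0, Var(εi)=σ²
y <- alpha + beta * x + eps            # the only random quantities are εi and Yi
plot(x, y)                             # scatter plot of the simulated data
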
6
Q

Linear Regression

Residuals/Errors

A

-starting with the model:
Yi = α + βxi + εi
-the random variables εi are called residuals or errors
-in a scatter plot they correspond to the vertical distance between the samples and the regression line
-often we assume that εi~N(0,σ²) for all i

7
Q

Linear Regression

Expectation of Yi

A

-we have the linear regression model:
Yi = α + βxi + εi
-then the expectation is given by:
E(Yi) = E(α + βxi + εi)
-the expectation of a constant is just the constant itself, and remember that xi represents a known value here:
E(Yi) = α + βxi + E(εi)
-recall that εi are modeled as random variables with E(εi)=0:
E(Yi) = α + βxi
-thus the expectation of Yi depends on xi and, at least for β≠0, the random variables Yi are not identically distributed

8
Q

What are sample covariance and correlation used for?

A

-to study the dependence between paired numeric variables

9
Q

Sample Covariance

Definition

A

-the sample covariance of x1,…,xn ∈ ℝ and y1,…,yn ∈ ℝ is given by:
σxy = 1/(n-1) Σ (xi - x̄)(yi - ȳ)
-where the sum runs from i=1 to i=n, and x̄ and ȳ are the sample means

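For example, an illustrative R check with made-up data (cov() is the built-in function):

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
# sample covariance computed directly from the definition
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
cov(x, y)   # the built-in function gives the same value
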
10
Q

Sample Correlation

Definition

A

-the sample correlation of x1,…,xn ∈ ℝ and y1,…,yn ∈ ℝ is given by:
ρxy = σxy / √(σx² σy²)
-where σxy is the sample covariance and σx², σy² are the sample variances

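An illustrative R check in the same style (arbitrary data):

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
# sample correlation = sample covariance divided by the product of the standard deviations
cov(x, y) / sqrt(var(x) * var(y))
cor(x, y)   # the built-in function gives the same value
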
11
Q

What is the sample covariance of a sample with itself?

A
  • we can show that the sample covariance of a sample with itself equals the sample variance
  • i.e. σxx = σx², analogous to Cov(X,X) = Var(X) for random variables
12
Q

What values can correlation take?

A

-the correlation of two samples is always in the interval [-1,1]

13
Q

Interpreting Correlation

ρxy≈1

A
  • strong positive correlation: ρxy≈1 indicates that the points (xi,yi) lie close to a straight line with positive slope
  • in this case y is almost completely determined by x
14
Q

Interpreting Correlation

ρxy≈-1

A
  • strong negative correlation: ρxy≈-1 indicates that the points (xi,yi) lie close to a straight line with negative slope
  • in this case y is almost completely determined by x
15
Q

Interpreting Correlation

ρxy≈0

A
  • this means that there is no linear relationship between x and y which helps to predict y from x
  • this could be because x and y are independent or because the relationship between x and y is non-linear
16
Q

How can the sample covariance be used to estimate the covariance of random variables?

A

-if (X1,Y1),…,(Xn,Yn) are i.i.d. pairs of random variables, then we can show:
lim σxy(X1,…,Xn,Y1,…,Yn) = Cov(X1,Y1)
-where the limit is taken as n tends to infinity

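A small simulation sketch of this limit (the bivariate construction below is an illustrative assumption; by construction Cov(X,Y) = 2):

set.seed(1)
for (n in c(10, 1000, 100000)) {
  x <- rnorm(n)             # Var(X) = 1
  y <- 2 * x + rnorm(n)     # hence Cov(X, Y) = 2
  cat("n =", n, " sample covariance =", cov(x, y), "\n")
}
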
17
Q

Correlation and Covariance in R

A

-the functions to compute sample covariances and correlations in R are cov() and cor()

18
Q

Correlation and Covariance in R

else

A
  • both functions, cov() and cor(), have an optional argument use=… which controls how missing data are handled (a short example follows below)
  • if use='everything' or use is not specified, the functions return NA if any input data are missing
  • if use='all.obs', the functions abort with an error if any input data are missing
  • if use='complete.obs', any pairs (xi, yi) where either xi or yi is missing are ignored and the covariance/correlation is computed using the remaining samples
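A minimal sketch of the three options (illustrative data with one missing value):

x <- c(1, 2, NA, 4, 5)
y <- c(2, 4, 6, 8, 10)
cov(x, y)                         # NA, since the default is use='everything'
# cov(x, y, use='all.obs')        # would abort with an error because of the NA
cov(x, y, use='complete.obs')     # ignores the pair containing the missing value
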
19
Q

What is least squares regression?

A
  • least squares is a method for determining the parameter values α, β and σ²
  • methods for doing this differ mainly in how they treat outliers in the data
20
Q

Least Squares Regression

Minimising the Residual Sum of Squares - Formula

A

-we estimate the parameters α, β and σ² using the values which minimise the residual sum of squares:
r(α,β) = Σ (yi - (α + βxi))²
-for given α and β, the value r(α,β) measures how close the given data points (xi,yi) are to the regression line α+βx
-by minimising r(α,β) we find the regression line which is closest to the data

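As a sketch, r(α,β) can be written directly in R (the function name rss and the data are illustrative):

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
# residual sum of squares for a candidate line α + βx
rss <- function(alpha, beta, x, y) sum((y - (alpha + beta * x))^2)
rss(0, 2, x, y)   # r(α,β) for one particular choice of α and β
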
21
Q

Least Squares Regression

Minimising the Residual Sum of Squares - Lemma

A

-assume σx² > 0
-then the function r(α,β) takes its minimum at the point (α,β) given by:
β = σxy/σx²
α = ȳ - βx̄
-where x̄, ȳ are the sample means, σxy is the sample covariance and σx² is the sample variance

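An illustrative R check of the lemma with simulated data (lm() is the built-in fitting function covered later):

set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)
beta_hat  <- cov(x, y) / var(x)            # β = σxy / σx²
alpha_hat <- mean(y) - beta_hat * mean(x)  # α = ȳ - βx̄
c(alpha_hat, beta_hat)
coef(lm(y ~ x))                            # lm() gives the same estimates
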
22
Q

Least Squares Regression

Minimising the Residual Sum of Squares - Lemma Proof

A

-obtain a simplified expression for r(α,β) using the substitutions:
x̃i = xi - x̄
ỹi = yi - ȳ
-differentiate this with respect to β and set the derivative equal to zero to obtain the condition for a stationary point
-the second derivative is greater than zero, showing that the expression for β gives the minimum of r(α,β)

23
Q

Least Squares Regression

Fitted Regression Line

A

-having used the method of least squares to determine the values α^ and β^ which minimise r(α,β)
-we can consider the fitted regression line:
y = α^ + β^x
-this is an approximation to the unknown true mean α + βx from the model

24
Q

Least Squares Regression

Fitted Values

A

-having used the method of least squares to determine the values α^ and β^ which minimise r(α,β)
-we can consider the fitted values:
yi^ = α^ + β^xi
-these are the y-values of the fitted regression line at the points xi
-if we think of the εi as 'noise' or 'errors', then the values yi^ can be seen as versions of yi with the noise removed

25
Q

Least Squares Regression

Estimated Residuals

A

-having used the method of least squares to determine the values α^ and β^ which minimise r(α,β)
-we can consider the estimated residuals:
εi^ = yi - yi^
= yi - α^ - β^xi
-these are the vertical distances between the data and the fitted regression line

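A short R sketch tying the fitted line, fitted values and estimated residuals together (simulated data for illustration):

set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)
m <- lm(y ~ x)
alpha_hat <- coef(m)[1]; beta_hat <- coef(m)[2]
y_fit   <- alpha_hat + beta_hat * x   # fitted values yi^
eps_hat <- y - y_fit                  # estimated residuals εi^
max(abs(y_fit - fitted(m)))           # ≈ 0, agrees with fitted(m)
max(abs(eps_hat - residuals(m)))      # ≈ 0, agrees with residuals(m)
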
26
Q

Least Squares Regression

Estimating σ²

A

-in order to fit a linear model we also need to estimate the residual variance σ²
-this can be done using the estimator:
σ^² = 1/(n-2) Σ (εi^)² = 1/(n-2) Σ (yi - α^ - β^xi)²
-where εi^ are the estimated residuals from the least squares fit
-to understand the form of this estimator, remember that σ² is just the variance of εi, so the standard variance estimator would give σ² ≈ 1/(n-1) Σ (εi - ε̄)², with ε̄ the mean of the εi
-replacing the unobserved εi by the estimated residuals εi^ (whose mean is zero) and using the denominator n-2 instead of n-1, to account for the two estimated parameters α^ and β^, gives the estimator above

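An illustrative R check of this estimator (simulated data; summary(m)$sigma is the residual standard error):

set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)
m <- lm(y ~ x)
sum(residuals(m)^2) / (length(x) - 2)   # σ^² from the formula above
summary(m)$sigma^2                      # the same value as reported by R
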
27
Q

Least Squares Regression

Unbiased Estimators α^, β^, σ^² Lemma

A

-let x1,…,xn ∈ ℝ be given, and let ε1,…,εn be i.i.d. random variables with E(εi)=0 and Var(εi)=σ²
-let α, β ∈ ℝ and define Yi = α + βxi + εi for all i ∈ {1,…,n}
-furthermore, let α^, β^ and σ^² be the least squares estimators
-then we have:
E(α^(x,Y)) = α
E(β^(x,Y)) = β
E(σ^²(x,Y)) = σ²
-where x=(x1,…,xn) and Y=(Y1,…,Yn)

28
Q

Least Squares Regression

Unbiased Estimators α^, β^, σ^² Proof

A
29
Q

Least Squares Regression

Variance of α^ & β^ Lemma

A

-here α^ and β^ are the least squares estimators
-their variances are:
Var(α^(x,Y)) = σ² (1/n Σxi²) / ((n-1) σx²)
Var(β^(x,Y)) = σ² / ((n-1) σx²)
-where 1/n Σxi² is the mean of the squared x-values and σx² is the sample variance of x1,…,xn

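An illustrative R check, plugging the estimate σ^² into these formulas and comparing with the standard errors reported by summary() (the true σ² is unknown in practice):

set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)
n <- length(x)
m <- lm(y ~ x)
sigma2_hat <- summary(m)$sigma^2
var_alpha <- sigma2_hat * mean(x^2) / ((n - 1) * var(x))  # Var(α^) with σ² replaced by σ^²
var_beta  <- sigma2_hat / ((n - 1) * var(x))              # Var(β^) with σ² replaced by σ^²
sqrt(c(var_alpha, var_beta))
summary(m)$coefficients[, "Std. Error"]                   # should match
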
30
Q

Least Squares Regression

Variance of α^ & β^ Proof

A
31
Q

Least Squares Regression

Variance of α^ & β^ Interpretation

A

-once we have found estimates α^ and β^ for the parameters α and β, we can predict the y-value for any given x using:
y^ = α^ + β^x
-since the estimates α^ and β^ are affected by the noise in the observations, the estimated regression line will differ from the 'true' regression line:
y = α + βx
-but we expect the error in y^ to decrease with n, since the variances of α^ and β^ decrease with n, i.e. our estimates become more stable

32
Q

Least Squares Regression

Unbiased Estimators y^ Lemma

A
-let x* ∈ ℝ and:
y^* = α^ + β^x*
-then y^* is an unbiased estimator for the y-value of the unknown true regression line at the point x*, i.e.:
E(y^*) = α + βx*
-and
Var(y^*) = σ² (1/n + (x* - x̄)² / ((n-1) σx²))
-where x̄ is the sample mean of the xi
33
Q

Least Squares Regression

Unbiased Estimators y^ Proof

A
34
Q

How do you work with the fitted model in R?

A
  • residuals(m) returns the estimated residuals εi^ for each data point
  • fitted(m) returns the fitted values yi^ = α^ + β^xi
  • printing m to the screen shows the key values α^ and β^
  • summary(m) gives more statistical information about the model
  • the coefficients as a vector can be obtained using coef(m) and can then be assigned to variables alpha and beta (a minimal usage sketch follows below)
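A minimal usage sketch of these commands (simulated data for illustration):

set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)   # illustrative data
m <- lm(y ~ x)                 # fit the model
m                              # printing m shows α^ (intercept) and β^ (slope)
summary(m)                     # more detailed statistical output
cf <- coef(m)                  # coefficients as a vector
alpha <- cf[1]; beta <- cf[2]
head(fitted(m))                # fitted values yi^
head(residuals(m))             # estimated residuals εi^
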
35
Q

How to make predictions using a fitted model in R?

A

-one of the main aims of fitting a linear model is to make predictions for new, not previously observed, x-values, i.e. to compute:
ynew = α^ + β^xnew
-the command for this is predict(m, newdata=…)
-where m is the model previously fitted using lm() and newdata specifies the new x-values to predict responses for
-the argument newdata should be a data.frame with a column which has the name of the original variable and contains the new values, e.g.:
predict(m, newdata=data.frame(x=1))

36
Q

Alternative Regression Models

A
  • so far we have considered a regression line in the form y=α+βx to predict y from x
  • instead we could have used x=γ+𝛿y to predict x from y
  • regression for y as a function of x minimises the (average squared) length of the vertical lines from the points to the line
  • regression of x as a function of y minimises the (average squared) length of the horizontal lines from the points to the line
  • thus the two models are different
37
Q

Residuals in Alternative Regression Models

A

-in the model Yi = α + βxi + εi, the residuals εi can be seen as an error or uncertainty in the observations yi, whereas the values xi are assumed to be known exactly
-how would we construct a model where there is uncertainty about the values of both x and y?
-a simple model would be:
Xi = xi + ηi
Yi = α+βxi+εi
-for i=1,…,n, where ηi~N(0,ση²) and εi~N(0,σε²) independently
-models of this form are called ‘errors in variables models’

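A small simulation sketch of such an errors-in-variables model (the parameter values are illustrative assumptions):

set.seed(1)
n <- 100
x_true <- runif(n, 0, 10)       # unobserved true x-values
eta    <- rnorm(n, sd = 0.5)    # ηi: measurement error in x
eps    <- rnorm(n, sd = 1)      # εi: error in y
X <- x_true + eta               # Xi = xi + ηi (observed)
Y <- 2 + 0.5 * x_true + eps     # Yi = α + βxi + εi (observed)
plot(X, Y)
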
38
Q

How do you fit a linear regression model to data in R?

A

-use the lm() command, e.g. m <- lm(y ~ x)
-here y ~ x is a formula specifying that y should be modelled as a linear function of x, and the fitted model is stored in m