Correlation vs Causality in Linear Regression Analysis

Chapter 6

© 2019 McGraw-Hill Education. All rights reserved. Authorized only for instructor use in the classroom. No reproduction or distribution without the prior written consent of McGraw-Hill Education

Learning Objectives

Differentiate between correlation and causality in general and in the regression environment

Calculate partial and semi-partial correlation

Execute inference for correlational regression analysis

Execute passive prediction using regression analysis

Execute inference for determining functions

Execute active prediction using regression analysis

Distinguish the relevance of model fit between active and passive prediction


The Difference Between Correlation and Causality

Yi = fi(X1i, X2i, …, XKi) + Ui

We define fi(X1i, X2i, …, XKi) as the determining function, since it comprises the part of the outcome that we can explicitly determine

Ui can only be inferred by computing Ui = Yi – fi(X1i, X2i, …, XKi)

Data-generating process as a framework for modeling causality

The reasoning established to measure an average treatment effect using sample means easily maps to this framework

Easily extends into modeling causality for multi-level treatments and multiple treatments


A causal relationship between two variables clearly implies co-movement.

If X causally impacts Y, then when X changes, we expect a change in Y

However, variables often move together even when there is no causal relationship between them

For example, consider the heights of two different children, ages 5 and 10. Since both children are growing during these ages, their heights will generally move together. This co-movement is not due to causality: an increase in one child's height does not change the height of the other.

The Difference Between Correlation and Causality


Measurement of the co-movement between two variables in a dataset is captured through sample covariance or correlation:

Covariance: sCov(X,Y) = (1/(n − 1)) Σi (Xi − X̄)(Yi − Ȳ)

Correlation: sCorr(X,Y) = sCov(X,Y) / (sX × sY)
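As a quick check of these formulas, the sample covariance and correlation can be computed directly; the data below are made-up illustrative values, not from the text:

```python
import numpy as np

# Toy data: two variables that co-move (hypothetical values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
# Sample covariance: average cross-product of deviations (n - 1 denominator).
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
# Sample correlation: covariance scaled by the two sample standard deviations.
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(round(cov_xy, 4), round(corr_xy, 4))
```

The results match NumPy's built-in `np.cov` and `np.corrcoef`, which use the same sample (n − 1) conventions.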

The Difference Between Correlation and Causality


When there are more than two variables, e.g., Y, X1, X2, we can also measure partial correlation between two of the variables

Partial correlation between two variables is their correlation after holding one or more other variables fixed

The Difference Between Correlation and Causality


Causality implies that a change in one variable or variables causes a change in another

Data analysis attempting to measure causality generally involves an attempt to measure the determining function within the data-generating process

Correlation implies that variables move together

Data analysis attempting to measure correlation is not concerned with the data-generating process or determining function; it uses standard statistical formulas (sample correlation, partial correlation) to assess how variables move together

The Difference Between Correlation and Causality


The dataset is a cross-section of 230 grocery stores

AvgPrice = Average Price

AvgHHSize = Average Size of Households of Customers at that Grocery Store.

Regression Analysis for Correlation


Sales = b + m1AvgPrice + m2AvgHHSize

Solving for b, m1, and m2:

Sales = 1591.54 – 181.66 × AvgPrice + 128.09 × AvgHHSize

This equation provides information about how the variables in the equation are correlated within our sample.
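The solved equation above comes from the textbook's 230-store dataset, which is not reproduced here. The sketch below uses simulated stand-in data (the coefficients and noise level are assumptions) purely to show the mechanics of solving for b, m1, and m2 by least squares:

```python
import numpy as np

# Stand-in data for 230 stores; parameters chosen near the slides' equation.
rng = np.random.default_rng(42)
n = 230
avg_price = rng.uniform(1.0, 5.0, size=n)
avg_hh_size = rng.uniform(1.5, 4.5, size=n)
sales = 1600 - 180 * avg_price + 130 * avg_hh_size + rng.normal(0, 50, size=n)

# Design matrix with a column of ones for the intercept b.
X = np.column_stack([np.ones(n), avg_price, avg_hh_size])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
b, m1, m2 = coef
print(f"Sales = {b:.2f} + {m1:.2f}*AvgPrice + {m2:.2f}*AvgHHSize")
```

With real data, the same `lstsq` call recovers the intercept and slopes that solve the sample moment equations.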

Regression Analysis for Correlation


Unconditional correlation is the standard measure of correlation between two variables X and Y

Corr(X,Y) = Cov(X,Y) / (sX × sY)

sX = Sample standard deviation for X and

sY = Sample standard deviation for Y

Partial correlation between X and Y is a measure of the relationship between these two variables, holding at least one other variable fixed

Semi-partial correlation between X and Y is a measure of the relationship between these two variables, holding at least one other variable fixed for only one of X or Y
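A minimal sketch of both measures using their residual-based definitions, on simulated data (the variable names and parameters are assumptions, not from the text). The partial correlation purges the control variable from both Y and X1; the semi-partial purges it from X1 only:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)            # X1 correlated with X2
y = 2.0 + 1.5 * x1 - 1.0 * x2 + rng.normal(size=n)

def residuals(v, w):
    """Residuals from a simple regression of v on w (with intercept)."""
    W = np.column_stack([np.ones_like(w), w])
    beta, *_ = np.linalg.lstsq(W, v, rcond=None)
    return v - W @ beta

r_y = residuals(y, x2)
r_x1 = residuals(x1, x2)

# Partial correlation: both Y and X1 purged of X2.
partial = np.corrcoef(r_y, r_x1)[0, 1]
# Semi-partial correlation: only X1 purged of X2; Y left as-is.
semi_partial = np.corrcoef(y, r_x1)[0, 1]
print(round(partial, 3), round(semi_partial, 3))
```

Because the semi-partial leaves the variation of X2 inside Y, its magnitude never exceeds that of the partial correlation.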

Different Ways to Measure Correlation Between Two Variables


For the general regression equation Y = b + m1X1 + … + mKXK, the solutions for m1 through mK when solving the sample moment equations are proportional to the partial and semi-partial correlations between Y and the respective Xs

Regression Analysis for Correlation


Suppose we had the data for the entire population of grocery stores. Then we would have:

Sales = B + M1AvgPrice + M2AvgHHSize

Capital letters are used to indicate that these are the intercept and slopes for the population, rather than the sample

Solve for B, M1, and M2 by solving the sample moment equations using the entire population of data

Regression and Population Correlation


Regression and Population Correlation

We do not have data for the entire population, only a sample dataset drawn from it, whose regression line is:

Sales = b + m1AvgPrice + m2AvgHHSize

Solve for b, m1 and m2

The intercept and slope(s) of the regression equation describing a sample are estimators for the intercept and slope(s) of the corresponding regression equation describing the population.


A consistent estimator is an estimator whose realized value gets close to its corresponding population parameter as the sample size gets large.
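Consistency can be illustrated by drawing samples of increasing size from a simulated "population" (the population slope of 2.0 here is an assumption for this sketch) and watching the sample slope settle near the population value:

```python
import numpy as np

# A large simulated array stands in for the "population".
rng = np.random.default_rng(1)
pop_x = rng.normal(size=100_000)
pop_y = 3.0 + 2.0 * pop_x + rng.normal(size=100_000)

def sample_slope(n):
    """Slope of the sample regression line from a random sample of size n."""
    idx = rng.choice(len(pop_x), size=n, replace=False)
    x, y = pop_x[idx], pop_y[idx]
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

for n in (10, 100, 10_000):
    print(n, round(sample_slope(n), 3))
```

Small samples scatter widely around the population slope; by n = 10,000 the estimate is reliably close to 2.0, mirroring the three-sample figures that follow.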

Regression and Population Correlation


Regression Line for Full Population


Regression Lines for Three Samples of Size 10


Regression Lines for Three Samples of Size 30


In order to conduct hypothesis tests or build confidence intervals for the population parameters of a regression equation, we need to know the distribution of the estimators

Each estimator becomes very close to its corresponding population parameter for a large sample

For a large sample, these estimators are normally distributed

Confidence Interval and Hypothesis Testing for the Population Parameters


A large random sample implies that:

b ~ N(B, σb)

m1 ~ N(M1, σm1)

…

mK ~ N(MK, σmK)

If we write each element in the population as:

Yi = B + M1X1i + … + MKXKi + Ei

where Ei is the residual, then Var(Y|X) is equal to Var(E|X)

A common assumption is that this variance is constant across all values of X, so Var(Y|X) = Var(E|X) = Var(E) = σ²

This constancy of variance is called homoscedasticity
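One way to see the large-sample normality claim is to simulate many samples from a homoscedastic data-generating process (all parameters here are assumptions for illustration) and inspect the sampling distribution of the slope estimator:

```python
import numpy as np

rng = np.random.default_rng(5)

def one_slope(n=200):
    """Slope estimate from one simulated sample with constant error variance."""
    x = rng.normal(size=n)
    y = 1.0 + 0.8 * x + rng.normal(size=n)   # Var(E) does not depend on x
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Repeat the sampling experiment many times.
slopes = np.array([one_slope() for _ in range(2000)])
print(round(slopes.mean(), 2), round(slopes.std(), 3))
```

The slope estimates center on the population value 0.8 with a small, roughly bell-shaped spread, consistent with m1 ~ N(M1, σm1) for large samples.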

Confidence Interval and Hypothesis Testing for the Population Parameters


Sales = 1591.54 – 181.66 × AvgPrice + 128.09 × AvgHHSize

If Store A has an average price of $0.50 higher than Store B, and Store A has an average household size that is 0.40 less than Store B, then:

Predicted difference = −181.66 × 0.50 + 128.09 × (−0.40) ≈ −142

We predict Store A has about 142 fewer sales than Store B

When using correlational regression analysis to make predictions, we must be considering a population that spans time, and we assume that the population regression equation also describes the future population
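The arithmetic of the prediction above can be checked directly using the sample slopes from the equation:

```python
# Slopes from the sample regression equation in the slides.
m1, m2 = -181.66, 128.09

# Store A vs Store B: $0.50 higher average price, 0.40 smaller household size.
diff_price, diff_hh = 0.50, -0.40
predicted_diff = m1 * diff_price + m2 * diff_hh
print(round(predicted_diff, 2))   # -142.07
```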

Prediction Using Regression


Regression and Causation

Data-generating process of an outcome Y can be written as:

Yi = fi(X1i, X2i, …, XKi) + Ui

We assume the determining function can be written as:

fi(X1i, X2i, …, XKi) = α + β1X1i + β2X2i + … + βKXKi

Combining these assumptions into a single assumption, the data-generating process can be written as:

Yi = α + β1X1i + β2X2i + … + βKXKi + Ui

Error term represents unobserved factors that determine the outcome


Regression and Causation

Yi = B + M1X1i + … + MKXKi + Ei (Correlation model)

Yi = α + β1X1i + … + βKXKi + Ui (Causality model)

Correlational model residuals (Ei) have a mean of zero and are uncorrelated with each of the Xs. For this model, we simply take all the data points in the population and write each observation in terms of the equation that best describes those points.

For the causality model, the data-generating process is the process that actually generates the data we observe, and the determining function need not be the equation that best describes the data.


Consider data for Y, X, and U covering the entire population:

These data were generated using the data-generating process: Yi = 5 + 3.2Xi + Ui

meaning we have a determining function: f(X) = 5 + 3.2X
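This example can be simulated. Since the slides' data table is not reproduced here, we assume U is correlated with X, which is exactly the situation in which the population regression line differs from the determining function f(X) = 5 + 3.2X:

```python
import numpy as np

# Simulate a "population" from the DGP Y = 5 + 3.2X + U, with U correlated
# with X (an assumption for this sketch; the slides' actual data may differ).
rng = np.random.default_rng(7)
n = 200_000
x = rng.normal(size=n)
u = 1.0 * x + rng.normal(size=n)        # error term correlated with X
y = 5 + 3.2 * x + u

# Population regression line (best linear description of the points).
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(round(intercept, 2), round(slope, 2))  # slope ≈ 4.2, not the causal 3.2
```

The regression slope absorbs the co-movement of U with X, so the red regression line and the blue determining function in the next figure diverge.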

The Difference Between the Correlation Model and the Causality Model: An Example


Scatterplot, Regression Line, and Determining Function of X and Y

In this figure, we plot Y and X along with the determining function (blue line) and the population regression equation (red line).


Regression and Causation

The correlation model describes the data best but need not coincide with the causal mechanism generating the data

The causality model provides the causal mechanism but need not describe the data best


The Relevance of Model Fit for Passive and Active Prediction

Total sum of squares (TSS): The sum of the squared differences between each observation of Y and the average value of Y

TSS = Σi (Yi − Ȳ)²

Sum of squared residuals (SSRes): The sum of the squared residuals

SSRes = Σi (Yi − Ŷi)²

R-squared: The fraction of the total variance in Y that can be attributed to variation in the Xs

R² = 1 − SSRes/TSS
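These three quantities can be computed in a few lines; the data here are simulated for illustration (parameters assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)

# Fit a simple regression, then decompose the variation in Y.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

tss = np.sum((y - y.mean()) ** 2)          # total sum of squares
ss_res = np.sum((y - y_hat) ** 2)          # sum of squared residuals
r_squared = 1 - ss_res / tss
print(round(r_squared, 3))
```

Because least squares minimizes SSRes, R² lies between 0 and 1, and larger values mean the fitted values track Y more closely.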


The Relevance of Model Fit for Passive and Active Prediction

A high R-squared implies a good fit, meaning the points on the regression equation tend to be close to the actual Y values

R-squared for passive prediction (correlation): Finding a high R-squared implies the prediction is close to reality

R-squared for active prediction (causality): R-squared is not a primary consideration when evaluating predictions
