Important definitions

- Random variable
- Probability distribution

Mean of a probability distribution – Measure of the center of a distribution

Standard deviation of a probability distribution – Measure of the spread, or dispersion, of a distribution

Adding a constant to a random variable – Shifts the distribution of the random variable

Multiplying a random variable by a constant – Changes the dispersion of the distribution

Normal distribution – The normal distribution is completely characterized by two parameters

- Mean (μ) – Measure of location
- Standard deviation (σ) – Measure of dispersion
- Notation: *X* ~ *N*(μ, σ^{2})

Area under the curve represents probability

- Computing probabilities for a normal distribution is a two-step process
- Convert *X* ~ *N*(μ, σ^{2}) to *Z* ~ *N*(0, 1)
- Compute the probability for *Z*

Computing probability for *Z* ~ *N*(0,1) – Use tables

Converting *X* ~ *N*(μ, σ^{2}) to *Z* ~ *N*(0, 1) – Use *Z* = (*X* – μ)/σ

If *X* ~ N(μ, σ^{2}), compute the probability that *X* is within one standard deviation (σ) of μ, i.e., compute pr(μ – σ < X < μ + σ)

- This gives an intuitive feel for the meaning of σ in a normal distribution
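The within-one-σ calculation can be sketched in Python using only the standard library (`math.erf` gives the normal CDF); the μ and σ values below are illustrative:

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), computed from the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Probability that X falls within one standard deviation of mu.
mu, sigma = 0.016, 0.060  # illustrative values
p = normal_cdf(mu + sigma, mu, sigma) - normal_cdf(mu - sigma, mu, sigma)
print(round(p, 4))  # about 0.6827, regardless of mu and sigma
```

The answer is the same for every normal distribution, which is exactly the intuitive meaning of σ noted above.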

Formal statistical model for IBM data

- IBM in month *t* is an independent draw from a N(0.016, (0.060)^{2}) distribution
- Interpretation of “an independent draw from a N(0.016, (0.060)^{2}) distribution”

Check assumptions of the model

- Normality
- Histogram

- Independence
- Graphical test
- Runs test

Runs test

- Explanation of a run
- Number of runs we expect for
- A time series with a cyclical component
- An oscillating time series
- An independent time series
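A minimal sketch of the runs test, assuming the above-the-mean/below-the-mean version (the course may count runs relative to the median instead). Too many runs (large positive z) suggests oscillation; too few (large negative z) suggests a cyclical component; z near 0 is consistent with independence:

```python
from math import sqrt

def runs_test_z(series):
    """Standardized runs count: (observed - expected) / sd under independence."""
    m = sum(series) / len(series)
    signs = [x > m for x in series]                     # above/below the mean
    runs = 1 + sum(s != t for s, t in zip(signs, signs[1:]))
    n1 = sum(signs)
    n2 = len(signs) - n1
    n = n1 + n2
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return (runs - expected) / sqrt(variance)
```

An oscillating series such as `[1, -1, 1, -1, ...]` gives a large positive z, while a slow cycle such as `[1]*10 + [-1]*10` gives a large negative z.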

Model: *Change*_{t} iid N(-1.4, (33.39)^{2}) where *Change*_{t} is the growth rate from month *t*-1 to month *t*

- Check the assumptions of the model
- Normality – Use a histogram
- Independence – Graphical check and runs test

- New way of representing *Change*_{t}: *Change*_{t} = -1.4 + ε_{t} where ε_{t} iid N(0, (33.39)^{2})
- Interpretation in terms of adding a constant (-1.4) to a random variable (ε_{t})
- Idea of breaking a time series into two components: Pattern (predictable component) and noise (unpredictable component)

Estimating the population mean (μ) of a normal distribution

- MBA salary example – *X* represents MBA salaries
- Assume *X* ~ *N*(μ, (10,000)^{2})
- σ^{2} = (10,000)^{2} is assumed known for now
- Interpretation of μ
- Idea that
- Sample is representative of a population
- Sample mean (x̄) is representative of the population mean (μ)
- Sample mean (x̄) is a natural estimator of the population mean (μ)

Measuring the quality of an estimator

- How good an estimator is x̄ for μ?
- There is a chance we can get a sample that gives a sample average (x̄) that is far from the population average (μ)
- What is the probability that we get a sample that has a sample average (x̄) close to the population average (μ)?

Sampling distributions

- Intuitive properties of the sampling distribution of x̄
- Example using sample size *n* = 5
- Example using sample size *n* = 100
- Central limit theorem – backs up intuition regarding the sampling distribution for x̄

Usefulness of the sampling distribution of x̄ in determining the quality of x̄ as an estimator for μ

- What is the probability we draw a sample that gives an x̄ that is within $1,000 of μ?

Sampling distribution for x̄

- If *X* ~ *N*(μ, σ^{2}), then x̄ ~ *N*(μ, σ^{2}/*n*)
- Central limit theorem – supports intuition regarding the sampling distribution for x̄
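The result x̄ ~ *N*(μ, σ²/*n*) is easy to check by simulation with the standard library (the μ and σ below are illustrative):

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)
mu, sigma, n = 0.016, 0.060, 100

# Draw many samples of size n and record each sample mean x-bar.
xbars = [mean(random.gauss(mu, sigma) for _ in range(n)) for _ in range(2000)]

print(round(mean(xbars), 3))   # close to mu
print(round(stdev(xbars), 4))  # close to sigma / sqrt(n) = 0.006
```

The spread of the sample means shrinks with √*n*, which is why a larger sample gives a higher probability that x̄ lands close to μ.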

Estimating σ^{2} in the MBA salary example

- Sample is representative of a population
- Sample spread (s^{2}) is representative of the population spread (σ^{2})
- Sample variance (s^{2}) is a natural estimator of the population variance (σ^{2})

Estimating the regression line α + ß *Adv*

- Data points are close to the true, but unobservable, regression line α + ß*Adv*
- Find the line that is as close as possible to the data points
- Resulting line will be close to the true regression line α + ß*Adv*, and is therefore a good estimate of the true regression line

Finding the line that is as close as possible to the data points

- Make the distance from each point to the line (*e*_{t}) as small as possible
- Minimize the sum of squared distances, Σ*e*_{t}^{2}
- Formulas for the intercept and slope of the line that is as close as possible to the data points
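The formulas referred to above are the usual least-squares solutions; a sketch with generic data (not the Adv/Sales numbers from class):

```python
def least_squares(x, y):
    """Intercept a and slope b of the line minimizing the sum of
    squared vertical distances e_t from the points to the line."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    a = ybar - b * xbar
    return a, b

a, b = least_squares([1, 2, 3, 4], [5, 8, 11, 14])  # points on y = 2 + 3x
print(a, b)  # 2.0 3.0
```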

Statistical notation for the regression model: *Sales*_{t} = α + ß*Adv*_{t} + ε_{t}, ε_{t} iid N(0, σ_{ε}^{2})

Intuitive interpretation of *Sales*_{t} = α + ß*Adv*_{t} + ε_{t}

Interpreting and estimating σ^{2} in the model *IBM*_{t} = α + ß*NYSE*_{t} + ε_{t}, ε_{t} iid N(0, σ_{ε}^{2})

- *e* (distance between the data point and the estimated regression line) is the best available proxy for ε (distance between the data point and the true, unobservable, regression line)
- Spread of the *e*’s is a good measure of the spread of the ε’s
- s^{2} is a measure of the spread of the *e*’s, and is therefore a good estimate of σ^{2} (the spread of the ε’s)
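One way to turn the last bullet into code: fit the line, take the residuals *e*, and use s² = Σ*e*²/(n − 2) as the estimate of σ² (generic data here, not the IBM/NYSE series):

```python
def s_squared(x, y):
    """s^2 = (sum of squared residuals) / (n - 2): the spread of the e's,
    used as the estimate of the spread of the epsilon's."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    a = ybar - b * xbar
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

print(s_squared([1, 2, 3, 4], [5, 8, 11, 14]))  # 0.0: points sit exactly on the line
```

Dividing by n − 2 rather than n accounts for the two estimated coefficients a and b.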

Multiple regression

- Multiple regression uses same idea as simple linear regression discussed previously except there is more than one explanatory variable
- Savings and loan example: *PMarg*_{t} = α + ß_{1}*NetRev*_{t} + ß_{2}*NumOff*_{t} + ε_{t}, ε_{t} iid N(0, σ_{ε}^{2})
- Interpretation of coefficients ß_{1} and ß_{2}

Checking the assumptions of the regression model

- Why it is important to check the assumptions. If the assumptions are not satisfied we get
- Poor estimates
- Poor predictions
- Misleading confidence intervals for the predictions
- Misleading hypothesis testing results

Diagnostic test for the linearity assumption

- Look for a nonlinear pattern in the plot of *y* against *x*
- Look for a U-shaped pattern in the plot of *e* against the fitted values (Residual vs. Fitted plot)

Diagnostic test for the constant variance assumption

- Look for a funnel shape in the plot of *e* against the fitted values (Residual vs. Fitted plot)

Diagnostic test for the independence assumption

- Look for a pattern in the Time Series Plot of the *e*’s
- Use the runs test to determine if there is a pattern in the *e*’s

Diagnostic test for the normality assumption

- Look for a normal distribution shape in a histogram of the *e*’s

Measure of the explanatory power of the regression – *R*^{2}

Specification bias is an important problem to be aware of

- Specification bias occurs when an explanatory variable that should be included in the regression is left out
- The potential effect of specification bias is to provide poor estimates of the coefficients that are included in the model
- Diagnostic test
- Check for an estimated coefficient that does not make sense (wrong sign) in terms of the subject matter

Regression model modification if the linearity assumption is violated

- Use a parabola instead of a straight line to model the data

Sex discrimination lawsuit example to illustrate dummy variables

- Question of interest: Were male teachers systematically discriminated against?
- Answer this question graphically by looking at the plot of *Salary* against *Seniority*
- Answer this question statistically using dummy variables to give separate, but parallel, regression lines for males and females
- Answer this question statistically using the regression model *Salary*_{i} = α + ß_{1}*SEX*_{i} + ß_{2}*Seniority*_{i} + ε_{i}, where *SEX*_{i} is a dummy variable
- Interpret estimated regression model as two parallel lines
- Interpret true regression model as two parallel regression lines
- Females: *Salary*_{i} = α + ß_{2}*Seniority*_{i} + ε_{i}
- Males: *Salary*_{i} = (α + ß_{1}) + ß_{2}*Seniority*_{i} + ε_{i}
- Interpretation of points on the two parallel regression lines
- ß_{1} is the distance between the two lines
- Reanalyze the data without the outlier to determine if one data point is driving the result
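The parallel-lines interpretation in code; the coefficient values below are made up for illustration and are not estimates from the lawsuit data:

```python
# Hypothetical estimates for Salary_i = a + b1*SEX_i + b2*Seniority_i
A, B1, B2 = 36000.0, -2000.0, 750.0

def predicted_salary(sex, seniority):
    """Fitted salary; sex is the dummy variable (1 for one group, 0 for the other)."""
    return A + B1 * sex + B2 * seniority

# The vertical distance between the two lines equals b1 at every seniority level.
gap = predicted_salary(1, 10) - predicted_salary(0, 10)
print(gap)  # -2000.0
```

Because the slope B2 is shared, the gap between the lines is the same constant B1 no matter which seniority level you evaluate it at.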

Outliers

- Definition of an outlier
- An outlier is a point far from the rest of the data
- The *i*th point is an outlier if *e*_{i} is large
- Definition of a large *e*_{i}

Bond example: Is the rating of a bond (A, AA, or AAA) related to its yield?

- *YIELD*_{i} = α + ß_{1}*LTGOVT*_{i} + ß_{2}*ZAA*_{i} + ß_{3}*ZAAA*_{i} + ε_{i}
- Graphical interpretation of the regression model as three parallel lines
- Interpretation of the three parallel lines in the context of the bond data

Point prediction for *Sales* = α + ß*Adv*_{p} + ε for a specified value of *Adv*_{p}

- Break *Sales* into two components
- First component: α + ß*Adv*_{p}
- This is the point on the true regression line
- The best estimate, or prediction, of this point is the corresponding point on the estimated regression line, i.e., a + b*Adv*_{p}
- Second component: ε
- ε iid N(0, σ_{ε}^{2})
- Best prediction of ε is 0
- Therefore, the best prediction of *Sales* = α + ß*Adv*_{p} + ε is a + b*Adv*_{p} + 0 = a + b*Adv*_{p}

Confidence intervals

- Two components to the prediction error
- (α + ß*Adv*_{p}) – (a + b*Adv*_{p}) and ε
- Formula used by Minitab to compute the confidence interval takes both these components of the prediction error into account
- The interval given by Minitab under the heading “95% P.I.” is the 95% confidence interval for *Sales* given *Adv* = 15
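A sketch of how both components of the prediction error enter the interval, using the usual simple-regression prediction-interval formula with z ≈ 2 in place of the exact t cutoff (generic data; this is not Minitab output):

```python
from math import sqrt

def prediction_interval(x, y, x_p, z=2.0):
    """Approximate 95% prediction interval for a new y at x_p, combining the
    estimation error of the fitted line and the spread of the new epsilon."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    s2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    # 1 accounts for the new epsilon; 1/n and the last term for line estimation.
    half = z * sqrt(s2 * (1 + 1 / n + (x_p - xbar) ** 2 / sxx))
    point = a + b * x_p
    return point - half, point + half
```

Note the (x_p − x̄)² term: the interval widens as the prediction point moves away from the center of the data.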

Multiple dummy variables – Alternative approach to understanding multiple dummy variables

- Bond example: Is the rating of a bond (A, AA, or AAA) related to its yield?
- There are multiple (three) categories for these data: A, AA, and AAA
- One approach is to fit a different regression line for each category of data
- Problem: There is not enough data in each category to fit a separate regression line

- Compromise approach – fit three regression lines with different intercepts, but the same slope
- Advantage of this approach is that all the data are used to estimate the single slope while the data in each category are used to estimate the three individual intercepts

Time series models with trend, seasonal and cyclical components

- Example using sales of SPSS computer manuals
- Goal is to predict future sales given the pattern in the past history of sales
- Pattern consists of
- Trend component
- Seasonal component
- Short term correlation

Use *TIME* and multiple dummy variables to model trend and seasonality in sales

- Regression: *SALES*_{t} = α + ß_{1}*TIME*_{t} + ß_{2}*Q*1_{t} + ß_{3}*Q*2_{t} + ß_{4}*Q*3_{t} + ε_{t}
- Graphical interpretation
- Regression model gives four true regression lines
- Quarter 4 is the baseline because *Q*4 is the quarter not included in the regression

- Residuals from this model can be interpreted as “detrended” and “deseasonalized” sales
- Correlation apparent in “detrended” and “deseasonalized” sales
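Building the Q1–Q3 dummies for period *t*, with quarter 4 as the baseline (assuming period 1 falls in quarter 1):

```python
def quarter_dummies(t):
    """[Q1, Q2, Q3] for period t = 1, 2, ...; all three are zero in
    quarter 4, the baseline quarter left out of the regression."""
    q = (t - 1) % 4 + 1
    return [1 if q == k else 0 for k in (1, 2, 3)]

print(quarter_dummies(1))   # [1, 0, 0]
print(quarter_dummies(4))   # [0, 0, 0] -- baseline quarter
print(quarter_dummies(29))  # [1, 0, 0] -- period 29 is a quarter 1
```

Leaving out Q4 avoids perfect multicollinearity with the intercept: the intercept plays the role of the quarter 4 line.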

Modeling the short term correlation left after accounting for trend and seasonal components

- Time Series Plot of residuals from the regression *SALES*_{t} = α + ß_{1}*TIME*_{t} + ß_{2}*Q*1_{t} + ß_{3}*Q*2_{t} + ß_{4}*Q*3_{t} + ε_{t} shows they are negatively correlated (they oscillate around zero too fast)
- Runs test confirms this
- Put *SALES*_{t-1} into the regression to account for short term correlation
- New regression is *SALES*_{t} = α + ß_{1}*TIME*_{t} + ß_{2}*Q*1_{t} + ß_{3}*Q*2_{t} + ß_{4}*Q*3_{t} + ß_{5}*SALES*_{t-1} + ε_{t}
- Intuitive explanation why including *SALES*_{t-1} handles short term correlation

Prediction of *SALES* in Quarter 1, 1983 (period 29)

- Use *SALES*_{29} = α + ß_{1}*TIME*_{29} + ß_{2}*Q*1_{29} + ß_{3}*Q*2_{29} + ß_{4}*Q*3_{29} + ß_{5}*SALES*_{28} + ε_{29}

Consider the concept of point prediction, and the resulting confidence interval, in a simpler regression model *Sales* = α + ß*Adv*_{p} + ε for a specified value of *Adv*_{p}

Measuring the quality of the estimate of ß

- Sampling distribution for *b*, the estimator for ß: *b* ~ N(ß, σ_{b}^{2})
- Small σ_{b}^{2} implies *b* is a good estimator for ß
- Intuition underlying the sampling distribution for *b*
- Why the distribution is centered at ß
- Why σ_{b}^{2} depends on *n* (sample size)
- Why σ_{b}^{2} depends on σ_{ε}^{2}
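This intuition can be checked by simulation: refit the line on many samples drawn from the same model and watch the *b*'s cluster around ß (the parameter values below are made up):

```python
import random
from statistics import mean

random.seed(2)
alpha, beta, sigma_eps, n = 1.0, 0.5, 2.0, 50
x = [i / 10 for i in range(n)]            # fixed design points
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

bs = []
for _ in range(1000):
    y = [alpha + beta * xi + random.gauss(0, sigma_eps) for xi in x]
    ybar = sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    bs.append(b)

print(round(mean(bs), 2))  # clusters around beta; raising n or cutting
                           # sigma_eps tightens the cluster
```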

Measuring the quality of the estimate of ß – Use an analogy with measuring the quality of x̄ as an estimator of μ

- What is the probability that *b*, the estimator for ß, is close to ß?
- Graphical example of when a poor estimate of ß can occur

IBM/NYSE example – See Section 4 of the *Class Notes*

- What is the probability that we get a good estimate of the risk of the stock?

Sex discrimination example – See Section 7 of the *Class Notes*

- What is the probability that we get a good estimate of the difference between male and female salaries, when seniority is held constant?

Estimated regression line, and therefore the slope of the estimated line *b*, depends on the sample of points, which implies *b* is a random variable

- Sampling distribution for *b* is *b* ~ N(ß, σ_{b}^{2}) where σ_{b}^{2} = σ_{ε}^{2}/Σ(*x*_{i} – x̄)^{2}

Hypothesis testing in a regression context

- Test H_{0}: ß = 0 against H_{1}: ß = 1 in the Sales = α + ß Adv + ε example
- Interpreting the hypotheses
- Checking which hypothesis is true graphically

The intuitive decision rule is: Say ß = 1 if b (the estimate of ß) is much greater than 0

- How do we define “much greater than 0”?

Two types of errors

- Type I error: Say ß = 1 when ß = 0 (Reject H_{0} when H_{0} is true)
- Type II error: Say ß = 0 when ß = 1 (Accept H_{0} when H_{0} is false)

Computing critical values, i.e., defining “large”

- Choose critical value so pr(Type I error) = 0.05

In practice, we collect a sample, compute b, and then compare this value to the critical value. If it is greater than the critical value, then we reject H_{0}
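Computing that cutoff with the standard library, taking σ_{b} as known to match the normal-based setup above:

```python
from statistics import NormalDist

def critical_value(sigma_b, alpha=0.05):
    """Cutoff c such that pr(b > c) = alpha when H0 is true,
    i.e. when b ~ N(0, sigma_b^2)."""
    return NormalDist(0.0, sigma_b).inv_cdf(1 - alpha)

c = critical_value(0.2)
print(round(c, 3))  # 1.645 * sigma_b = 0.329
```

With α = 0.05 the cutoff is 1.645 standard deviations above zero, so only a b well clear of 0 leads us to reject H_{0}.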

Hypothesis testing in a regression context

- Purpose of hypothesis testing in the model Sales = α + ß Adv + ε
- Purpose of hypothesis testing in the model y = α + ß_{1}x_{1} + ß_{2}x_{2} + ß_{3}x_{3} + … + ß_{10}x_{10} + ε
- To determine which variables in the regression have explanatory power
- If it is determined that a variable does not have explanatory power then it should be removed from the regression (so information is not wasted estimating an unimportant coefficient)

Simple example of hypothesis testing in a regression context

- Test H_{0}: ß = 0 against H_{1}: ß = 1 in the Sales = α + ß Adv + ε example
- Interpreting the hypotheses
- Checking which hypothesis is true graphically

The intuitive decision rule is: Say ß = 1 if b (the estimate of ß) is large

- How do we define “large”?
- Define “large” to make the probability of an error small

Computing critical values, i.e., defining “large”

- Choose critical value so pr(Type I error) = 0.05
- This implies we require strong evidence before we are willing to say ß does not equal 0

Five steps of hypothesis testing

- State and interpret hypotheses
- Give the intuitive decision rule (IDR)
- Obtain the distribution of the test statistic and then compute the cutoff value
- State the decision rule
- Collect the data and make a decision

Test H_{0}: ß_{1} = 0 against H_{A}: ß_{1} < 0 in the sex discrimination regression:

Salary_{i} = α + ß_{1} SEX_{i} + ß_{2} Seniority_{i} + ε_{i}

- Follow five steps of hypothesis testing
- IDR: Reject H_{0} if b_{1} is much less than 0
- The standardized estimate b_{1}/s_{b1} has a t-distribution with n-3 degrees of freedom (Notation: b_{1}/s_{b1} ~ t_{n-3})

- has a t-distribution with n-3 degrees of freedom (Notation: ~ t
- Make a decision regarding whether or not male teachers were discriminated against

Two-sided hypothesis tests in the IBM regression: IBM_{t} = α + ß NYSE_{t} + ε_{t}

Test H_{0}: ß = 0 against H_{A}: ß ≠ 0

- Follow five steps of hypothesis testing
- IDR: Reject H_{0} if b is much greater than 0 or much less than 0

Use hypothesis testing to determine which variables belong in the regression:

y = α + ß_{1}x_{1} + ß_{2}x_{2}+ ß_{3}x_{3} + … + ß_{10}x_{10} + ε

- Example: Time series model for Sales of SPSS computer manuals

Multicollinearity in the model: y_{i} = α + ß_{1}x1_{i} + ß_{2}x2_{i} + ε_{i}

- Intuitive explanation of multicollinearity in two-variable case
- Intuitive explanation of why multicollinearity causes imprecise estimates of the ß’s (i.e., why estimated coefficients have large standard deviations)
- Problems caused by multicollinearity
