Statistics Outline

ABOUT THIS CONTENT

Class notes from my core MBA Statistics course.
Subject: Statistics

Important definitions

  • Random variable
  • Probability distribution

Mean of a probability distribution – Measure of the center of a distribution

Standard deviation of a probability distribution – Measure of the spread, or dispersion, of a distribution

Adding a constant to a random variable – Shifts the distribution of the random variable

Multiplying a random variable by a constant – Changes the dispersion of the distribution
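
A quick numerical check of the last two points, as a minimal Python sketch. The data values and the helper names shift and scale are made up for illustration and are not from the course.

```python
from statistics import mean, pstdev

# Made-up values standing in for draws of a random variable X.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

def shift(values, c):
    """Add the constant c to every value."""
    return [v + c for v in values]

def scale(values, c):
    """Multiply every value by the constant c."""
    return [v * c for v in values]

# Adding a constant moves the center but leaves the spread unchanged.
print(mean(x), pstdev(x))
print(mean(shift(x, 10.0)), pstdev(shift(x, 10.0)))

# Multiplying by c multiplies the standard deviation by |c|
# (the mean is multiplied by c as well).
print(mean(scale(x, 3.0)), pstdev(scale(x, 3.0)))
```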

Normal distribution – The normal distribution is completely characterized by two parameters

  • Mean (μ) – Measure of location
  • Standard deviation (σ) – Measure of dispersion
  • Notation: X ~ N(μ, σ²)

Area under the curve represents probability

  • Computing probabilities for a normal distribution is a two-step process
    • Convert X ~ N(μ, σ²) to Z ~ N(0,1)
    • Compute probability for Z

Computing probability for Z ~ N(0,1) – Use tables

Converting X ~ N(μ, σ²) to Z ~ N(0,1) – Use Z = (X - μ)/σ

If X ~ N(μ, σ²), compute the probability that X is within one standard deviation (σ) of μ, i.e., compute pr(μ - σ < X < μ + σ)

  • This gives an intuitive feel for the meaning of σ in a normal distribution
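
The two-step process can be sketched in Python, with the standard library's NormalDist standing in for a printed Z table. The function name prob_between and the values μ = 100 and σ = 15 are illustrative assumptions; the one-standard-deviation answer is the same for any μ and σ.

```python
from statistics import NormalDist

def prob_between(lo, hi, mu, sigma):
    """pr(lo < X < hi) for X ~ N(mu, sigma^2)."""
    z = NormalDist()             # Z ~ N(0, 1)
    # Step 1: convert the endpoints from the X scale to the Z scale.
    z_lo = (lo - mu) / sigma
    z_hi = (hi - mu) / sigma
    # Step 2: compute the probability for Z (area under the curve).
    return z.cdf(z_hi) - z.cdf(z_lo)

mu, sigma = 100.0, 15.0          # arbitrary illustrative values
p = prob_between(mu - sigma, mu + sigma, mu, sigma)
print(round(p, 4))               # roughly 0.68
```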

Formal statistical model for IBM data

  • IBM in month t is an independent draw from a N(0.016, (0.060)²) distribution
  • Interpretation of “an independent draw from a N(0.016, (0.060)²) distribution”

Check assumptions of the model

  • Normality
    • Histogram
  • Independence
    • Graphical test
    • Runs test

Runs test

  • Explanation of a run
  • Number of runs we expect for
    • A time series with a cyclical component
    • An oscillating time series
    • An independent time series
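
A rough sketch of the runs test in Python. It assumes a run is a maximal stretch of consecutive observations on the same side of the sample mean, and it uses the usual normal approximation for the number of runs under independence. The function name runs_test and the two toy series are illustrative, not course data.

```python
from math import sqrt
from statistics import mean

def runs_test(series):
    """Return (observed runs, expected runs under independence, z-score)."""
    center = mean(series)
    # True = above the mean, False = below; values equal to the mean are dropped.
    signs = [v > center for v in series if v != center]
    n_above = sum(signs)
    n_below = len(signs) - n_above
    if n_above == 0 or n_below == 0:
        raise ValueError("need observations on both sides of the mean")
    # A run starts at the first observation and at every change of side.
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    n = n_above + n_below
    expected = 2.0 * n_above * n_below / n + 1.0
    variance = (2.0 * n_above * n_below * (2.0 * n_above * n_below - n)
                / (n ** 2 * (n - 1)))
    z = (runs - expected) / sqrt(variance)
    return runs, expected, z

oscillating = [1, -1] * 10          # flips side every period
cyclical = [1] * 10 + [-1] * 10     # one long swing up, one long swing down
print(runs_test(oscillating))
print(runs_test(cyclical))
```

Far fewer runs than expected (a large negative z) points to a cyclical component, far more (a large positive z) points to oscillation, and a count near the expected value is consistent with independence.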

Model: Change_t ~ iid N(-1.4, (33.39)²), where Change_t is the growth rate from month t-1 to month t

  • Check the assumptions of the model
    • Normality – Use a histogram
    • Independence – Graphical check and runs test
  • New way of representing Change_t: Change_t = -1.4 + ε_t, where ε_t ~ iid N(0, (33.39)²)
    • Interpretation in terms of adding a constant (-1.4) to a random variable (ε_t)
  • Idea of breaking a time series into two components: pattern (predictable component) and noise (unpredictable component)

Estimating the population mean (μ) of a normal distribution

  • MBA salary example – X represents MBA salaries
    • Assume X ~ N(μ, (10,000)²)
    • σ² = (10,000)² is assumed known for now
  • Interpretation of μ
  • Idea that
    • Sample is representative of a population
    • Sample mean (X̄) is representative of the population mean (μ)
    • Sample mean (X̄) is a natural estimator of the population mean (μ)

Measuring the quality of an estimator

  • How good an estimator is X̄ for μ?
    • There is a chance we can get a sample that gives a sample average (X̄) that is far from the population average (μ)
    • What is the probability that we get a sample that has a sample average (X̄) close to the population average (μ)?

Sampling distributions

  • Intuitive properties of the sampling distribution of X̄
    • Example using sample size n = 5
    • Example using sample size n = 100
  • Central limit theorem – backs up intuition regarding sampling distribution for X̄

Usefulness of the sampling distribution of X̄ in determining quality of X̄ as an estimator for μ

  • What is the probability we draw a sample that gives an X̄ that is within $1,000 of μ?

Sampling distribution for X̄

  • If X ~ N(μ, σ²), then X̄ ~ N(μ, σ²/n)
  • Central limit theorem – supports intuition regarding the sampling distribution for X̄
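
The $1,000 question can be worked through numerically. This is a minimal Python sketch that uses σ = 10,000 from the MBA salary example and the two sample sizes mentioned earlier (n = 5 and n = 100); the function name prob_mean_within is an illustrative assumption.

```python
from math import sqrt
from statistics import NormalDist

def prob_mean_within(delta, sigma, n):
    """pr(|sample mean - mu| < delta) when the sample mean ~ N(mu, sigma^2 / n)."""
    sd_of_mean = sigma / sqrt(n)      # standard deviation of the sample mean
    z = delta / sd_of_mean            # delta measured in those units
    return 2.0 * NormalDist().cdf(z) - 1.0

for n in (5, 100):
    print(n, round(prob_mean_within(1_000, 10_000, n), 4))
```

With n = 100 the standard deviation of the sample mean is exactly $1,000, so the question reduces to the one-standard-deviation probability; with n = 5 the probability is much smaller.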

Estimating σ² in the MBA salary example

  • Sample is representative of a population
  • Sample spread (s²) is representative of the population spread (σ²)
  • Sample variance (s²) is a natural estimator of the population variance (σ²)

Estimating the regression line α + ß Adv

  • Data points are close to the true, but unobservable, regression line α + ß Adv
  • Find the line that is as close as possible to the data points
  • Resulting line will be close to the true regression line α + ß Adv, and is therefore a good estimate of the true regression line

Finding the line that is as close as possible to the data points

  • Make the distance from each point to the line (e_t) as small as possible
  • Minimize the sum of the squared distances, Σ e_t²
  • Formulas for the intercept and slope of the line that is as close as possible to the data points
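
The outline does not reproduce the formulas. The standard least-squares results are b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² for the slope and a = ȳ - b x̄ for the intercept, and the Python sketch below applies them. The function name fit_line and the advertising/sales numbers are made up for illustration.

```python
def fit_line(x, y):
    """Intercept a and slope b of the least-squares line y = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data: advertising spend and the resulting sales.
adv = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]
a, b = fit_line(adv, sales)
print(round(a, 2), round(b, 2))
```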

Statistical notation for the regression model: Sales_t = α + ß Adv_t + ε_t, ε_t ~ iid N(0, σ_ε²)
Intuitive interpretation of Sales_t = α + ß Adv_t + ε_t

Interpreting and estimating σ_ε² in the model IBM_t = α + ß NYSE_t + ε_t, ε_t ~ iid N(0, σ_ε²)

  • e (distance between the data point and the estimated regression line) is the best available proxy for ε (distance between the data point and the true, unobservable, regression line)
  • Spread of the e’s is a good measure of the spread of the ε’s
  • s² is a measure of the spread of the e’s, and is therefore a good estimate of σ_ε² (the spread of the ε’s)
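
A minimal Python sketch of this estimate, assuming the usual formula s² = Σ e² / (n - 2), where 2 is the number of estimated coefficients (intercept and slope). The divisor is the standard textbook choice rather than something stated in these notes, and the function name and numbers are hypothetical.

```python
def residual_variance(y, fitted, n_coefficients=2):
    """Estimate the variance of epsilon from the residuals e = y - fitted."""
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    sse = sum(e ** 2 for e in residuals)      # sum of squared residuals
    return sse / (len(y) - n_coefficients)

# Hypothetical observed values and points on an estimated regression line.
y = [3.1, 4.9, 7.2, 8.8, 11.0]
fitted = [3.06, 5.03, 7.00, 8.97, 10.94]
print(round(residual_variance(y, fitted), 4))
```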

Multiple regression

  • Multiple regression uses the same idea as the simple linear regression discussed previously, except that there is more than one explanatory variable
  • Savings and loan example
    • PMarg_t = α + ß1 NetRev_t + ß2 NumOff_t + ε_t, ε_t ~ iid N(0, σ²)
    • Interpretation of coefficients ß1 and ß2

Checking the assumptions of the regression model

  • Why it is important to check the assumptions. If the assumptions are not satisfied we get
    • Poor estimates
    • Poor predictions
    • Misleading confidence intervals for the predictions
    • Misleading hypothesis testing results

Diagnostic test for the linearity assumption

  • Look for a nonlinear pattern of y against x
  • Look for a U-shaped pattern in the plot of e against ŷ (Residual vs. Fitted plot)

Diagnostic test for the constant variance assumption

  • Look for a funnel shape in the plot of e against ŷ (Residual vs. Fitted plot)

Diagnostic test for the independence assumption

  • Look for a pattern in the Time Series Plot of the e’s
  • Use the runs test to determine if there is a pattern in the e’s

Diagnostic test for the normality assumption

  • Look for a normal distribution shape in a histogram of the e’s

Measure of the explanatory power of the regression

  • R²
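
For reference, R² can be computed as 1 minus the ratio of the unexplained variation (sum of squared residuals) to the total variation of y around its mean. The Python sketch below assumes that standard definition; the function name and numbers are hypothetical.

```python
def r_squared(y, fitted):
    """Share of the variation in y that the regression explains."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
    return 1.0 - sse / sst

# Hypothetical observed values and fitted values from a regression.
y = [3.1, 4.9, 7.2, 8.8, 11.0]
fitted = [3.06, 5.03, 7.00, 8.97, 10.94]
print(round(r_squared(y, fitted), 4))
```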

Specification bias is an important problem to be aware of

  • Specification bias occurs when an explanatory variable that should be included in the regression is left out
  • The potential effect of specification bias is to provide poor estimates of the coefficients that are included in the model
  • Diagnostic test
    • Check for an estimated coefficient that does not make sense (wrong sign) in terms of the subject matter

Regression model modification if the linearity assumption is violated

  • Use a parabola instead of a straight line to model the data

Sex discrimination lawsuit example to illustrate dummy variables

  • Question of interest: Were male teachers systematically discriminated against?
    • Answer this question graphically by looking at the plot of Salary against Seniority
    • Answer this question statistically using dummy variables to give separate, but parallel, regression lines for males and females
    • Answer this question statistically using the regression model
      Salary_i = α + ß1 SEX_i + ß2 Seniority_i + ε_i, where SEX_i is a dummy variable (1 for males, 0 for females)
  • Interpret estimated regression model as two parallel lines
  • Interpret true regression model as two parallel regression lines
    • Females: Salary_i = α + ß2 Seniority_i + ε_i
    • Males: Salary_i = (α + ß1) + ß2 Seniority_i + ε_i
    • Interpretation of points on the two parallel regression lines
    • ß1 is the distance between the two lines
  • Reanalyze the data without the outlier to determine if one data point is driving the result
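
To make the two parallel lines concrete, here is a small Python sketch that fits the dummy-variable regression by solving the least-squares normal equations. The data are invented and noise-free (generated from Salary = 20 - 3 SEX + 2 Seniority), so the fit should recover those numbers up to rounding; the helper name ols is an assumption and this is not the course's lawsuit data.

```python
def ols(rows, y):
    """Least-squares coefficients for y on the columns of rows."""
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Each row is (constant, SEX, Seniority); SEX = 1 for males, 0 for females.
rows = [[1, 0, 1], [1, 0, 3], [1, 0, 5], [1, 1, 2], [1, 1, 4], [1, 1, 6]]
salary = [22.0, 26.0, 30.0, 21.0, 25.0, 29.0]

alpha, b1, b2 = ols(rows, salary)
print("female line: intercept", round(alpha, 6), "slope", round(b2, 6))
print("male line:   intercept", round(alpha + b1, 6), "slope", round(b2, 6))
```

The coefficient on SEX is the vertical gap between the two lines at any level of seniority.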

Outliers

  • Definition of an outlier
    • An outlier is a point far from the rest of the data
    • The ith point is an outlier if e_i is large
    • Definition of a large e_i

Bond example: Is the rating of a bond (A, AA, or AAA) related to its yield?

  • YIELD_i = α + ß1 LTGOVT_i + ß2 ZAA_i + ß3 ZAAA_i + ε_i
  • Graphical interpretation of the regression model as three parallel lines
  • Interpretation of the three parallel lines in the context of the bond data

Point prediction for Sales = α + ß Adv_p + ε for a specified value of Adv_p

  • Break Sales into two components
  • First component: α + ß Adv_p
    • This is the point on the true regression line
    • The best estimate, or prediction, of this point is the corresponding point on the estimated regression line, i.e., a + b Adv_p
  • Second component: ε
    • ε ~ iid N(0, σ²)
    • Best prediction of ε is 0
  • Therefore, the best prediction of Sales = α + ß Adv_p + ε is a + b Adv_p + 0 = a + b Adv_p

Confidence intervals

  • Two components to the prediction error
    • (α + ß Adv_p) - (a + b Adv_p) and ε
  • Formula used by Minitab to compute the confidence interval takes both these components of the prediction error into account
  • The interval given by Minitab under the heading “95% P.I.” is the 95% confidence interval for Sales given Adv = 15
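
A sketch of how the two error components enter the interval, using the standard textbook formula for simple regression: a + b Adv_p ± t × s × sqrt(1 + 1/n + (Adv_p - mean of Adv)² / Σ(Adv - mean of Adv)²). The 1 under the root reflects ε, and the remaining terms reflect the error in the estimated line. Whether Minitab uses exactly this expression is an assumption here. The data are invented, and the cutoff t_crit = 3.182 (the 97.5th percentile of a t-distribution with 3 degrees of freedom) is read from a t table rather than computed.

```python
from math import sqrt

def prediction_interval(x, y, x_p, t_crit):
    """Return (point prediction, lower limit, upper limit) at x_p."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    # s estimates the standard deviation of epsilon from the residuals.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))
    point = a + b * x_p
    half_width = t_crit * s * sqrt(1.0 + 1.0 / n + (x_p - x_bar) ** 2 / sxx)
    return point, point - half_width, point + half_width

# Hypothetical advertising and sales figures.
adv = [10.0, 12.0, 14.0, 16.0, 18.0]
sales = [25.0, 29.5, 33.0, 38.5, 41.0]
print(prediction_interval(adv, sales, 15.0, 3.182))
```

The interval widens as Adv_p moves away from the average advertising level in the sample, because the estimated line is less reliable there.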

Multiple dummy variables – Alternative approach to understanding multiple dummy variables

  • Bond example: Is the rating of a bond (A, AA, or AAA) related to its yield?
  • There are multiple (three) categories for these data: A, AA, and AAA
  • One approach is to fit a different regression line for each category of data
    • Problem: There is not enough data in each category to fit a separate regression line
  • Compromise approach – fit three regression lines with different intercepts but the same slope.
    • Advantage of this approach is that all the data are used to estimate the single slope while the data in each category are used to estimate the three individual intercepts

Time series models with trend, seasonal and cyclical components

  • Example using sales of SPSS computer manuals
  • Goal is to predict future sales given the pattern in the past history of sales
  • Pattern consists of
    • Trend component
    • Seasonal component
    • Short term correlation

Use TIME and multiple dummy variables to model trend and seasonality in sales

  • Regression: SALES_t = α + ß1 TIME_t + ß2 Q1_t + ß3 Q2_t + ß4 Q3_t + ε_t
  • Graphical interpretation
    • Regression model gives four true regression lines
    • Quarter 4 is the baseline because Q4 is the quarter not included in the regression
  • Residuals from this model can be interpreted as “detrended” and “deseasonalized” sales
  • Correlation apparent in “detrended” and “deseasonalized” sales
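
A small Python sketch of how the explanatory variables for this regression could be laid out. It assumes period 1 falls in a first quarter, which is consistent with the notes treating period 29 as Quarter 1; the function name design_row is illustrative.

```python
def design_row(t):
    """Regressors (constant, TIME, Q1, Q2, Q3) for period t."""
    quarter = (t - 1) % 4 + 1
    # Quarter 4 gets no dummy, so it is the baseline: all three dummies are 0.
    return [1, t, int(quarter == 1), int(quarter == 2), int(quarter == 3)]

for t in (1, 2, 3, 4, 29):
    print(t, design_row(t))
```

Each quarter's line has the same slope on TIME; the dummy that is switched on shifts the intercept for that quarter relative to Quarter 4.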

Modeling the short term correlation left after accounting for trend and seasonal components

  • Time Series Plot of the residuals from the regression SALES_t = α + ß1 TIME_t + ß2 Q1_t + ß3 Q2_t + ß4 Q3_t + ε_t shows that they are negatively correlated (they oscillate around zero too fast)
    • Runs test confirms this
  • Put SALES_{t-1} into the regression to account for short term correlation
    • New regression is SALES_t = α + ß1 TIME_t + ß2 Q1_t + ß3 Q2_t + ß4 Q3_t + ß5 SALES_{t-1} + ε_t
  • Intuitive explanation why including SALES_{t-1} handles short term correlation

Prediction of SALES in Quarter 1, 1983 (period 29)

  • Use SALES_29 = α + ß1 TIME_29 + ß2 Q1_29 + ß3 Q2_29 + ß4 Q3_29 + ß5 SALES_28 + ε_29

Consider the concept of point prediction, and the resulting confidence interval, in a simpler regression model Sales = α + ß Adv_p + ε for a specified value of Adv_p

Measuring the quality of the estimate of ß

  • Sampling distribution for b, the estimator for ß
  • b ~ N(ß, σ_b²)
    • Small σ_b² implies b is a good estimator for ß
    • Intuition underlying sampling distribution for b
    • Why the distribution is centered at ß
    • Why σ_b² depends on n (sample size)
    • Why σ_b² depends on σ_ε²

Measuring the quality of the estimate of ß – Use an analogy with measuring the quality of X̄ as an estimator of μ

  • What is the probability that b, the estimator for ß, is close to ß?
  • Graphical example of when a poor estimate of ß can occur

IBM/NYSE example – See Section 4 of the Class Notes

  • What is the probability that we get a good estimate of the risk of the stock?

Sex discrimination example – See Section 7 of the Class Notes

  • What is the probability that we get a good estimate of difference between male and female salaries, when seniority is held constant?

Estimated regression line, and therefore the slope of the estimated line b, depends on the sample of points, which implies b is a random variable

  • Sampling distribution for b is b ~ N(ß, σ_b²), where σ_b² depends on the sample size n and on σ_ε²

 
Hypothesis testing in a regression context

  • Test H0: ß = 0 against H1: ß = 1 in the Sales = α + ß Adv + ε example
  • Interpreting the hypotheses
  • Checking which hypothesis is true graphically

The intuitive decision rule is: Say ß = 1 if b (the estimate of ß) is much greater than 0

  • How do we define “much greater than 0”?

Two types of errors

  • Type I error:   Say ß = 1 when ß = 0
                           Reject H0 when H0 is true
  • Type II error:  Say ß = 0 when ß = 1
                           Accept H0 when H0 is false

Computing critical values, i.e., defining “large”

  • Choose critical value so pr(Type I error) = 0.05

In practice, we collect a sample, compute b, and then compare this value to the critical value. If it is greater than the critical value, then we reject H0
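
A minimal Python sketch of the critical-value calculation, assuming b ~ N(ß, σ_b²) with σ_b known. The value σ_b = 0.3 and the function names are hypothetical, chosen only to show the mechanics.

```python
from statistics import NormalDist

def critical_value(sigma_b, alpha=0.05):
    """Cutoff c with pr(b > c) = alpha when H0 (beta = 0) is true."""
    return NormalDist(0.0, sigma_b).inv_cdf(1.0 - alpha)

def type_ii_probability(sigma_b, alpha=0.05, beta_alt=1.0):
    """pr(b <= c) when H1 (beta = beta_alt) is true, i.e. H0 is not rejected."""
    c = critical_value(sigma_b, alpha)
    return NormalDist(beta_alt, sigma_b).cdf(c)

sigma_b = 0.3                               # hypothetical value
c = critical_value(sigma_b)
print(round(c, 4))                          # reject H0 if the computed b exceeds c
print(round(type_ii_probability(sigma_b), 4))
```

Fixing pr(Type I error) at 0.05 pins down the cutoff; the probability of a Type II error then follows from how far apart the two hypothesized values of ß are relative to σ_b.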

Hypothesis testing in a regression context

  • Purpose of hypothesis testing in the model Sales = α + ß Adv + ε
  • Purpose of hypothesis testing in the model y = α + ß1 x1 + ß2 x2 + ß3 x3 + … + ß10 x10 + ε
    • To determine which variables in the regression have explanatory power
    • If it is determined that a variable does not have explanatory power then it should be removed from the regression (so information is not wasted estimating an unimportant coefficient)

Simple example of hypothesis testing in a regression context

  • Test H0: ß = 0 against H1: ß = 1 in Sales = α + ß Adv + ε example
  • Interpreting the hypotheses
  • Checking which hypothesis is true graphically

The intuitive decision rule is: Say ß = 1 if b (the estimate of ß) is large

  • How do we define “large”?
  • Define “large” to make the probability of an error small

Computing critical values, i.e., defining “large”

  • Choose critical value so pr(Type I error) = 0.05
    • This implies we require strong evidence before we are willing to say ß does not equal 0

Five steps of hypothesis testing

  1. State and interpret hypotheses
  2. Give the intuitive decision rule (IDR)
  3. Obtain the distribution of the test statistic and then compute the cutoff value
  4. State the decision rule
  5. Collect the data and make a decision

Test H0: ß1 = 0 against HA: ß1 < 0 in the sex discrimination regression:
           Salary_i = α + ß1 SEX_i + ß2 Seniority_i + ε_i

  • Follow five steps of hypothesis testing
  • IDR: Reject H0 if b1 (the estimate of ß1) is much less than 0
    • b1 divided by its estimated standard deviation has a t-distribution with n-3 degrees of freedom when H0 is true (Notation: b1/s_b1 ~ t_{n-3})
  • Make a decision regarding whether or not male teachers were discriminated against

Two-sided hypothesis tests in the IBM regression: IBM_t = α + ß NYSE_t + ε_t
Test H0: ß = 0 against HA: ß ≠ 0

  • Follow five steps of hypothesis testing
  • IDR: Reject H0 if b (the estimate of ß) is much greater than 0 or much less than 0
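
The decision rules for the one-sided and two-sided tests can be sketched in Python as follows. The sketch assumes the test statistic is the estimate divided by its estimated standard deviation and that the cutoff t_crit is looked up in a t table by the user; the function names and the numbers in the example are hypothetical.

```python
def t_statistic(b, s_b, beta_null=0.0):
    """Distance of the estimate from the H0 value, in estimated standard deviations."""
    return (b - beta_null) / s_b

def reject_lower_tail(b, s_b, t_crit):
    """One-sided test against HA: beta < 0 (reject if b is much less than 0)."""
    return t_statistic(b, s_b) < -t_crit

def reject_two_sided(b, s_b, t_crit):
    """Two-sided test against HA: beta != 0 (reject if b is far from 0 either way)."""
    return abs(t_statistic(b, s_b)) > t_crit

# Hypothetical estimate, standard deviation, and table cutoff.
print(t_statistic(1.2, 0.4))
print(reject_two_sided(1.2, 0.4, 2.0))
print(reject_lower_tail(1.2, 0.4, 1.7))
```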

Use hypothesis testing to determine which variables belong in the regression:
      y = α + ß1 x1 + ß2 x2 + ß3 x3 + … + ß10 x10 + ε

  • Example: Time series model for Sales of SPSS computer manuals

Multicollinearity in the model: y_i = α + ß1 x_1i + ß2 x_2i + ε_i

  • Intuitive explanation of multicollinearity in two-variable case
  • Intuitive explanation of why multicollinearity causes imprecise estimates of the ß’s (i.e., why estimated coefficients have large standard deviations)
  • Problems caused by multicollinearity
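
One way to quantify the imprecision in the two-variable case is the standard result that the variance of each estimated coefficient is multiplied by 1 / (1 - r²), where r is the correlation between x1 and x2. The Python sketch below computes that factor; the function names and data are invented for illustration.

```python
from math import sqrt

def correlation(u, v):
    """Sample correlation between two equal-length lists."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / sqrt(suu * svv)

def variance_inflation(x1, x2):
    """Factor by which correlation between x1 and x2 inflates Var(b1) and Var(b2)."""
    r = correlation(x1, x2)
    return 1.0 / (1.0 - r * r)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_unrelated = [2.0, 1.0, 3.0, 1.0, 2.0]      # uncorrelated with x1
x2_collinear = [1.1, 1.9, 3.2, 3.9, 5.1]      # nearly a copy of x1
print(round(variance_inflation(x1, x2_unrelated), 2))
print(round(variance_inflation(x1, x2_collinear), 2))
```

When the two explanatory variables move almost together, the data carry little information about their separate effects, which shows up as a very large inflation factor.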